The training data for Physical Intelligence's new robot model contained exactly two relevant clips involving an air fryer: one where a different robot pushed the appliance closed, and one where a robot placed a plastic bottle inside it. Neither clip involved cooking anything. When researchers gave the model the instruction "cook a sweet potato," it failed. Then they walked it through the steps in plain language, and it worked.

That jump, from 5% to 95% success, is the story. But it isn't the whole story.

What π0.7 actually is

Physical Intelligence, a two-year-old San Francisco startup that has raised over $1 billion and was last formally valued at $5.6 billion, published research on April 16, 2026 introducing π0.7, a generalist robot model. The core claim is what the company calls compositional generalization: the ability to stitch together fragments of unrelated training data, combine them with broad web-scale pretraining, and arrive at a functional understanding of something the model has essentially never encountered.

The air fryer is the clearest demonstration of what that means in practice. The model had no direct training on cooking with one. It had tangential footage (a closing motion and an object placement) and whatever it had absorbed from the wider world during pretraining. Given the task with no coaching, it failed. Given step-by-step verbal instructions, it succeeded at a rate that jumped from 5% to 95% after researchers spent roughly half an hour refining how the task was described.

The quieter result

The air fryer makes a good headline. The more significant finding is something else.

A single π0.7 model, with no task-specific fine-tuning, matches the performance of Physical Intelligence's previous specialist models across laundry folding, espresso preparation, and box assembly. Those specialists, called π*0.6, were trained using a dedicated reinforcement learning algorithm called RECAP, designed specifically to maximize performance on individual tasks. Each one was hand-tuned. The generalist matched them all without any of that.

The gap between generalist and specialist is now zero. That's a threshold the robotics field has been chasing for years, and most researchers expected it to take several more model generations to close.

The threshold language

Co-founder Sergey Levine, who is also a UC Berkeley professor focused on AI for robotics, draws an explicit comparison to what happened with large language models:

"Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways, the capabilities are going up more than linearly with the amount of data."

That phrasing, "more than linearly," is the part worth paying attention to. It's the same pattern language model researchers used when scaling started compounding in unexpected ways. The claim is that the returns on data are no longer proportional. More data produces disproportionately more capability. If that's true in robotics, the field is at an earlier version of a curve people already know the shape of.

Research scientist Ashwin Balakrishna, who helped build the model, described the shift in personal terms: "My experience has always been that when I deeply know what's in the data, I can kind of just guess what the model will be able to do. I'm rarely surprised. But the last few months have been the first time where I'm genuinely surprised."

He bought a random gear set, asked the robot to rotate a gear, and it worked. Nobody trained it on that.

What's still fragile

The researchers are careful not to oversell it, and the honesty is useful.

Prompting has to be precise. The wrong wording on a task instruction doesn't just reduce performance; it can drop success rates back toward the floor. The knowledge inside the model is diffuse, assembled from fragments nobody can fully trace, and nobody on the team can predict with confidence where it will hold and where it will quietly break.

Lucy Shi, a Physical Intelligence researcher and Stanford computer science PhD student, put it plainly: "Sometimes the failure mode is not on the robot or on the model. It's on us. Not being good at prompt engineering."

That's a meaningful admission. The model's generalization depends on how well a human can articulate a task. The robot figured out the air fryer, but only because a researcher spent thirty minutes finding the right words.

Levine is also clear about the current ceiling: "You can't tell it, 'Hey, go make me some toast.'" The tasks demonstrated are still relatively structured, and the company acknowledges that criticism. But they also published the numbers, and the numbers are hard to dismiss.

Why it matters

Robotics has had a version of this conversation before. Every few years there's a demo that looks like an inflection point and turns out to be a carefully constructed exception. The skepticism is earned.

What's different here is that the generalist-specialist parity result is a direct, measurable comparison against the company's own prior work, not against an abstract benchmark. Physical Intelligence tuned those specialist models specifically to perform well. The generalist matched them without that tuning. That's not a demo. That's a controlled comparison.

The company is now reportedly in discussions for a new funding round that would value it near $11 billion, roughly double its last formal valuation. Investors appear to be reading the same paper.

The standard objection to every robotics milestone is that it happened in a lab, on a task that was designed to succeed, with humans watching. That objection still applies here. But the researchers themselves say they're being surprised by what the model can do. When the people who built something can no longer predict its behavior from first principles, something has changed.

If a robot can figure out an air fryer from two unrelated training clips, the question worth sitting with isn't whether this is impressive. It's which of the tasks you currently assume require a human will quietly fall next, and whether you'd even notice when it did.

Originally published as an Instagram carousel on @recul.ai.