The State of Robot Learning
A partially observed, semi-stochastic, egocentric view.
This article is a companion to my upcoming talk at Nvidia GTC, with a sprinkle of spice on top. These are my personal views, not Google DeepMind’s, and don’t necessarily reflect the diversity of perspectives from my team and colleagues.
The buzz at the most recent Conference on Robot Learning, which took place six months ago in Atlanta as of this writing, was noticeably different from any other installment I have attended since its inception seven years ago. There was a sense of change in the air. Many presentations showcased robot systems that actually … kind of worked! Albeit within the confines of certain academic definitions of “working.” Where the community had previously struggled with simple tasks akin to stacking red blocks on top of blue blocks, we were now seeing systems that exhibited real progress on complex, real-world problems.
I often joke that a career in research is about committing to a lifetime of working on things that don’t work. And during hallway conversations between sessions, you could hear researchers asking themselves: “Now what?” Not because things were actually working well enough that anyone would call “mission accomplished” on the entire enterprise. Rather, it reflected a realization that the pace of progress in robotics had accelerated, and that this called for a reevaluation of both the research directions and the methodologies employed.
How did we get there? Well, like all things AI, we can trace the shift back to GPT and the dawn of modern LLMs circa 2021. Suddenly, unprecedented reasoning capabilities seemed to be at everyone’s fingertips, and AGI was around the corner. Something else happened to the world of robotics around that time, much less reported in the popular press: massively increased levels of FOMO. Robotics, or “Embodied AI” in its more fashionable garb, was supposed to be the path to AGI, the one true solution to situated intelligence, real-world grounding, and commonsense reasoning. Seeing “Language Modeling,” an obscure subfield that the NLP community had arguably left languishing for a decade, suddenly hogging the spotlight was a gut punch to the community.
Of course, if you can’t beat them, join them, and so we did. That line of research, “robotics meets LLMs,” could have turned out extremely shallow: maybe you could use a language model to talk to your robot; or have your robot recite Klingon poetry. Big deal. But what actually happened would turn out to be the biggest surprise of my career: the connections ran deep, so deep in fact that to this day we’ve only started to scratch the surface.
It is a common mistake to think of LLMs as being about “language.” Language is indeed the surface form they primarily use (as well as code), but an LLM’s real superpower is commonsense reasoning: LLMs know simple truths such as “a book belongs on a shelf, not in a bathtub,” and they know how one would go about making coffee, and that turns out to be a big deal for embodied agents trying to act in the real world. So it’s no surprise that the first corner of robotics to be affected by LLMs would be planning.
It may help to draw a diagram of how a robot works conceptually. I am sure that I won’t make many friends in the community by (over-)simplifying things to this degree, but in the spirit of “all models are wrong, but some are useful,” here is one I find particularly illuminating. Picture a loop where your robot perceives the state of the world, sends that state to a planner, which, in combination with a stated goal, writes a plan to achieve it. That plan gets sent to a robot controller, which is in charge of executing it by actuating the hardware. Of course, the world keeps on changing, so maybe only the first steps of that plan get executed, until the state estimate gets updated, the robot replans, executes the next steps, and so on.
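To make that loop concrete, here is a minimal sketch in Python. Everything in it is a stub I made up for illustration; real state estimators, planners, and controllers are of course far more involved. The structural point it tries to capture is that only the first step or two of each plan gets executed before the robot re-perceives and replans.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str


def estimate_state(sensor_reading):
    """State estimation: turn raw sensing into a world state (stubbed here)."""
    return {"objects": sensor_reading}


def plan(state, goal):
    """Task planning: produce a sequence of actions toward the goal (stubbed)."""
    return [Action(f"locate {goal}"), Action(f"grasp {goal}"), Action(f"place {goal}")]


def execute(action):
    """Control: actuate the hardware to carry out one action (stubbed)."""
    print("executing:", action.name)


def control_loop(goal, get_sensors, cycles=3, steps_per_cycle=1):
    """Perceive, plan, execute only the first step(s), then re-perceive and replan."""
    for _ in range(cycles):
        state = estimate_state(get_sensors())
        steps = plan(state, goal)
        for action in steps[:steps_per_cycle]:
            execute(action)


control_loop("coffee mug", get_sensors=lambda: ["table", "coffee mug"])
```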
This is a very rough sketch that has the merit of mapping onto the major subfields in robotics: state estimation, task and motion planning, and control. Each has traditionally developed in parallel, and many would argue that this resulted in system-level issues being largely ignored and problems being tossed around like a hot potato across fields: too many TAMP papers taking perfect state estimation as a given, too many control strategies happy to summarily reject a plan they can’t execute — and don’t get me started on trying to get gradients to flow across those boundaries!
So of course the first disruption came for the planner. I am probably biased, but I’ll single out SayCan as the “aha” moment where the community realized that if we lifted much of planning into “semantic space,” as opposed to trying to do it in geometric space, we could use LLMs to do the job and inherit all the benefits of their commonsense abilities without having to collect any data, craft robot-specific ontologies, or build symbolic reasoning engines.
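For intuition, here is a toy sketch of the SayCan-style scoring idea: an LLM rates how useful each skill description sounds as the next step (“say”), a learned affordance function rates how likely that skill is to succeed from the current state (“can”), and the robot picks the skill that maximizes the product. The two scoring functions below are dummy placeholders, not the actual models from the paper.

```python
import math

# Available skills, each described in natural language so an LLM can score them.
SKILLS = ["pick up the sponge", "go to the sink", "pick up the apple", "wipe the table"]


def llm_log_likelihood(instruction, skill):
    """Placeholder for log P(skill text | instruction + steps so far) from an LLM."""
    return -len(skill)  # dummy value, for illustration only


def affordance(skill, state):
    """Placeholder for a value function estimating P(skill succeeds | current state)."""
    return 0.9 if skill.split()[-1] in state["visible"] else 0.1


def next_skill(instruction, state):
    """Combine the semantic ("say") and affordance ("can") scores and pick the best skill."""
    scores = {
        s: math.exp(llm_log_likelihood(instruction, s)) * affordance(s, state)
        for s in SKILLS
    }
    return max(scores, key=scores.get)


print(next_skill("clean up the spill", {"visible": ["sponge", "table"]}))
```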
Once your interfaces to both perception and actuation use natural language, it’s very tempting to use language as the de facto API for everything. And language has a lot going for it: it’s flexible, it’s interpretable, and it can describe things at any level of abstraction you choose. That last point was a huge problem for fixed APIs: it may be OK for the world to look like a collection of bounding boxes to a self-driving car, for instance, but as soon as you’re actually trying to make contact with objects, you probably want richer geometry and semantics. Maybe the planner doesn’t know in advance what information would be useful for the perception module to provide. Maybe it should be a two-way conversation …
Which brings me to the next stage of that journey: let’s have the planner and the perception system both use natural language. VLMs are getting really good, so let’s leverage them and make that two-way conversation an actual dialog. This is the idea behind Socratic Models, where a consensus about what the state of the world is, and how to act on it, can be reached through dialog between models. Inner Monologue took the concept further by making periodic state reestimation and replanning part of that conversation.
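Here is a toy sketch of that “internal dialog” pattern, loosely in the spirit of Socratic Models and Inner Monologue: a VLM narrates the scene in text, an LLM proposes the next step given the running transcript, and execution feedback gets appended so the system can replan. All three model calls below are stubs of my own making.

```python
def describe_scene(image):
    """Stub VLM: turn camera input into a text description of the scene."""
    return "a mug on the counter, the dishwasher is open"


def propose_step(transcript):
    """Stub LLM: propose the next step (or 'done') given the dialog so far."""
    return "done" if "success" in transcript else "put the mug in the dishwasher"


def try_step(step):
    """Stub controller: attempt the step and report success or failure."""
    return True, "success"


def inner_monologue(goal, get_image, max_steps=5):
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        transcript.append("Scene: " + describe_scene(get_image()))
        step = propose_step("\n".join(transcript))
        if step == "done":
            break
        ok, feedback = try_step(step)
        # Failures stay in the transcript, so the LLM can replan around them.
        transcript.append(f"Robot: {step} -> {feedback}")
    return transcript


print("\n".join(inner_monologue("load the dishwasher", get_image=lambda: None)))
```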
Once you have an LLM as the backbone of your robot intelligence, you can do lots of new things. For instance, with AutoRT, we used the LLM to dream up new tasks to perform, which immediately raised the question: if the robot comes up with its own things to do, how do we ensure they’re safe and beneficial? We can prompt the LLM with specific concepts of safety (“don’t pick up sharp objects”) or even more general human-centered values: “you may not injure a human being …” Sound familiar? If you had told me a couple of years ago that we would have a not-entirely implausible path to implementing Asimov’s laws of robotics on an actual robot, I would not have believed you. Time will tell if using Constitutional AI as part of a robot’s safety stack is practical, but the fact that we can even talk about it and evaluate it in the real world is unprecedented.
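As a toy illustration of how such a constitution might be wired in, here is a sketch where candidate tasks are filtered by an LLM critique against a short list of written rules. The llm() call, the rules, and the task proposer are placeholders of my own, not AutoRT’s actual pipeline.

```python
# A small written "constitution" the critic LLM is asked to enforce.
CONSTITUTION = """\
1. A robot may not injure a human being.
2. Do not pick up sharp or fragile objects.
3. Do not interact with anything a human is currently holding."""


def llm(prompt):
    """Stand-in for a real LLM call; answers whether a task complies with the rules."""
    return "no" if "knife" in prompt else "yes"


def propose_tasks(scene_description):
    """Stand-in for an LLM that dreams up candidate tasks from a scene description."""
    return ["pick up the sponge", "pick up the knife", "push in the chair"]


def safe_tasks(scene_description):
    """Keep only the proposed tasks that the critic judges compliant."""
    approved = []
    for task in propose_tasks(scene_description):
        verdict = llm(
            f"Constitution:\n{CONSTITUTION}\n\n"
            f"Scene: {scene_description}\nTask: {task}\n"
            "Does this task comply with every rule? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            approved.append(task)
    return approved


print(safe_tasks("a kitchen counter with a sponge, a knife, and a chair"))
```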
What about the actuation component? Can we chip away at that final bastion of traditional robotics and give it the LLM treatment as well? One thing LLMs are really good at is generating code, and at the end of the day, controller software is just that: a piece of code describing a policy. This is where the concept of Code as Policies comes in: the idea that you can prompt an LLM with low-level control APIs and let it describe the actual policy to execute.
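Here is a toy sketch of that pattern: prompt a code model with the signatures of a few control primitives, get a policy back as Python, and run it with those primitives in scope. The API names (detect, move_to, grasp, release) and the canned model output below are purely illustrative, not any real robot stack.

```python
# Documentation of the control primitives we expose to the code model.
API_DOC = """\
detect(name: str) -> (x, y)   # coordinates of a named object
move_to(x: float, y: float)   # move the end effector
grasp() / release()           # toggle the gripper"""


def code_llm(prompt):
    """Stand-in for a code model; returns a policy as a string of Python."""
    return (
        "x, y = detect('red block')\n"
        "move_to(x, y)\n"
        "grasp()\n"
        "x, y = detect('blue block')\n"
        "move_to(x, y)\n"
        "release()\n"
    )


def run_policy(instruction, primitives):
    """Ask the model for a policy, then execute it with the real primitives in scope."""
    policy_code = code_llm(f"# API\n{API_DOC}\n# Task: {instruction}\n")
    exec(policy_code, dict(primitives))


run_policy(
    "stack the red block on the blue block",
    primitives={
        "detect": lambda name: (0.1, 0.2),
        "move_to": lambda x, y: print(f"move_to({x}, {y})"),
        "grasp": lambda: print("grasp"),
        "release": lambda: print("release"),
    },
)
```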
Code LMs do that zero-shot very well, provided you’re proficient in the dark arts of prompt design. You can use a dialog strategy to iterate on prompts, as was done in early ChatGPT for Robotics experiments from our colleagues at Microsoft. But it’s even better when you can close the loop and fine-tune your Code LM based on interactive prompting and refinement of your control behaviors. This is something we dubbed Language Model Predictive Control by analogy to classic Model Predictive Control. Not only does this enable models to perform better zero-shot on novel tasks, but it also enables them to learn faster from user interactions.
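A toy sketch of the data side of that loop, under a log format I am assuming for illustration rather than anything from the paper: each interactive teaching session becomes a set of (prompt, target code) pairs that can then be fed into supervised fine-tuning of the code LM.

```python
import json


def session_to_examples(session):
    """Turn one interactive teaching session into (prompt, target) fine-tuning pairs."""
    examples = []
    history = [f"Task: {session['task']}"]
    for turn in session["turns"]:
        history.append(f"User: {turn['feedback']}")
        # The model should predict the code the user eventually accepted.
        examples.append({"prompt": "\n".join(history), "target": turn["code"]})
        history.append(f"Code:\n{turn['code']}")
    return examples


# Hypothetical logged session: the user iteratively refines a control behavior.
session = {
    "task": "draw a square with the end effector",
    "turns": [
        {"feedback": "make it 10 cm wide",
         "code": "for _ in range(4): forward(0.10); turn(90)"},
        {"feedback": "slower please",
         "code": "set_speed(0.2)\nfor _ in range(4): forward(0.10); turn(90)"},
    ],
}

# Write out a JSONL file in the shape most fine-tuning pipelines expect.
with open("sft_data.jsonl", "w") as f:
    for ex in session_to_examples(session):
        f.write(json.dumps(ex) + "\n")
```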
Once this picture of various LLMs engaged in an internal chat room emerges as the backbone of the robot, it is fair to question whether the factorization of the problem into three components with distinct roles is still useful: neural networks can communicate with each other through high-bandwidth, differentiable representations, so why force them to collapse everything down to words? You do gain some degree of interpretability, but the information loss is significant. Consider, for instance, that your planner is still essentially blind. Can we merge some of these components? Not out of end-to-end zealotry, but because there is already inherent modularity in these models, thanks to the way different components of a transformer attend to each other, allowing us to reproduce this ‘separation of concerns’ within the neural network itself.
The first experiment in this direction attempted to merge perception and planning with PaLM-E. It worked remarkably well, in that there were clear gains from jointly training the perception and planning modules, as well as evidence of transfer across tasks and embodiments.
The next experiment was to merge perception and actuation, while leaving the planner (which needs to be a large enough model to be useful) out of the loop. Many papers have come up with “pixels to action” models, but for us the transformative approach was RT-1. A lot has been happening in this space in recent months, with approaches like Action Diffusion from TRI, Humanoid Transformer from Berkeley, and ACT and Octo from Stanford; it has been wonderful to see the explosion in performance and capabilities.
By now you see where this is going: why stick to half measures when you can go all the way and train a single “robot brain” that does everything? Our first attempt at this was RT-2, and it really demonstrated how much being able to reason jointly about the entire problem, while still leveraging non-robotics data sources (internet data for perception and semantic understanding in particular), benefited the end result. VC-1 from our colleagues at Meta was another notable step in that general direction. I expect many more in the future now that the open-source ecosystem for multimodal models is blossoming, and people are exploring how far we can push these models’ spatial reasoning abilities.
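One ingredient worth illustrating is the “actions as tokens” trick that RT-2-style vision-language-action models rely on: continuous robot actions are discretized into bins and written out as plain text, so the same decoder that emits words can also emit motor commands. The bin count and action layout below are illustrative assumptions, not RT-2’s exact recipe.

```python
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assume actions are normalized to [-1, 1]


def action_to_tokens(action):
    """Map each continuous action dimension to an integer bin, rendered as text."""
    bins = [min(N_BINS - 1, int((a - LOW) / (HIGH - LOW) * N_BINS)) for a in action]
    return " ".join(str(b) for b in bins)


def tokens_to_action(text):
    """Invert the mapping: text tokens back to (approximate) continuous values."""
    return [LOW + (int(t) + 0.5) * (HIGH - LOW) / N_BINS for t in text.split()]


a = [0.12, -0.40, 0.95]              # e.g. a delta-pose command
tok = action_to_tokens(a)            # "143 76 249"
print(tok, tokens_to_action(tok))    # round-trips to within one bin width
```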
Where do we go from here? The semi-supervised revolution has come and gone. RL is still licking its wounds, with Diffusion Models giving it a good run for its money and Offline RL mounting a comeback. It’s clear that despite marked improvements in data efficiency, which enabled us to envision operating “in the real,” as opposed to having to bet everything on sim-to-real approaches, we’re still data bound, and improving the efficiency and diversity of data collection is key.
Therein lies the biggest tension in the field today: on one hand, we’re seeing that cross-embodiment models work amazingly well at transferring capabilities across robots, which argues for dialing up the variety of robots and tasks and taking a diversity-focused approach to the problem. On the other hand, more and more people are clamoring for the “one embodiment to rule them all” approach to robot learning, with literally billions of dollars of investment going into generalist robots from the likes of Tesla, Figure, 1X, Agility, Unitree, Sanctuary, Apptronik, and many others. The upside of that latter approach is universal capabilities, particularly suited to learning from human embodiments and to being deployed in human spaces. The downside is that you end up putting the building, deployment, and maintenance of a very complex and expensive piece of hardware in the critical path of scaling your data collection, and you significantly raise the bar for the economics of the end product to make sense.
That’s a risky bet, particularly if the cross-embodiment hypothesis turns out to be true, in which case a swarm of cheap robots, as exemplified by experiments like Aloha, UMI, Stretch (and Dobb.E), Mobile Aloha, and Aloha 2, could completely disrupt the field and bring dexterous capabilities to a wide variety of cheaper embodiments. But I would not put my money 100% on either side of that bet at this point; the months ahead are going to be critical to informing where the field should go: diversity vs. universality, cheap and scrappy vs. high-DOF and full-featured. The world we’re in today is a bit of both: “universal” arms are widely deployed, but the economics of making them worthwhile are brutal, and they generally require bespoke tooling and systems integration to make them actually useful.
One conviction I am developing in the context of data scaling is that generative models are the future of simulation, and that instead of just chasing “better simulators,” we should also be working out how to make generative 3D and video respect not only appearance but also physics and spatial relationships. Every other class of physics simulator out there is being disrupted by generative models, from weather forecasting to protein folding. If, instead of having to spend hours authoring a digital replica of my robot and environment, I could capture my robot’s sensors and simply press “play” to generate possible futures, as we did for instance in Video Language Planning, we could solve both the sim-to-real gap and the scene-authoring problem in one fell swoop.
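To make that “press play” idea concrete, here is a toy sketch loosely in the spirit of Video Language Planning: propose a few candidate plans, roll each one forward with a generative video model from the current camera frames, score the imagined futures with a VLM, and keep the best plan. All three model calls below are stubs I am assuming for illustration.

```python
def propose_plans(goal, k=3):
    """Stub LLM: propose k candidate plans for the goal."""
    return [f"candidate plan {i} for: {goal}" for i in range(k)]


def video_model(frames, plan):
    """Stub generative video model: imagine how the scene evolves under the plan."""
    return frames + [f"imagined frames following '{plan}'"]


def vlm_score(imagined_future, goal):
    """Stub VLM critic: rate how well the imagined future satisfies the goal."""
    return float(goal.split()[-1] in imagined_future[-1])


def best_plan(goal, frames):
    plans = propose_plans(goal)
    futures = {p: video_model(frames, p) for p in plans}
    return max(plans, key=lambda p: vlm_score(futures[p], goal))


print(best_plan("set the table", frames=["current camera frame"]))
```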
I am also becoming increasingly convinced that data scaling is largely turning into an HRI problem. Perhaps not the exact problem the HRI community tends to worry about today, however; most HRI work concerns itself with end users interacting with robots. We need better HRI approaches to designing new and diverse tasks, robustly collecting data, and refining behaviors: improving and streamlining all the careful design required to collect the data that trains models and behaviors, before an end user even begins to interact with the final product. And we’ve just given the HRI community a golden opportunity to break new ground, by essentially turning every robot out there into a chatbot with eyes, arms, and legs.