Forget AI that can draw pictures: Google’s latest AI model can control a robot.
On Friday, Google introduced Robotics Transformer 2 (RT-2), a vision-language-action (VLA) model that takes text and images as input and translates them into robotic actions.
“Just like language models are trained on text from the web to learn general ideas and concepts, RT-2 transfers knowledge from web data to inform robot behavior,” Vincent Vanhoucke, Head of Robotics for Google DeepMind, explains in a blog post. “In other words, RT-2 can speak robot.”
Vanhoucke says that while chatbots can be trained by feeding them information about a topic, robots need to take things a step further and get “grounding” in the real world. The example he provides is a red apple. While you could simply explain to a chatbot what an apple is, a robot needs to know everything about the apple, distinguish it from a similar item (for instance, a red ball), and learn how to pick it up.
RT-2 goes beyond Google's RT-1 and other models by using data from the web. For instance, if you wanted a previous model to throw something away, you would need to train it on what trash is and how to dispose of it. With RT-2, even if you haven’t explained what trash is or how to dispose of it, the robot can figure that part out on its own using web data.
With RT-2, robots can learn and then apply that knowledge to future situations. That said, Google notes that in its current form, RT-2 can only help a robot get better at physical tasks it already knows how to do, not learn them from scratch.
Still, it’s a huge step forward and shows us what might be possible in the future. For more, Google goes into detail on how RT-2 works on its DeepMind blog.


