Google outlines new methods for training robots with video and large language models

2024 is going to be a huge year for the intersection of generative AI/large foundation models and robotics. There’s a lot of excitement swirling around the potential for various applications, ranging from learning to product design. Google’s DeepMind Robotics team is one of a number of groups exploring the space’s potential. In a blog post today, the team is highlighting ongoing research designed to give robots a better understanding of precisely what it is we humans want out of them.

Traditionally, robots have focused on doing a single task repeatedly over the course of their lives. Single-purpose robots tend to be very good at that one thing, but even they run into difficulty when changes or errors are unintentionally introduced to the proceedings.

The newly announced AutoRT is designed to harness large foundation models to a number of different ends. In a standard example given by the DeepMind team, the system begins by leveraging a Visual Language Model (VLM) for better situational awareness. AutoRT is capable of managing a fleet of robots working in tandem, each equipped with cameras to get a layout of its environment and the objects within it.

A large language model, meanwhile, suggests tasks that can be accomplished by the hardware, including its end effector. Many see LLMs as the key to unlocking robots that understand more natural language commands, reducing the need to hard-code skills.
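To make that division of labor concrete, here is a minimal Python sketch of the loop as described: one step describes the scene, another proposes tasks, and a filter keeps only what the hardware can do. It is an illustration under stated assumptions, not DeepMind's code. The names `Robot`, `describe_scene`, `propose_tasks`, `feasible` and `plan` are hypothetical, and the two model calls are replaced by simple stand-in functions.

```python
from dataclasses import dataclass


@dataclass
class Robot:
    name: str
    skills: frozenset  # hypothetical: verbs this robot's end effector supports


def describe_scene(camera_frame):
    """Stand-in for a VLM call: return the distinct objects in a frame.

    Assumption: the 'frame' here is already a list of object labels.
    """
    return sorted(set(camera_frame))


def propose_tasks(objects):
    """Stand-in for an LLM call: suggest (verb, object) tasks for a scene."""
    verbs = ("pick up", "wipe", "open")
    return [(verb, obj) for obj in objects for verb in verbs]


def feasible(robot, task):
    """Keep only tasks whose verb the robot's hardware can perform."""
    verb, _ = task
    return verb in robot.skills


def plan(robot, camera_frame):
    """Scene description -> task proposals -> hardware filter."""
    objects = describe_scene(camera_frame)
    return [task for task in propose_tasks(objects) if feasible(robot, task)]


arm = Robot("arm-1", frozenset({"pick up", "wipe"}))
tasks = plan(arm, ["sponge", "cup", "sponge"])
for verb, obj in tasks:
    print(f"{arm.name}: {verb} the {obj}")
```

In this toy version the robot never receives an "open" task because that verb is not in its skill set, which mirrors the idea of suggesting only what the hardware can accomplish.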

The system has been tested extensively over the past seven or so months. AutoRT is capable of orchestrating up to 20 robots at once, and 52 different devices in total. All told, DeepMind has collected some 77,000 trials spanning more than 6,000 tasks.

Also new from the team is RT-Trajectory, which leverages video input for robotic learning. Plenty of teams are exploring the use of YouTube videos as a method to train robots at scale, but RT-Trajectory adds an interesting layer, overlaying a two-dimensional sketch of the arm in action over the video.

The team notes, “these trajectories, in the form of RGB images, provide low-level, practical visual hints to the model as it learns its robot-control policies.”
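As a rough illustration of what an overlaid trajectory might look like in data terms, the Python sketch below draws a list of two-dimensional waypoints onto a tiny RGB image. It is an assumption-laden toy rather than RT-Trajectory's actual method: the function name `overlay_trajectory`, the nested-list image format and the straight-line interpolation between waypoints are all hypothetical choices made for the example.

```python
def overlay_trajectory(image, waypoints, color=(255, 0, 0)):
    """Return a copy of `image` with `waypoints` drawn as a polyline.

    image: list of rows, each a list of (r, g, b) tuples.
    waypoints: list of (row, col) pixel positions, e.g. of a gripper.
    """
    out = [row[:] for row in image]  # copy rows so the input is untouched
    for (r0, c0), (r1, c1) in zip(waypoints, waypoints[1:]):
        # Step along the longer axis so the drawn line has no gaps.
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for i in range(steps + 1):
            r = round(r0 + (r1 - r0) * i / steps)
            c = round(c0 + (c1 - c0) * i / steps)
            if 0 <= r < len(out) and 0 <= c < len(out[0]):
                out[r][c] = color
    return out


blank = [[(0, 0, 0)] * 5 for _ in range(5)]
sketch = overlay_trajectory(blank, [(0, 0), (0, 4), (4, 4)])
for row in sketch:
    print("".join("#" if pixel != (0, 0, 0) else "." for pixel in row))
```

The result is still an ordinary RGB image, just with the path painted on top, which is the general sense in which a trajectory can serve as a visual hint.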

DeepMind says RT-Trajectory more than doubled the success rate of its RT-2 training, at 63% compared to 29%, across 41 tasks tested.

“RT-Trajectory makes use of the rich robotic-motion information that is present in all robot datasets, but currently under-utilized,” the team notes. “RT-Trajectory not only represents another step along the road to building robots able to move with efficient accuracy in novel situations, but also unlocking knowledge from existing datasets.”