Using Natural Language in Imitation Learning to Instruct Robots

Mariano_Phielipp · 05-16-2022

Published February 1st, 2021

Mariano Phielipp is an industry expert at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.

Highlights:

Researchers from Arizona State University, Intel AI Labs, and Oregon State University used unstructured natural language for imitation learning in manipulation tasks, providing a communication channel between the human expert and the robot.
In the future, incorporating natural language into imitation learning could reduce the programming needed for autonomous robots and enable more natural interaction between humans and robots.

Robotic arms struggle to perform manipulation tasks when instructions arrive in the everyday language that people use with each other. Our team of researchers from Arizona State University, Intel AI Labs, and Oregon State University used language as a flexible goal specification for imitation learning (IL) in manipulation tasks, providing a communication channel between the human expert and the robot. During training, the model learned the correlations among language, vision, and motion control in order to generate language-conditioned control policies. These policies then gave human users a simple and intuitive interface for issuing unstructured commands.

In the future, incorporating unstructured natural language into imitation learning could reduce the programming needed for autonomous robots and enable more natural interaction between humans and robots. This could change the way automated robots are used in industries such as healthcare, retail, manufacturing, and food service. Without the need for specific sentence structures, perfect grammar, or domain-specific languages, humans could more easily direct robots to perform tasks such as picking and packing items for shipment from a retail warehouse, or preparing a meal at a restaurant. In healthcare, an autonomous wheelchair could be driven using voice instructions, while pharmacies could use robotic arms for packaging drugs.

Imitation Learning and the Communications Channel

Working with Arizona State University researchers Simon Stepputtis, Joseph Campbell, Chitta Baral, and Heni Ben Amor, and Oregon State University researcher Stefan Lee, our team presented the paper "Language-Conditioned Imitation Learning for Robot Manipulation Tasks" as a spotlight presentation at NeurIPS 2020.

Imitation learning provides an easy and engaging way to teach new skills to an agent. Instead of programming, the human provides a set of demonstrations that are turned into functional or probabilistic representations. However, a limitation of this approach is that the state representation must be carefully designed to ensure that all the information needed for adaptation is available. Neural approaches scale IL to high-dimensional spaces by enabling agents to learn task-specific feature representations. However, these methods lack a communication channel that would allow the user to provide further information about the intended task at nearly no additional cost. As a result, both the programmer and the user have to resort to numerical approaches for defining goals.

To overcome these challenges, our team developed an end-to-end, language-conditioned control policy for manipulation tasks. The policy is composed of a high-level semantic module and a low-level controller, and it integrates language, vision, and control within a single framework.

We treat policy generation as a translation process from language and vision into control. Although the approach is end-to-end, we conceptually divide it into two parts: a semantic model and a control model. The semantic model creates a unique task representation from language and vision. The control model then translates the task representation into a task-specific control policy while taking the current robot state into account.
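The two-part split can be illustrated with a deliberately simplified sketch. In our work both parts are learned neural models trained end to end; the hand-written functions below are toy stand-ins, and every name in them (SceneObject, semantic_model, control_model, policy, the step parameter) is a hypothetical chosen for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class SceneObject:
    name: str        # hypothetical label, e.g. "red bowl"
    position: Point  # hypothetical table-plane coordinates


def semantic_model(command: str, objects: Sequence[SceneObject]) -> SceneObject:
    """Toy task representation: the object whose name best matches the command.

    Assumption: word overlap stands in for the learned language/vision model.
    """
    words = set(command.lower().split())
    return max(objects, key=lambda o: len(words & set(o.name.lower().split())))


def control_model(target: SceneObject, robot_xy: Point, step: float = 0.05) -> Point:
    """Toy controller: move at most `step` toward the target from the current state."""
    dx = target.position[0] - robot_xy[0]
    dy = target.position[1] - robot_xy[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return target.position
    return (robot_xy[0] + step * dx / dist, robot_xy[1] + step * dy / dist)


def policy(command: str, objects: Sequence[SceneObject], robot_xy: Point) -> Point:
    """Language and vision -> task representation -> next control target."""
    task = semantic_model(command, objects)
    return control_model(task, robot_xy)
```

Calling policy("pour a little into the red bowl", objects, (0.0, 0.0)) once per control step moves the toy end effector toward the red bowl. The point of the sketch is only the data flow: the command and scene are resolved into a task first, and the robot state enters in the second stage.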

Evaluation: Picking and Pouring

We evaluated our approach in a simulated robot task with a table-top setup. In this task, a seven-degree-of-freedom robot manipulator was taught by an expert how to perform a combination of picking and pouring behaviors. At training time, the expert provided both a kinesthetic demonstration of the task and a verbal description such as “pour a little into the red bowl.” The table could hold several objects of different shapes, sizes, and colors, which often led to ambiguities in natural-language descriptions. The robot had to learn how to efficiently extract critical information from the available raw-data sources in order to determine what to do, how to do it, and where to go. We showed that our approach leveraged perception, language, and motion modalities to generalize the demonstrated behavior to new user commands and experimental setups.

To generate training and test data, five human experts provided templates for 200 verbal task descriptions, which were then expanded using a synonym replacement approach. Because IL requires a large number of demonstrations, we used this automatic method to create many variations of the same sentence for a task. The model was trained on 40,000 synthetically generated scenarios.
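A minimal sketch of template-based synonym replacement, under stated assumptions: the curly-brace slot syntax, the expand_template helper, and the example synonym lists are all hypothetical and are not the templates or tooling used in the study.

```python
import itertools
import re
from typing import Dict, List


def expand_template(template: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return every sentence obtained by filling each {slot} with each synonym."""
    # Unique slot names in order of first appearance, e.g. ["verb", "amount", "color"].
    slots = list(dict.fromkeys(re.findall(r"\{(\w+)\}", template)))
    options = [synonyms[slot] for slot in slots]
    return [
        template.format(**dict(zip(slots, combo)))
        for combo in itertools.product(*options)
    ]


# Hypothetical template and synonym lists, for illustration only.
template = "{verb} {amount} into the {color} bowl"
synonyms = {
    "verb": ["pour", "dispense"],
    "amount": ["a little", "a small amount"],
    "color": ["red"],
}
commands = expand_template(template, synonyms)
for sentence in commands:
    print(sentence)
```

Because the number of variations is the product of the synonym counts per slot, even a handful of templates with a few synonyms each can yield a large set of distinct commands.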

Results of Language-Conditioned Manipulation Task

We define overall task success as the percentage of cases in which the cup was first lifted and then successfully poured into the correct bowl. The model executed this sequence of steps successfully in 84% of the new environments. Picking alone achieved a 98% success rate, while pouring achieved 85%. These results indicate that the model appropriately generalized the trained behavior to changes in object position, verbal command, or perceptual input. Our results set a benchmark for successfully integrating language, vision, and control.
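As a rough illustration of how such rates can be tallied, the sketch below counts a trial as an overall success only when the cup was both picked and poured into the correct bowl. The Trial record and success_rates helper are hypothetical, and the assumption that each trial records both stages is ours for this example; it is not a description of the evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Trial:
    picked: bool               # the cup was lifted
    poured_correct_bowl: bool  # the contents ended up in the commanded bowl


def success_rates(trials: Iterable[Trial]) -> Dict[str, float]:
    """Fraction of trials that succeeded at picking, pouring, and both in sequence."""
    trials = list(trials)
    if not trials:
        raise ValueError("at least one trial is required")
    n = len(trials)
    return {
        "picking": sum(t.picked for t in trials) / n,
        "pouring": sum(t.poured_correct_bowl for t in trials) / n,
        "overall": sum(t.picked and t.poured_correct_bowl for t in trials) / n,
    }
```

Under this definition the overall rate can never exceed either of the per-stage rates, since a sequence succeeds only when every stage does.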

We used auxiliary losses to complement the loss on the generated robot control signal. Guiding both the object detection attention and the policy generation yielded performance increases in the pouring task. We also evaluated our model with five new human participants issuing commands and compared its performance with that on our synthetic language. Overall, our model responded well to new natural language from new human operators.
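One common way to combine a main objective with auxiliary terms is a weighted sum, sketched below. The total_loss helper, the term names, and the weights are hypothetical; the post does not give the exact form or weighting of the losses we used.

```python
from typing import Dict


def total_loss(
    control_loss: float,
    auxiliary_losses: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Main control loss plus a weighted sum of auxiliary terms.

    Assumption: a plain weighted sum; other combinations are possible.
    """
    return control_loss + sum(
        weights[name] * value for name, value in auxiliary_losses.items()
    )


# Hypothetical term names and values, for illustration only.
loss = total_loss(
    control_loss=1.0,
    auxiliary_losses={"attention": 0.5, "policy": 0.25},
    weights={"attention": 2.0, "policy": 4.0},
)
```

In a setup like this, the auxiliary terms give the attention and policy-generation components their own training signal, while the weights control how much they influence the control objective.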

Natural language instructions could open up new applications for machine learning and robotics in the future.