Published March 21st, 2022
Photo Credit: Rami Shlush
By Gal Leibovich and Guy Jacob - AI Research Engineers at the Causal and Reinforcement Learning Lab at Intel Labs, where they focus on developing AI algorithms for reinforcement and robot learning.
Robot learning research is often guided by a somewhat elusive quest: effectively transferring policies learned in simulation to the real world. If we could effectively rank the policies learned in simulation without running them in the real world, deployments would be swifter, safer, and certainly less expensive. To advance this goal, Intel Labs, in collaboration with the Technion - Israel Institute of Technology, developed a “Validate on Sim, Detect on Real” (VSDR) approach for learning robotic skills. The approach yielded promising results, which have been published in a paper to be presented at the 2022 International Conference on Robotics and Automation (ICRA). In this blog, we offer a brief recap of the methodology and our test results.
Reinforcement learning (RL) is a popular method for learning various robotic skills. Training and evaluating RL policies on a real robot, however, is expensive, time-consuming, and can even be unsafe. Simulation helps, but because simulation and reality often cannot be perfectly matched, policies trained in simulation do not necessarily perform well in the real world. This is commonly referred to as the “sim-to-real gap”.
Several techniques have been proposed to alleviate this gap. Among the most popular is Domain Randomization (DR). In DR, instead of training on a single simulation environment, policies are trained on multiple simulated domains with random environmental variations such as textures, lighting, and dynamics. The hope is that policies trained using DR will be invariant to these variations and will see the real world as just another variation, thereby mitigating the sim-to-real gap. However, successfully running RL and DR algorithms typically involves fine-tuning a large number of hyper-parameters. This can be a fast, automated process in simulation, but evaluating all learned policies in a lab with a real robot is time- and cost-prohibitive. We wanted to find out if we could effectively rank policies without extensive real-world evaluation.
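To make the idea of DR concrete, here is a minimal sketch of how environment parameters might be resampled for each training episode. The parameter names and ranges below are purely illustrative assumptions, not the configuration used in the paper:

```python
import random

# Hypothetical DR sketch: sample a fresh set of environment parameters
# per training episode, so the policy never sees one fixed simulation.
# All names and ranges here are illustrative placeholders.
def sample_randomized_domain():
    return {
        "table_texture": random.choice(["wood", "marble", "dtd_patch"]),
        "light_intensity": random.uniform(0.5, 1.5),   # relative brightness
        "object_mass_kg": random.uniform(0.05, 0.30),  # dynamics variation
        "camera_jitter_deg": random.uniform(-2.0, 2.0),
    }

# One randomized domain per training episode.
domains = [sample_randomized_domain() for _ in range(3)]
```

A policy trained across many such domains has no single "canonical" environment to overfit to, which is what makes the real world plausibly just another variation.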
Ranking Models Based on Combined Simulation and Real-World Scores
The method we propose, VSDR, combines prior evaluations completed in simulation with out-of-distribution (OOD) detection on real-world data as a way to narrow the sim-to-real gap. Essentially, we combine the following into a single policy score:
- Evaluation in simulation: Here, we evaluate the performance of learned policies on domains that they were not originally trained on. This evaluation indicates how well the model generalizes to unseen domains in simulation. A well-trained DR policy will perform well in this evaluation.
- Out-of-distribution (OOD) detection: Here, we collect some observations from the real world – i.e., the target domain – that represent the task we want to perform (a few dozen observations were sufficient for our experiments). We collect this set of observations without using the aforementioned policies trained in simulation. For many tasks in robotics, this can be done with programmed heuristics or via teleoperation. Then, we use OOD detection to evaluate how “familiar” these real-world observations are to the trained policies compared to the simulated observations seen during training. In other words, we want to determine to what extent the policies see the real world as just another domain variation. In this specific study, the OOD detection is based on Gaussian mixture models (GMM).
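The GMM-based OOD step above can be sketched as follows. This is a simplified illustration, assuming each observation has already been embedded by an intermediate layer of the policy network; the embedding dimension, component count, and synthetic data below are placeholder assumptions, not the paper's actual settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data: in practice these would be policy-network embeddings of
# simulated training observations and of a few dozen real-world observations.
rng = np.random.default_rng(0)
sim_embeddings = rng.normal(0.0, 1.0, size=(500, 16))
real_embeddings = rng.normal(0.3, 1.0, size=(50, 16))

# Fit the GMM on embeddings of observations seen during training in simulation.
gmm = GaussianMixture(n_components=4, random_state=0).fit(sim_embeddings)

# Average log-likelihood of the real-world observations under the sim-fitted
# GMM: higher means the real world looks more "familiar" to this policy.
ood_score = gmm.score(real_embeddings)
```

A per-policy OOD score of this kind can then be combined with that policy's evaluation-in-simulation score into a single VSDR ranking score; the exact combination is described in the paper.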
Figure 1. An outline of our method for selecting the best models to deploy in the real world. The bottom-left image in the diagram shows the lab setup used for real-world experiments, consisting of a Franka Panda robotic arm and an Intel RealSense D415 camera.
To evaluate our method, we performed an extensive study on a robotic grasping task with image inputs. The task goal was to have the robot grasp a cube located on a table in front of it, lift it, and hold it at a certain height.
To measure the accuracy of our policy selection method, we computed Spearman’s rank correlation coefficient between the ranking produced by VSDR and the ground-truth (GT) ranking (i.e., the ranking of policies by their real-world performance). GT can be measured by any metric suitable for the task; we experimented with both cumulative reward and binary success metrics.
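As a toy illustration of this metric, the snippet below computes Spearman's rank correlation between a hypothetical method's scores for five policies and their (made-up) real-world returns. The numbers are invented for illustration only:

```python
from scipy.stats import spearmanr

# Hypothetical scores for five policies (higher = ranked better by the method)
# and their made-up ground-truth real-world returns.
method_scores      = [0.9, 0.7, 0.8, 0.4, 0.2]
real_world_returns = [10.0, 6.0, 9.0, 3.0, 1.0]

# Spearman's rho compares only the orderings, not the raw values.
rho, p_value = spearmanr(method_scores, real_world_returns)
# Here both orderings agree exactly, so rho = 1.0.
```

Because Spearman's coefficient depends only on ranks, it is well suited to asking "did we pick the right policies to deploy?" rather than "did we predict the exact reward?".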
VSDR Outperforms Baseline Scores Across Hyper-parameters
We trained over 60 policies in simulation, using various DR configurations (see Figure 2). To obtain GT rankings, we tested each policy on 49 grasp attempts in the real world, accumulating over 130 hours of real-world evaluations. We used the Off-Policy Classification (OPC) and Soft Off-Policy Classification (SoftOPC) [i] methods as baselines for comparison.
Figure 2. Examples of image observations. The four images on the left show different DR configurations in simulation. The rightmost image shows an observation from our real-world experiments. DTD refers to the Describable Textures Dataset [ii].
We performed an extensive study of hyper-parameters. Among these are the type of DR used for evaluation in simulation, the performance metric used for GT rankings, and various GMM-related parameters. Figure 3 shows how these were combined to generate our results.
Figure 3. Spearman correlations with ground-truth rankings, for rankings based on our method and baselines. The two bar groups on the left represent rankings based only on out-of-distribution (OOD) detection in the real world (i.e., GMM-only), and the two bar groups on the right represent VSDR-based rankings. Each bar group corresponds to the number of components used for the Gaussian mixture model (GMM), and each bar refers to a different layer embedding used to fit the GMM. Horizontal lines represent correlations for rankings calculated by baseline methods. We note that VSDR outperforms the individual component scores and the baseline scores across the different hyper-parameter settings.
Our results indicate that the combination of the individual component scores – evaluation in simulation and OOD detection – is complementary. As the graph shows, VSDR outperformed the individual component scores and the baseline scores (OPC and SoftOPC). While the figure shows only a subset of all hyper-parameter combinations for clarity, these results were consistent across the board. The full analysis is available in the paper. In short, VSDR significantly improves the accuracy of policy ranking and only requires dozens of real-world observations.
For more detailed information about our method and tests, please see our paper, “Validate on Sim, Detect on Real – Model Selection for Domain Randomization.”
This research was conducted by Gal Leibovich, Guy Jacob, Shadi Endrawis, Gal Novik and Aviv Tamar.
[i] A. Irpan, K. Rao, K. Bousmalis, C. Harris, J. Ibarz, and S. Levine. Off-policy evaluation via off-policy classification. In Advances in Neural Information Processing Systems, volume 32, 2019.
[ii] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.