
Visual navigation to objects in real homes
Today's robots are largely static and isolated from humans in structured environments: think of the robotic arms Amazon uses to pick and pack products in warehouses. But the true potential of robotics lies in mobile robots operating alongside humans in messy environments like our homes and hospitals, and that requires navigation skills.
Imagine dropping a robot into a completely unseen house and asking it to find an object, say a toilet. Humans can do this easily: when looking for a glass of water at a friend's house we are visiting for the first time, we can find the kitchen without having to search the bedrooms or closets first. But teaching this kind of spatial common sense to robots is challenging.
Many learning-based visual navigation policies have been proposed to tackle this problem, but they have largely been evaluated only in simulation. How well do the different classes of methods work on a real robot?
We present a large-scale empirical study of semantic visual navigation methods comparing representative classical, end-to-end learning, and modular learning approaches in six unseen homes, with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, achieving a 90% success rate. In contrast, end-to-end learning does not, dropping from a 77% success rate in simulation to 23% in the real world because of the large image domain gap between simulation and reality.
Object goal navigation
We instantiate semantic navigation with the Object Goal navigation task, in which the robot starts in a completely unseen environment and is asked to find an instance of an object category, say a toilet. The robot has access only to a first-person RGB-D camera and a pose sensor.
This task is challenging. It requires not only spatial scene understanding to distinguish free space from obstacles and semantic scene understanding to detect objects, but also learned semantic exploration priors. For example, if a human wanted to find a toilet in a scene, most of us would head toward the hallway because it is most likely to lead to one. Teaching this kind of semantic common sense to autonomous agents is difficult. While exploring the scene for the desired object, the robot also needs to remember which areas it has already explored and which remain unexplored.
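To make the task setup concrete, here is a minimal sketch of what an Object Goal navigation episode looks like from the agent's perspective. The `Observation` fields, action names, and `env` interface are illustrative assumptions, not the exact APIs used in the study.

```python
from dataclasses import dataclass
import numpy as np

# Discrete action space typical of object goal navigation (names are illustrative).
ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]

@dataclass
class Observation:
    rgb: np.ndarray    # (H, W, 3) first-person color image
    depth: np.ndarray  # (H, W) first-person depth image, in meters
    pose: np.ndarray   # (x, y, heading) from the pose sensor
    goal: str          # goal object category, e.g. "toilet"

def run_episode(env, agent, max_steps: int = 200) -> bool:
    """Roll out one episode: the agent succeeds if it calls STOP near the goal object."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)   # the agent only sees first-person observations and pose
        if action == "STOP":
            break
        obs = env.step(action)
    # Hypothetical check: did the agent stop within the success distance of the goal?
    return env.stopped_within_threshold_of_goal()
```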
Method
So how do we train autonomous agents capable of efficient navigation while overcoming all of these challenges? The classical approach builds a geometric map from depth sensors, explores the environment with heuristics such as frontier exploration (which moves toward nearby unexplored space), and uses an analytical planner to reach exploration goals and, once it is detected, the goal object. The end-to-end learning approach predicts actions directly from raw observations with a deep neural network consisting of a visual encoder over the image frames followed by a recurrent layer that provides memory. The modular learning approach builds a semantic map by projecting predicted semantic segmentations using depth, predicts exploration goals with a goal-oriented semantic policy as a function of the semantic map and the goal object, and reaches them with an analytical planner.
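As a rough illustration of the modular pipeline described above, here is a minimal sketch in Python. The module names (`SegmentationModel`, goal-oriented policy, planner), the map dimensions, and the helpers `project_to_map` and `category_index` are assumptions for illustration, not the exact components used in the study; a sketch of `project_to_map` appears later in the analysis section.

```python
import numpy as np

class ModularAgent:
    """Sketch of a modular object-goal agent: segment -> map -> pick goal -> plan."""

    def __init__(self, segmenter, exploration_policy, planner, num_categories=16, map_size=480):
        self.segmenter = segmenter        # e.g. a pretrained semantic segmentation model
        self.policy = exploration_policy  # learned goal-oriented semantic exploration policy
        self.planner = planner            # analytical local planner
        # One map channel per object category, plus obstacle and explored-area channels.
        self.semantic_map = np.zeros((num_categories + 2, map_size, map_size), dtype=np.float32)

    def act(self, obs):
        # 1. Predict a per-pixel semantic segmentation of the RGB frame.
        seg = self.segmenter.predict(obs.rgb)                                   # (H, W) category ids
        # 2. Project segmented pixels into the top-down map using depth and pose.
        self.semantic_map = project_to_map(self.semantic_map, seg, obs.depth, obs.pose)
        # 3. If the goal category already appears in the map, navigate to it directly.
        goal_cells = np.argwhere(self.semantic_map[category_index(obs.goal)] > 0)
        if len(goal_cells) > 0:
            target = goal_cells[0]
        else:
            # 4. Otherwise, let the learned policy pick a long-term exploration goal
            #    as a function of the semantic map and the goal category.
            target = self.policy.predict_goal(self.semantic_map, obs.goal)
        # 5. Plan a low-level action toward the target with the analytical planner.
        return self.planner.next_action(self.semantic_map, obs.pose, target)
```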
Large-scale real-world empirical evaluation
While many approaches to navigating to objects have been proposed over the last few years, the learned navigation policies have largely been evaluated in simulation, which exposes the field to the risk of sim-only research that does not generalize to the real world. We address this issue through a large-scale empirical evaluation of representative classical, end-to-end learning, and modular learning approaches across 6 unseen homes and 6 goal object categories.
Results
We compare approaches in terms of success rate within a limited budget of 200 robot actions, and Success weighted by Path Length (SPL), a measure of path efficiency. In simulation, all approaches perform comparably, with success rates around 80%. But in the real world, the modular learning and classical approaches transfer surprisingly well, going from 81% to 90% and from 78% to 80% success, respectively, while end-to-end learning fails to transfer, dropping from 77% to 23% success.
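For reference, SPL averages each episode's success indicator weighted by the ratio of the shortest-path length to the length of the path the agent actually took. A minimal sketch (the function name and argument layout are my own):

```python
def spl(successes, shortest_path_lengths, agent_path_lengths):
    """Success weighted by Path Length.

    successes: list of 0/1 episode outcomes
    shortest_path_lengths: geodesic distance from start to the goal object, per episode
    agent_path_lengths: length of the path the agent actually traveled, per episode
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_path_lengths, agent_path_lengths):
        total += s * l / max(p, l)  # an efficient successful episode scores close to 1
    return total / len(successes)

# A successful episode whose path is ~11% longer than the shortest path
# scores roughly 1 * l / (1.11 * l) ≈ 0.90; a failed episode scores 0.
```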
We illustrate these results qualitatively with one representative trajectory. All approaches start in the bedroom and are tasked with finding a couch. On the left, modular learning reaches the couch goal first. In the middle, end-to-end learning fails after colliding too many times. On the right, the classical approach finally reaches the couch after a detour through the kitchen.
Result 1: modular learning is reliable
We found that modular learning is very reliable on a robot, with a 90% success rate. Here, we can see it efficiently finding a plant in the first house, a chair in the second, and a toilet in the third.
Result 2: modular learning explores more efficiently than the classical approach
Modular learning improves the real-world success rate by 10% over the classical approach. On the left, the goal-oriented semantic exploration policy heads straight to the bedroom and finds the bed in 98 steps with an SPL of 0.90. On the right, because frontier exploration is agnostic to the bed goal, the classical policy makes a detour through the kitchen and the entrance hallway before finally reaching the bed in 152 steps with an SPL of 0.52. With a limited step budget, inefficient exploration can lead to failure.
Result 3: end-to-end learning fails to transfer
While the classical and modular learning approaches work well on a robot, end-to-end learning does not, with a success rate of only 23%. The policy frequently collides, revisits the same places, and even fails to stop in front of the goal object when it is in sight.
Analysis
Insight 1: why does modular learning transfer while end-to-end learning doesn't?
Why does modular learning transfer so well while end-to-end learning doesn't? To answer this question, we reconstructed one real-world house in simulation and ran identical episodes in sim and reality.
The semantic exploration policy of the modular learning approach takes a semantic map as input, while the end-to-end policy operates directly on RGB-D frames. The semantic map space is invariant between sim and reality, while the image space exhibits a large domain gap. In this example, the gap leads a segmentation model trained on real-world images to predict bed false positives in the kitchen.
The domain invariance of semantic maps allows the modular learning approach to transfer well from sim to reality. In contrast, the image domain gap causes a large performance drop when segmentation models trained on real-world images are evaluated in simulation, and vice versa. If semantic segmentation transfers poorly from sim to reality, it is reasonable to expect end-to-end semantic navigation policies trained on simulated images to transfer poorly to real-world images as well.
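To make the notion of a domain-invariant semantic map concrete, here is a minimal sketch of how segmented pixels might be projected into a top-down map using depth and pose (the `project_to_map` helper assumed in the earlier agent sketch). The field of view, map resolution, and subsampling are illustrative assumptions.

```python
import numpy as np

def project_to_map(semantic_map, segmentation, depth, pose,
                   fov_deg=79.0, cell_size_m=0.05):
    """Project per-pixel semantic labels into a top-down map.

    semantic_map: (C, M, M) grid of per-category hit counts
    segmentation: (H, W) predicted category id per pixel
    depth:        (H, W) depth in meters
    pose:         (x, y, heading) of the robot in map coordinates
    """
    H, W = depth.shape
    f = (W / 2.0) / np.tan(np.deg2rad(fov_deg) / 2.0)  # focal length from horizontal FOV
    xs = (np.arange(W) - W / 2.0) / f                   # normalized horizontal pixel directions

    for v in range(0, H, 4):          # subsample pixels for speed
        for u in range(0, W, 4):
            d = depth[v, u]
            if d <= 0:                # skip invalid depth readings
                continue
            # Point in the robot frame: forward along the optical axis, lateral offset from xs.
            forward, lateral = d, d * xs[u]
            # Rotate and translate into the map frame using the pose estimate.
            x = pose[0] + forward * np.cos(pose[2]) - lateral * np.sin(pose[2])
            y = pose[1] + forward * np.sin(pose[2]) + lateral * np.cos(pose[2])
            i, j = int(x / cell_size_m), int(y / cell_size_m)
            if 0 <= i < semantic_map.shape[1] and 0 <= j < semantic_map.shape[2]:
                semantic_map[segmentation[v, u], i, j] += 1
    return semantic_map
```

Because this map is built from geometry (depth and pose) rather than raw pixel appearance, the same map-space policy can be fed maps built in simulation or in the real world.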
Insight 2: the sim-vs-real gap in error modes for modular learning
Surprisingly, modular learning works better in reality than in simulation. Detailed analysis reveals that many of the modular policy's failures in simulation are caused by reconstruction errors, which do not exist in the real world: visual reconstruction errors account for 10% of the 19% of failed episodes, and physical reconstruction errors for another 5%. In contrast, failures in the real world are predominantly due to depth sensor errors, while most semantic navigation benchmarks in simulation assume perfect depth sensing. Besides explaining the performance gap between sim and reality for modular learning, this gap in error modes is cause for concern because it limits the usefulness of simulation for diagnosing bottlenecks and further improving policies. We show representative examples of each error mode and propose concrete steps to close this gap in the paper.
Takeaways
For practitioners:
- Modular learning can reliably navigate to objects with 90% success.
For researchers:
- Models that rely on RGB images are hard to transfer from sim to real => leverage modularity and abstraction in policies.
- Error modes differ between sim and real => evaluate semantic navigation on real robots.
For more content on robotics and machine learning, see my blog.
Theophile Gervet is a PhD student in the Department of Machine Learning at Carnegie Mellon University