[Related to my research]
[Importance of robotics benchmarking]
[Note, only talking about mobile manipulation]
Key Works-- Diversity of scenes
Within the embodied AI benchmarking sphere, an early seminal work was VirtualHome (2018), which modeled complex household tasks as sequences of basic actions (called "programs") and collected its data by having annotators describe simulation videos as programs through a game-like interface. The VirtualHome simulator itself was built on the Unity3D game engine, which provided kinematic, physics, and navigation models [1].
VirtualHome was a pioneering work in that it covered over 500 activities and over 300 object categories that could be involved in various tasks, numbers that many follow-up works did not match. However, its weakness lay in the relatively simple kinematics and dynamics models it inherited from Unity3D, which led to imperfect collision modeling (e.g., an object held by the human avatar would sometimes sink into its hands).
FIG.1. The VirtualHome simulator and an example of a "program" for pouring milk [1]
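To make the notion of a "program" concrete, here is a minimal sketch of how a pour-milk activity could be represented as an ordered list of atomic action steps. This is only an illustration under my own assumptions (the step names and data structure are mine), not VirtualHome's actual program syntax.

```python
# Illustrative sketch: a VirtualHome-style "program" as an ordered list of
# atomic steps, each pairing an action with the objects it operates on.
# Step names and structure are assumptions for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    action: str          # atomic action, e.g. "walk", "grab", "pour"
    objects: List[str]   # objects the action operates on

# A possible "pour milk" program: walk to the fridge, fetch the milk,
# and pour it into a glass.
pour_milk_program = [
    Step("walk", ["fridge"]),
    Step("open", ["fridge"]),
    Step("grab", ["milk"]),
    Step("close", ["fridge"]),
    Step("walk", ["glass"]),
    Step("pour", ["milk", "glass"]),
]

for i, step in enumerate(pour_milk_program, start=1):
    print(f"{i}. [{step.action.upper()}] " + " ".join(f"<{o}>" for o in step.objects))
```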
Similarly, also within the domain of "instruction following" was ALFRED, a benchmark for translating language instructions into sequences of practical actions and interactions [2]. ALFRED's strengths over VirtualHome lay in its use of the more realistic AI2-THOR 2.0 simulator, the larger number of scenes it could simulate (120 vs. 6), and its inclusion of both low-level and high-level goal annotations for better downstream human-model interaction. However, its weakness lay in continued limitations on realistic physics simulation, especially its discretized action execution.
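The key idea, pairing one high-level goal with several low-level step instructions and the discrete actions that realize them, can be sketched as follows. The field names and action names below are my own illustrative assumptions, not ALFRED's actual annotation schema.

```python
# Illustrative sketch of an ALFRED-style annotation: a high-level goal,
# low-level step instructions, and a discretized action sequence.
# Field and action names are assumptions for illustration only.
annotation = {
    "high_level_goal": "Put a chilled apple on the counter.",
    "low_level_instructions": [
        "Walk to the fridge and open it.",
        "Take the apple out of the fridge.",
        "Close the fridge and place the apple on the counter.",
    ],
    "action_sequence": [
        {"action": "MoveAhead"},
        {"action": "OpenObject", "object": "Fridge"},
        {"action": "PickupObject", "object": "Apple"},
        {"action": "CloseObject", "object": "Fridge"},
        {"action": "PutObject", "object": "Apple", "receptacle": "CounterTop"},
    ],
}

# A language-conditioned agent is trained to map the instructions (plus
# egocentric observations) to the discrete action sequence.
print(len(annotation["action_sequence"]), "discretized actions")
```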
Key Works-- Strides in realism
Concurrently, a number of works pursued realistic action execution and accurate physics simulation for rigid bodies. Habitat 2.0 (2021), a virtual environment directed at household rearrangement tasks [3], emphasized realism through an artist-authored 3D dataset of real apartments (called ReplicaCAD) and a fast, high-performance physics-enabled 3D simulator (>25,000 simulation steps per second).
FIG. 2. Robot performing a rearrangement task in the Habitat 2.0 environment [3]
However, in a trend that quickly becomes evident, Habitat 2.0 is fairly limited in its scope as a benchmark, with only 6 scenes and 3 activities to simulate. Moreover, though it can execute highly realistic action sequences involving robots and a variety of objects, the physical properties it can simulate are restricted to rigid-body interactions (it cannot model fluids or deformable objects). This pattern of high simulation realism but low diversity and generalizability can also be observed in works such as TDW Transport [4], iGibson [5], and SAPIEN ManiSkill [6].
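As a side note on throughput numbers like Habitat 2.0's >25,000 steps per second, raw step rate can be measured with a simple timing loop like the one below. The `make`-style `DummySim` and its `step()` interface are hypothetical placeholders of my own, not the actual Habitat API, and real numbers depend heavily on hardware, physics settings, and whether rendering is enabled.

```python
# Minimal sketch for measuring raw simulation throughput (steps/second).
# DummySim and its step() interface are hypothetical placeholders; the real
# Habitat 2.0 API differs.
import time

def measure_steps_per_second(sim, num_steps: int = 10_000) -> float:
    start = time.perf_counter()
    for _ in range(num_steps):
        sim.step()                      # advance the simulation by one step
    elapsed = time.perf_counter() - start
    return num_steps / elapsed

class DummySim:
    """Stand-in simulator so the sketch runs on its own."""
    def step(self):
        pass

if __name__ == "__main__":
    sps = measure_steps_per_second(DummySim())
    print(f"{sps:,.0f} simulation steps per second")
```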
RFUniverse (2022), a very recent embodied AI simulator, goes much further than Habitat 2.0 in its ability to model flexible materials, deformable bodies, realistic fluids, and thermal effects (steam/fire) [7]. RFUniverse's unique focus is on pairing atomic actions with the objects they involve and enabling the appropriate physics-based interaction for each pairing.
FIG. 3. Examples of unique properties that the RFUniverse environment can model [7]
Still, it is even more limited in the breadth of its scene/activity diversity, with 5 activities and 1 scene. In this way, it serves more as a demonstration of how to effectively generate hyper-realistic simulation environments than as a comprehensive benchmark. In addition, RFUniverse cannot model continuous extended states (such as temperature or wetness).
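To clarify what I mean by "continuous extended states," here is a minimal sketch of per-object states (temperature, wetness) that change continuously over simulation steps. This is purely a conceptual illustration under my own assumptions, not the implementation of RFUniverse, BEHAVIOR, or any specific simulator.

```python
# Illustrative sketch of continuous extended object states (temperature,
# wetness) updated every simulation step. Conceptual only; not any
# simulator's actual API or dynamics model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtendedState:
    temperature_c: float = 20.0   # degrees Celsius, starts at room temperature
    wetness: float = 0.0          # 0 = dry, 1 = soaked

def update(state: ExtendedState, dt: float,
           heat_source_c: Optional[float] = None,
           in_water: bool = False) -> ExtendedState:
    # Relax temperature toward the heat source (or back toward ambient).
    target = heat_source_c if heat_source_c is not None else 20.0
    state.temperature_c += 0.1 * (target - state.temperature_c) * dt
    # Wetness rises while submerged, otherwise slowly evaporates.
    if in_water:
        state.wetness = min(1.0, state.wetness + 0.5 * dt)
    else:
        state.wetness = max(0.0, state.wetness - 0.05 * dt)
    return state

mug = ExtendedState()
for _ in range(100):                       # 100 steps of 0.1 s near a 90 C source
    update(mug, dt=0.1, heat_source_c=90.0)
print(f"Mug temperature after 10 s: {mug.temperature_c:.1f} C")
```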
Current work (and where my project BEHAVIOR fits in)
As seen above, the recurrent themes are a lack of realism (e.g., in the VirtualHome environment, a human sits on the toilet fully clothed) and a lack of diversity (e.g., RFUniverse is extremely limited in the number of scenes/activities). Therefore, to create a benchmark for embodied AI in the vein of what ImageNet did for computer vision, the benchmark should be both sufficiently broad and realistic. It should also be grounded in the activities humans most want done, not just those most convenient for robots (how often will a human rearrange fruit in a bowl vs. get a glass of water, for example?). This is where the project I have worked on for the past year, BEHAVIOR, comes in [8]. BEHAVIOR with 100 activities was released last year, and this year we have expanded it to 1,000 activities, gathered by surveying humans about which activities they would most want performed.
Footnote: Outside of mobile manipulation
In this article, I only touched upon work in the domain of mobile manipulation, meaning that the simulator can handle tasks requiring a synergistic combination of navigation and interaction. Some other classes of work include:
Navigation: Gibson (2018) [9], a virtual simulator focused on adapting real-world scenes to facilitate real-world-like perception. Gibson targeted virtualizing real spaces, rather than artificially designed ones, and, given its specific focus on perception, interfaced with the stronger, more realistic Bullet physics engine.
Static Manipulation: Rearr. T2 (OCRTOC), IKEA Furniture Assembly, RLBench, Metaworld, Robosuite, SoftGym, DeepMind