Scaling up robotic imitation learning for real-world applications requires efficient and scalable demonstration collection methods. While teleoperation is effective, it depends on costly and inflexible robot platforms. In-the-wild demonstrations offer a promising alternative, but existing collection devices have key limitations: handheld setups offer limited observational coverage, and whole-body systems often require fine-tuning with robot data due to domain gaps. To address these challenges, we present AirExo-2, a low-cost exoskeleton system for large-scale in-the-wild data collection, along with several adaptors that transform collected data into pseudo-robot demonstrations suitable for policy learning. We further introduce RISE-2, a generalizable imitation learning policy that fuses 3D spatial and 2D semantic perception for robust manipulation. Experiments show that RISE-2 outperforms prior state-of-the-art methods on both in-domain and generalization evaluations. Trained solely on adapted in-the-wild data produced by AirExo-2, the RISE-2 policy achieves performance comparable to a policy trained on teleoperated data, highlighting the effectiveness and potential of AirExo-2 for scalable and generalizable imitation learning.
We introduce AirExo-2, an updated low-cost exoskeleton system for large-scale in-the-wild demonstration collection. By transforming the collected in-the-wild demonstrations into pseudo-robot demonstrations, our system addresses key challenges in utilizing such data for downstream imitation learning in the real world.
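On the action side, the adaptor idea can be sketched as a per-joint calibration that maps raw exoskeleton joint readings onto robot joint targets. The snippet below is a minimal illustration under assumed names and an assumed linear calibration; the actual AirExo-2 adaptors also handle the observation side (e.g., producing pseudo-robot visuals) and are more involved.

```python
import numpy as np

def exo_to_robot_joints(exo_joints: np.ndarray,
                        offset: np.ndarray,
                        sign: np.ndarray) -> np.ndarray:
    """Map exoskeleton joint angles (rad) to robot joint angles (rad).

    Assumed per-joint linear calibration: q_robot = sign * (q_exo - offset).
    This is only an illustration of the action-side transform, not the
    actual AirExo-2 adaptor pipeline.
    """
    return sign * (exo_joints - offset)

# Hypothetical 7-DoF calibration, measured once at a known shared pose.
offset = np.zeros(7)   # exoskeleton reading at the robot's zero pose
sign = np.ones(7)      # +1/-1 flips for joints with mismatched axes
exo_traj = np.random.uniform(-1.0, 1.0, size=(100, 7))  # stand-in recorded demo
pseudo_robot_traj = np.array([exo_to_robot_joints(q, offset, sign)
                              for q in exo_traj])
```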
Below are sample in-the-wild demonstrations for the tasks in the paper, together with their transformed pseudo-robot demonstrations.
We propose a 3D generalizable policy, RISE-2, to facilitate efficient learning from in-the-wild demonstrations and achieve robust task performance. The design of RISE-2 centers on precise feature fusion between 2D images and 3D point clouds, simultaneously leveraging the semantic strengths of 2D vision and the spatial strengths of 3D vision.
RISE-2 Policy Architecture. RISE-2 takes an RGB-D observation as input and generates continuous actions in the camera frame. It is composed of four modules: (1) the color image is fed into the dense encoder to obtain semantic features organized in 2D form, which are then projected into sparse 3D form using reference coordinates; (2) the depth image is transformed into a point cloud and fed into the sparse encoder to obtain the local geometric features of seed points; (3) in the spatial aligner, the semantic features and the geometric features are aligned and fused using their 3D coordinates; (4) in the action generator, the fused features are converted into sparse point tokens, mapped to the action space using a transformer with sparse positional encoding (SPE), and decoded into continuous actions by a diffusion head.
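Steps (1) and (3) hinge on placing both feature streams in a shared 3D frame. The sketch below back-projects per-pixel semantic features into camera-frame 3D points using depth and intrinsics, then attaches each seed point's nearest semantic feature to its geometric feature. The function names and the nearest-neighbor fusion rule are assumptions for illustration, not the exact spatial aligner from the paper.

```python
import numpy as np

def backproject_features(feat_map, depth, K):
    """Lift a 2D feature map (C, H, W) into sparse 3D features.

    depth: (H, W) metric depth in meters; K: 3x3 camera intrinsics.
    Returns (N, 3) camera-frame coordinates and (N, C) features for
    pixels with valid depth.
    """
    C, H, W = feat_map.shape
    v, u = np.nonzero(depth > 0)          # pixel indices with valid depth
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]       # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    xyz = np.stack([x, y, z], axis=1)     # (N, 3) reference coordinates
    feats = feat_map[:, v, u].T           # (N, C) sparse semantic features
    return xyz, feats

def align_to_seeds(seed_xyz, sem_xyz, sem_feats, geo_feats):
    """Fuse semantic and geometric features by 3D proximity.

    For each seed point, take the nearest back-projected semantic
    feature and concatenate it with the seed's geometric feature.
    (Nearest-neighbor matching is an assumed stand-in for the aligner.)
    """
    d = np.linalg.norm(seed_xyz[:, None] - sem_xyz[None], axis=2)  # (S, N)
    nn = d.argmin(axis=1)
    return np.concatenate([geo_feats, sem_feats[nn]], axis=1)      # (S, Cg+Cs)
```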
Visualization of Sparse Semantic Features. We observe clear, continuous, and distinguishable feature variations in the aligned features, from which the targets at the current step can be easily identified within the entire scene. This characteristic ensures precise feature fusion in the spatial domain. The features change markedly as the task progresses, enabling the model to clearly understand the global state at the current step.
We conduct three types of evaluations in our experiments:
(1) Policy In-Domain Evaluation: assessing the policy's performance in the same environment where it was trained using teleoperated demonstrations.
(2) Policy Generalization Evaluation: evaluating how well the policy performs in an unseen environment after being trained with teleoperated demonstrations.
(3) System Evaluation: measuring the overall system performance. Specifically, the RISE-2 policy is trained using in-the-wild demonstrations collected and transformed by AirExo-2, and then deployed zero-shot on a real robot platform for performance evaluation.
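For reference, the system evaluation reduces to a standard rollout loop: train on adapted in-the-wild data, then measure real-robot success rates with no fine-tuning. The harness below is a hypothetical sketch; the `policy`, `robot`, and `task` interfaces are placeholders, not the actual RISE-2 deployment API.

```python
def evaluate(policy, robot, task, n_episodes=20, max_steps=300):
    """Roll out a trained policy on the real robot and report success rate.

    All three arguments expose hypothetical interfaces used only for
    illustration of the zero-shot evaluation protocol.
    """
    successes = 0
    for _ in range(n_episodes):
        task.reset_scene()                        # randomize object layout
        for _ in range(max_steps):
            rgb, depth = robot.get_rgbd()         # RGB-D observation
            actions = policy.predict(rgb, depth)  # camera-frame action chunk
            robot.execute(actions)
            if task.check_success():
                successes += 1
                break
    return successes / n_episodes
```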
Lift Plate (2x speed)
Open Lid (2x speed)
Close Lid (2x speed)
Collect Toys: Unseen Objects
Collect Toys: Unseen Background and Objects
Lift Plate (2x speed)
Open Lid (2x speed)
Close Lid (2x speed)
Serve Steak (3x speed)
@article{fang2025airexo,
title = {AirExo-2: Scaling up Generalizable Robotic Imitation Learning with Low-Cost Exoskeletons},
author = {Hongjie Fang and Chenxi Wang and Yiming Wang and Jingjing Chen and Shangning Xia and Jun Lv and Zihao He and Xiyan Yi and Yunhan Guo and Xinyu Zhan and Lixin Yang and Weiming Wang and Cewu Lu and Hao-Shu Fang},
journal = {arXiv preprint arXiv:},
year = {2025}
}