AirExo-2: Scaling up Generalizable Robotic Imitation Learning with Low-Cost Exoskeletons

1Shanghai Jiao Tong University, 2Shanghai Noematrix Intelligence Technology Ltd
* Equal Contribution       † Corresponding Authors

Abstract

Scaling up imitation learning for real-world applications requires efficient and cost-effective demonstration collection methods. Current teleoperation approaches, though effective, are expensive and inefficient due to their dependence on physical robot platforms. Alternative data sources such as in-the-wild demonstrations can eliminate the need for physical robots and offer more scalable solutions. However, existing in-the-wild data collection devices have limitations: handheld devices provide only restricted in-hand camera observations, while whole-body devices often require fine-tuning with robot data because of action inaccuracies. In this paper, we propose AirExo-2, a low-cost exoskeleton system for large-scale in-the-wild demonstration collection. By introducing a demonstration adaptor that transforms the collected in-the-wild demonstrations into pseudo-robot demonstrations, our system addresses key challenges in utilizing in-the-wild demonstrations for downstream imitation learning in real-world environments. We further present RISE-2, a generalizable policy that integrates 2D and 3D perception, outperforming previous imitation learning policies on both in-domain and out-of-domain tasks, even with limited demonstrations. Leveraging in-the-wild demonstrations collected and transformed by the AirExo-2 system, without any additional robot demonstrations, RISE-2 achieves performance comparable or superior to policies trained with teleoperated data, highlighting the potential of AirExo-2 for scalable and generalizable imitation learning.

AirExo-2

We introduce AirExo-2, an updated low-cost exoskeleton system for large-scale in-the-wild demonstration collection. By transforming the collected in-the-wild demonstrations into pseudo-robot demonstrations, our system addresses key challenges in utilizing in-the-wild demonstrations for downstream imitation learning in the real world.
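
As a rough illustration of the action side of this transformation (the visual side, i.e., bridging the appearance gap between the exoskeleton and the robot in the recorded images, is not shown), the sketch below maps recorded exoskeleton joint angles to pseudo-robot end-effector actions via an assumed per-joint calibration and a toy forward-kinematics model. All offsets, signs, and link lengths are illustrative assumptions, not the actual AirExo-2 calibration or adaptor.

import numpy as np

# Assumed per-joint calibration between exoskeleton encoders and robot joints
# (illustrative values only).
JOINT_OFFSET = np.array([0.0, -np.pi / 2, np.pi / 2])
JOINT_SIGN = np.array([1.0, -1.0, 1.0])

def exo_to_robot_joints(exo_angles):
    """Map raw exoskeleton encoder readings (rad) to robot joint angles."""
    return JOINT_SIGN * (np.asarray(exo_angles) - JOINT_OFFSET)

def forward_kinematics(joints, link_lengths=(0.3, 0.25, 0.1)):
    """Toy planar FK: accumulate joint angles and link translations -> (x, y, yaw)."""
    x = y = theta = 0.0
    for q, l in zip(joints, link_lengths):
        theta += q
        x += l * np.cos(theta)
        y += l * np.sin(theta)
    return np.array([x, y, theta])

# One recorded exoskeleton frame -> one pseudo-robot end-effector action.
exo_frame = [0.1, -1.4, 1.7]
print(forward_kinematics(exo_to_robot_joints(exo_frame)))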

RISE-2

We propose a generalizable 3D policy, RISE-2, to enable efficient learning from in-the-wild demonstrations and achieve robust task performance. The design of RISE-2 centers on precise feature fusion of 2D images and 3D point clouds, simultaneously leveraging the strengths of 2D vision in semantic information and of 3D vision in spatial information.

RISE-2 Policy Architecture. RISE-2 takes an RGB-D observation as input and generates continuous actions in the camera frame. It is composed of four modules: (1) the color image is fed into the dense encoder to obtain semantic features organized in 2D form, which are then projected into sparse 3D form using reference coordinates; (2) the depth image is transformed into a point cloud and fed into the sparse encoder to obtain local geometric features of seed points; (3) in the spatial aligner, the semantic and geometric features are aligned and fused using their 3D coordinates; (4) in the action generator, the fused features are converted into sparse point tokens, mapped to the action space by a transformer with sparse positional encoding (SPE), and decoded into continuous actions by a diffusion head.
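
To make this data flow concrete, below is a minimal, illustrative sketch of the four modules in plain PyTorch. The specific layers (a single convolution as the dense encoder, an MLP standing in for the sparse 3D backbone, random seed sampling, and a linear layer standing in for the SPE and diffusion head) and all dimensions are assumptions for readability, not the actual RISE-2 implementation.

import torch
import torch.nn as nn

def backproject(depth, intrinsics):
    """Lift a depth map (H, W) to per-pixel 3D reference coordinates (H*W, 3)."""
    h, w = depth.shape
    fx, fy, cx, cy = intrinsics
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    return torch.stack([x, y, z], dim=-1)

class RISE2Sketch(nn.Module):
    def __init__(self, feat_dim=128, action_dim=10):
        super().__init__()
        # (1) dense encoder: 2D semantic features from the RGB image
        self.dense_encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # (2) sparse encoder: local geometric features of seed points
        #     (a sparse 3D backbone in practice; an MLP on xyz here for brevity)
        self.sparse_encoder = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU())
        # (3) spatial aligner: fuse semantic + geometric features by 3D coordinates
        self.aligner = nn.Linear(2 * feat_dim, feat_dim)
        # (4) action generator: transformer over sparse point tokens + action decoder
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(feat_dim, action_dim)  # stand-in for a diffusion head

    def forward(self, rgb, depth, intrinsics, num_seeds=256):
        # 2D semantic features, one per pixel
        sem = self.dense_encoder(rgb.unsqueeze(0)).squeeze(0)    # (C, H, W)
        sem = sem.flatten(1).transpose(0, 1)                     # (H*W, C)
        # project pixels to 3D reference coordinates using the depth map
        coords = backproject(depth, intrinsics)                  # (H*W, 3)
        # sample seed points and take their semantic and geometric features
        idx = torch.randperm(coords.shape[0])[:num_seeds]
        seeds, seed_sem = coords[idx], sem[idx]
        geo = self.sparse_encoder(seeds)                         # (N, C)
        # align and fuse the two feature sets at shared 3D coordinates
        fused = self.aligner(torch.cat([seed_sem, geo], dim=-1)) # (N, C)
        # sparse point tokens -> transformer -> continuous action
        tokens = self.transformer(fused.unsqueeze(0))
        return self.action_head(tokens.mean(dim=1))              # (1, action_dim)

# Example usage on a synthetic 640x480 RGB-D frame with assumed intrinsics:
# policy = RISE2Sketch()
# action = policy(torch.rand(3, 480, 640), torch.rand(480, 640),
#                 (600.0, 600.0, 320.0, 240.0))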

Visualization of Sparse Semantic Features. We observe clear, continuous, and distinguishable feature variations in the aligned features, from which the targets at the current step can be easily identified within the entire scene. This characteristic ensures precise feature fusion in the spatial domain. The features also change significantly as the task progresses, enabling the model to clearly understand the global state at the current time step.
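
One common way to produce such a feature visualization (an assumption about tooling, not necessarily what was used here) is to reduce the per-point aligned features to three principal components and map them to RGB colors on the point cloud:

import numpy as np
from sklearn.decomposition import PCA
import open3d as o3d

def visualize_point_features(points, features):
    """points: (N, 3) xyz; features: (N, C) aligned per-point features."""
    # project features onto their first three principal components
    rgb = PCA(n_components=3).fit_transform(features)
    # normalize each channel to [0, 1] so it can be used as a color
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.colors = o3d.utility.Vector3dVector(rgb)
    o3d.visualization.draw_geometries([pcd])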

Experimental Results

We conduct three types of evaluations in our experiments: (1) Policy In-Domain Evaluation: Assessing the policy's performance in the same environment where it was trained using teleoperated demonstrations. (2) Policy Generalization Evaluation: Evaluating how well the policy performs in an unseen environment after being trained with teleoperated demonstrations. (3) System Evaluation: Measuring the overall system performance. Specifically, the RISE-2 policy is trained using in-the-wild demonstrations collected and transformed by AirExo-2, and then deployed zero-shot on a real robot platform for performance evaluation.

Policy In-Domain Evaluations: RISE-2


Collect Toys


Lift Plate (played at 2x speed)


Policy Generalization Evaluations: RISE-2



Collect Toys: Unseen Background


Collect Toys: Unseen Objects


Collect Toys: Unseen Background and Objects


System Evaluation: AirExo-2 and RISE-2


In the following evaluations, the RISE-2 policy is trained only with demonstrations collected and transformed by the AirExo-2 system, without any teleoperated demonstrations.

Collect Toys


Collect Toys: Unseen Background and Objects


Lift Plate (played at 2x speed)


Challenging Task: Serve Steak (long-horizon, contact-rich)

BibTeX


@article{fang2025airexo,
  title   = {AirExo-2: Scaling up Generalizable Robotic Imitation Learning with Low-Cost Exoskeletons},
  author  = {Hongjie Fang and Chenxi Wang and Yiming Wang and Jingjing Chen and Shangning Xia and Jun Lv and Zihao He and Xiyan Yi and Yunhan Guo and Xinyu Zhan and Lixin Yang and Weiming Wang and Cewu Lu and Hao-Shu Fang},
  journal = {arXiv preprint arXiv:},
  year    = {2025}
}