This note was generated by AI analysis.
Created: 2026-03-28 Source: https://human-as-robot.github.io/
Summary
Qiu et al. (2025, arXiv:2503.13441) address the data bottleneck in humanoid robot learning by using egocentric human demonstrations as cross-embodiment training data. They collect PH2D, a task-oriented egocentric dataset, and train HAT (Human Action Transformer), a single policy shared by humans and humanoid robots that uses differentiable action retargeting.
Prerequisites
- Imitation learning / behavior cloning
- Humanoid robot manipulation (teleoperation, embodiment gap)
- Transformer-based policy architectures
- Cross-embodiment transfer / sim-to-real
Core Idea
Teleoperated robot data is expensive and hard to scale. Human egocentric video is cheap and abundant but suffers from an “embodiment gap” (different kinematics, perspective, interaction physics). The paper addresses this gap from two angles:
- Data: PH2D is a task-oriented egocentric dataset whose tasks are aligned with humanoid manipulation rather than drawn from random daily activities, which reduces the kinematic mismatch between human and robot demonstrations
- Model: HAT uses a unified state-action space for both humans and humanoids. A differentiable retargeting module maps human hand/body poses to robot joint angles, enabling co-training without separate human/robot decoders
HAT is co-trained on large-scale human data + smaller-scale robot teleoperation data, treating humans and robots as different “embodiments” within the same framework.
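The co-training setup described above can be illustrated with a minimal sketch. This is not the authors' data pipeline: the `Step` record, its field layout, the `sample_cotraining_batch` helper, and the `robot_fraction` mixing ratio are all hypothetical names and values chosen for illustration. The only point it makes is that, once humans and robots share one state-action layout, samples from both sources can be mixed in a single batch and distinguished by an embodiment tag alone.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One timestep in a shared state-action layout (hypothetical fields)."""
    embodiment: str       # "human" or "robot"
    state: List[float]    # same layout for both embodiments
    action: List[float]   # same layout for both embodiments


def sample_cotraining_batch(human: List[Step], robot: List[Step],
                            batch_size: int, robot_fraction: float,
                            rng: random.Random) -> List[Step]:
    """Draw a mixed batch with a fixed share of robot samples.

    robot_fraction is an assumed hyperparameter; the source does not
    state how human and robot data are weighted.
    """
    n_robot = round(batch_size * robot_fraction)
    n_human = batch_size - n_robot
    # Sample with replacement so the smaller robot set can be oversampled.
    batch = rng.choices(robot, k=n_robot) + rng.choices(human, k=n_human)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    human = [Step("human", [0.0] * 4, [0.0] * 4) for _ in range(100)]
    robot = [Step("robot", [1.0] * 4, [1.0] * 4) for _ in range(10)]
    batch = sample_cotraining_batch(human, robot, 8, 0.25, rng)
    print([s.embodiment for s in batch])
```

Sampling with replacement is one simple way to keep the smaller robot set represented in every batch; other weighting schemes would fit the same interface.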
Results
- Human data improves both generalization (new tasks/objects) and robustness (perturbation recovery) of the manipulation policy
- Collecting human demonstrations is significantly more efficient than collecting data by pure teleoperation
- Demonstrated on real humanoid robot platforms
Limitations
Author-stated:
- Embodiment gap cannot be fully closed; some tasks may require robot-specific fine-tuning
- PH2D dataset collection is still task-directed, not passively harvested
Unstated:
- Differentiable retargeting quality depends on anatomical similarity between human and robot morphology, so the approach may not generalize to non-humanoid robots
- Evaluation scope limited to tabletop manipulation; locomotion and whole-body tasks not addressed
Reproducibility
- Code/Data: Project page at human-as-robot.github.io (dataset and code availability not fully specified in abstract)
- Compute: Standard robotics lab training scale
Insights
The key contribution is demonstrating that task-aligned egocentric human video can meaningfully reduce the robot data requirement for humanoid manipulation without requiring expensive motion capture. The differentiable retargeting approach is elegant: it turns the embodiment gap into a differentiable function rather than a hard domain gap, so gradients can flow through the human-to-robot mapping and the policy can be learned end to end. This work is part of a broader trend (alongside AnyDex, UMI, etc.) of treating cheap human video as a scalable source of robot training signal.
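To make the idea of a differentiable mapping concrete, here is a toy sketch. It is not HAT's retargeting module, whose internals are not described in this note. It assumes a planar two-link arm with hypothetical link lengths, and it treats retargeting as minimizing the distance between the robot end effector and a human wrist target by gradient descent on the joint angles. All function names are illustrative, and the gradient is derived by hand instead of using an autodiff library.

```python
import math

# Hypothetical link lengths in metres for a planar two-link arm.
LINK_1, LINK_2 = 0.30, 0.25


def forward_kinematics(q1, q2):
    """End-effector position (x, y) for joint angles q1, q2 in radians."""
    x = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    y = LINK_1 * math.sin(q1) + LINK_2 * math.sin(q1 + q2)
    return x, y


def loss_and_grad(q1, q2, target):
    """Half squared distance to the target and its gradient w.r.t. (q1, q2)."""
    x, y = forward_kinematics(q1, q2)
    ex, ey = x - target[0], y - target[1]
    # Jacobian of the end-effector position with respect to the joints.
    dx_dq1 = -LINK_1 * math.sin(q1) - LINK_2 * math.sin(q1 + q2)
    dx_dq2 = -LINK_2 * math.sin(q1 + q2)
    dy_dq1 = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    dy_dq2 = LINK_2 * math.cos(q1 + q2)
    loss = 0.5 * (ex * ex + ey * ey)
    grad = (ex * dx_dq1 + ey * dy_dq1, ex * dx_dq2 + ey * dy_dq2)
    return loss, grad


def retarget(target, q_init=(0.3, 0.3), lr=2.0, steps=2000):
    """Plain gradient descent from q_init toward joint angles reaching target."""
    q1, q2 = q_init
    for _ in range(steps):
        _, (g1, g2) = loss_and_grad(q1, q2, target)
        q1 -= lr * g1
        q2 -= lr * g2
    return q1, q2


if __name__ == "__main__":
    wrist_target = (0.35, 0.25)  # assumed human wrist position in the arm's frame
    q1, q2 = retarget(wrist_target)
    print("joint angles:", q1, q2)
    print("reached:", forward_kinematics(q1, q2))
```

Because every step from joint angles to end-effector error is differentiable, the same gradient could in principle be passed further upstream into a policy network, which is the property the paragraph above attributes to the paper's approach.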
Raw Excerpt
We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision.