Humanoid Policy ~ Human Policy

This note was generated by AI analysis.

Created: 2026-03-28 Source: https://human-as-robot.github.io/

Summary

Qiu et al. (2025, arXiv:2503.13441) address the data bottleneck in humanoid robot learning by using egocentric human demonstrations as cross-embodiment training data. They collect PH2D, a task-oriented egocentric dataset, and train HAT (Human Action Transformer), a policy whose state-action space is shared by humans and humanoid robots and can be differentiably retargeted to robot actions.

Prerequisites

Imitation learning / behavior cloning
Humanoid robot manipulation (teleoperation, embodiment gap)
Transformer-based policy architectures
Cross-embodiment transfer / sim-to-real

Core Idea

Teleoperated robot data is expensive and hard to scale. Human egocentric video is cheap and abundant but suffers from an “embodiment gap” (different kinematics, perspective, interaction physics). The paper addresses this gap from two angles:

Data: PH2D is a task-oriented egocentric dataset aligned directly with humanoid manipulation demonstrations rather than random daily activities, which reduces the kinematic mismatch between human and robot data
Model: HAT uses a unified state-action space for both humans and humanoids. A differentiable retargeting module maps human hand/body poses to robot joint angles, so the policy can be co-trained without separate human and robot decoders
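
To make "differentiable retargeting" concrete, here is a toy sketch, not the paper's implementation. It assumes a hypothetical planar two-joint robot finger (the link lengths, learning rate, and function names are all illustrative) and finds joint angles that place the fingertip at a target position taken from a human hand pose. Because the forward kinematics is differentiable, the fit can be done by plain gradient descent, which is the property that lets a retargeting step sit inside an end-to-end training loop.

```python
import math

# Hypothetical link lengths (metres) of a planar two-joint robot finger.
LINK1, LINK2 = 0.05, 0.04


def forward_kinematics(q1, q2):
    """Fingertip (x, y) of the two-link finger for joint angles in radians."""
    x = LINK1 * math.cos(q1) + LINK2 * math.cos(q1 + q2)
    y = LINK1 * math.sin(q1) + LINK2 * math.sin(q1 + q2)
    return x, y


def retarget(target, q_init=(0.3, 0.3), lr=50.0, steps=3000):
    """Minimise 0.5 * ||fk(q) - target||^2 by gradient descent.

    The gradient uses the analytic Jacobian of the forward kinematics;
    lr and steps are illustrative values tuned for this toy geometry.
    """
    q1, q2 = q_init
    tx, ty = target
    for _ in range(steps):
        x, y = forward_kinematics(q1, q2)
        ex, ey = x - tx, y - ty  # fingertip position error
        # Jacobian of the fingertip position with respect to (q1, q2).
        dx_dq1 = -LINK1 * math.sin(q1) - LINK2 * math.sin(q1 + q2)
        dx_dq2 = -LINK2 * math.sin(q1 + q2)
        dy_dq1 = LINK1 * math.cos(q1) + LINK2 * math.cos(q1 + q2)
        dy_dq2 = LINK2 * math.cos(q1 + q2)
        # Chain rule: the loss gradient is J^T times the error.
        q1 -= lr * (dx_dq1 * ex + dy_dq1 * ey)
        q2 -= lr * (dx_dq2 * ex + dy_dq2 * ey)
    return q1, q2


if __name__ == "__main__":
    # Pretend this fingertip position came from a human hand-pose estimate.
    human_fingertip = forward_kinematics(0.6, 0.9)
    q = retarget(human_fingertip)
    print("joint angles:", q, "reached:", forward_kinematics(*q))
```

A real system would retarget full hand and wrist poses to a specific robot hand and would typically rely on an automatic-differentiation framework rather than a hand-written Jacobian; the sketch only shows why differentiability matters.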

HAT is co-trained on large-scale human data + smaller-scale robot teleoperation data, treating humans and robots as different “embodiments” within the same framework.
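
The co-training setup can be pictured as drawing each training batch from both data sources while keeping every sample in one shared record format. The sketch below is an assumption-laden illustration: the `Sample` fields, the `cotraining_batches` helper, and the fixed `robot_fraction` are hypothetical, and the source does not state the mixing ratio or sampling scheme that HAT actually uses.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    embodiment: str  # "human" or "robot"
    state: list      # shared state layout (hypothetical)
    action: list     # shared action layout (hypothetical)


def cotraining_batches(human, robot, batch_size, robot_fraction, num_batches, seed=0):
    """Yield mixed batches with a fixed share of robot samples.

    Samples are drawn with replacement, so the smaller robot set is
    revisited more often than the larger human set.
    """
    rng = random.Random(seed)
    n_robot = round(batch_size * robot_fraction)
    n_human = batch_size - n_robot
    for _ in range(num_batches):
        batch = rng.choices(robot, k=n_robot) + rng.choices(human, k=n_human)
        rng.shuffle(batch)  # avoid a fixed robot-then-human ordering
        yield batch


if __name__ == "__main__":
    human = [Sample("human", [float(i)], [0.0]) for i in range(100)]
    robot = [Sample("robot", [float(i)], [1.0]) for i in range(10)]
    for batch in cotraining_batches(human, robot, batch_size=8,
                                    robot_fraction=0.25, num_batches=2):
        print([s.embodiment for s in batch])
```

Because both embodiments share the same `state` and `action` layout, a single policy head can consume every sample in the batch; the `embodiment` tag is kept only for bookkeeping in this sketch.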

Results

Human data improves both generalization (new tasks/objects) and robustness (perturbation recovery) of the manipulation policy
Significantly better data collection efficiency vs. pure teleoperation
Demonstrated on real humanoid robot platforms

Limitations

Author-stated:

Embodiment gap cannot be fully closed; some tasks may require robot-specific fine-tuning
PH2D dataset collection is still task-directed, not passively harvested

Unstated:

Differentiable retargeting quality depends on anatomical similarity between human and robot morphology, so the approach may not generalize to non-humanoid robots
Evaluation scope limited to tabletop manipulation; locomotion and whole-body tasks not addressed

Reproducibility

Code/Data: Project page at human-as-robot.github.io (dataset and code availability not fully specified in abstract)
Compute: Standard robotics lab training scale

Insights

The key contribution is demonstrating that task-aligned egocentric human video can meaningfully reduce the robot data requirement for humanoid manipulation, without requiring expensive motion capture. The differentiable retargeting approach is elegant: it treats the human-to-robot mapping as a differentiable function rather than a hard domain gap, which enables end-to-end learning. This work is part of a broader trend (alongside AnyDex, UMI, etc.) of treating cheap human video as a scalable source of robot training signal.

Connections

Raw Excerpt

We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision.
