Humanoid Policy ~ Human Policy (PH2D + HAT)

This note was generated by AI analysis.

Created: 2026-03-26 Source: https://arxiv.org/abs/2503.13441

Summary

This paper treats egocentric human demonstrations as cross-embodiment robot training data, removing the need for robot hardware during collection. The PH2D dataset (50k+ frames) is collected with an Apple Vision Pro or Meta Quest 3, and the Human Action Transformer (HAT) uses a unified 54-dim state space shared by humans and robots. Adding human data raises out-of-distribution task success from 59/170 to 101/170 (a 71% relative improvement), and collection is roughly 5x faster than robot teleoperation (4.09 s vs 19.72 s per task).


Prerequisites

Cross-embodiment learning: understanding how policies can transfer between agents with different morphologies is foundational for interpreting HAT’s design choices
DINOv2 / ViT encoders: the visual backbone is a frozen DINOv2 ViT-S; knowing that this is a vision transformer pretrained on large-scale image data helps explain why a frozen encoder plus color jitter augmentation can bridge the human-robot appearance gap
Inverse kinematics (IK): human hand poses must be retargeted to robot joint angles; IK is how this mapping is computed

Core Idea

The key insight is that humans and robots performing manipulation tasks share enough structure in their hand poses and wrist trajectories that a unified 54-dimensional state space (6D rotations of head/wrists + 3D coordinates of wrists/fingertips) is sufficient for both. Rather than needing separate affordance representations, HAT learns from mixed human+robot data in the same format. The embodiment gap is handled practically: collectors sit upright (no whole-body motion), actions are time-stretched 4x to match robot speed, and visual augmentation handles camera appearance differences. The result is that the gap narrows enough for the cross-embodiment signal to be beneficial.
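The dimension count and the time-stretching step described above can be sketched in a few lines. This is an illustration only, not the authors' code: the component ordering, the choice of five fingertips per hand (which makes the total come to 54), the helper names `build_state` and `time_stretch`, and the use of linear interpolation to slow a trajectory are all assumptions.

```python
from typing import Dict, List, Sequence

# Assumed layout of the unified state; the ordering is hypothetical.
ROT_KEYS = ("head", "left_wrist", "right_wrist")  # one 6D rotation each
WRIST_KEYS = ("left_wrist", "right_wrist")        # one 3D position each
FINGERTIPS_PER_HAND = 5                           # assumption, not stated in the note
STATE_DIM = (6 * len(ROT_KEYS)                    # 18
             + 3 * len(WRIST_KEYS)                # 6
             + 3 * 2 * FINGERTIPS_PER_HAND)       # 30, total 54


def build_state(rot6d: Dict[str, Sequence[float]],
                wrist_xyz: Dict[str, Sequence[float]],
                fingertip_xyz: Dict[str, Sequence[Sequence[float]]]) -> List[float]:
    """Flatten one frame (human or robot) into a single state vector."""
    state: List[float] = []
    for key in ROT_KEYS:
        assert len(rot6d[key]) == 6
        state.extend(rot6d[key])
    for key in WRIST_KEYS:
        assert len(wrist_xyz[key]) == 3
        state.extend(wrist_xyz[key])
    for hand in ("left", "right"):
        tips = fingertip_xyz[hand]
        assert len(tips) == FINGERTIPS_PER_HAND
        for tip in tips:
            assert len(tip) == 3
            state.extend(tip)
    assert len(state) == STATE_DIM
    return state


def time_stretch(traj: Sequence[Sequence[float]], factor: int = 4) -> List[List[float]]:
    """Slow a trajectory down by inserting linearly interpolated frames.

    Linear interpolation is an assumed stand-in; interpolated 6D rotation
    entries would need re-orthonormalizing before being used as rotations.
    """
    if len(traj) < 2:
        return [list(frame) for frame in traj]
    out: List[List[float]] = []
    for a, b in zip(traj, traj[1:]):
        for k in range(factor):
            t = k / factor
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(list(traj[-1]))
    return out
```

With this layout, a human frame and a robot frame produce vectors of the same length, which is what lets a single policy consume mixed data. A trajectory of N frames becomes (N - 1) * factor + 1 frames after stretching.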

Results

Setting	With Human Data	Without Human Data	Relative Improvement
In-Distribution tasks	49/60	45/60	+9%
Out-Of-Distribution tasks	101/170	59/170	+71%
Data collection speed	4.09s/task (VR)	19.72s/task (teleop)	~5x faster

Few-shot: 20 robot demos + human data significantly outperforms robot-only baseline.
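The percentages in the table appear to be relative improvements in success counts rather than percentage-point gains. A quick check of that reading, using only the numbers in the table (the function names are illustrative):

```python
def relative_improvement(with_successes: int, without_successes: int) -> float:
    """Relative gain in success count; trial counts cancel when they are equal."""
    return (with_successes - without_successes) / without_successes


def percentage_point_gain(with_successes: int, without_successes: int, trials: int) -> float:
    """Absolute gain in success rate, in percentage points."""
    return 100.0 * (with_successes - without_successes) / trials


in_dist_rel = relative_improvement(49, 45)       # about 0.089, reported as +9%
ood_rel = relative_improvement(101, 59)          # about 0.712, reported as +71%
in_dist_pp = percentage_point_gain(49, 45, 60)   # about 6.7 points
ood_pp = percentage_point_gain(101, 59, 170)     # about 24.7 points
speedup = 19.72 / 4.09                           # about 4.8, reported as ~5x

print(f"ID:  +{in_dist_rel:.1%} relative, +{in_dist_pp:.1f} points")
print(f"OOD: +{ood_rel:.1%} relative, +{ood_pp:.1f} points")
print(f"Collection speedup: {speedup:.1f}x")
```

In absolute terms the OOD success rate moves from about 35% to about 59%, so the gain is large under either reading.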

Limitations

Author-stated: embodiment gap mitigations (upright posture, time-stretching) impose constraints on how demonstrators must behave
Author-stated: IK retargeting introduces errors when human hand configurations fall outside the robot’s reachable joint space
Unstated: the 71% OOD improvement is on controlled variations (backgrounds, appearances, spatial arrangements); robustness to truly novel tasks or objects is untested
Unstated: the ~$700 Quest 3 setup is a cost floor that may still be prohibitive for some labs, and Apple Vision Pro is significantly more expensive
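To make the IK retargeting limitation concrete, here is a toy planar two-link arm, not the paper's retargeting pipeline. The link lengths, the clamping strategy, and the function names are assumptions chosen for illustration. When the target lies outside the reachable annulus, the solver projects it onto the nearest reachable point and reports the residual position error.

```python
import math
from typing import Tuple


def two_link_ik(x: float, y: float, l1: float, l2: float) -> Tuple[float, float, float]:
    """Return (shoulder, elbow, position_error) for a planar two-link arm.

    Unreachable targets are projected radially onto the reachable annulus,
    and position_error is the distance between the target and that projection.
    """
    d = math.hypot(x, y)
    d_min, d_max = abs(l1 - l2), l1 + l2
    d_clamped = min(max(d, d_min), d_max)
    if d > 0:
        scale = d_clamped / d
        tx, ty = x * scale, y * scale
    else:
        tx, ty = d_clamped, 0.0  # target at the origin: pick an arbitrary direction
    cos_elbow = (d_clamped ** 2 - l1 ** 2 - l2 ** 2) / (2 * l1 * l2)
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # guard against rounding
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(ty, tx) - math.atan2(l2 * math.sin(elbow),
                                               l1 + l2 * math.cos(elbow))
    error = math.hypot(x - tx, y - ty)
    return shoulder, elbow, error


def forward(shoulder: float, elbow: float, l1: float, l2: float) -> Tuple[float, float]:
    """Forward kinematics for the same arm, used to verify the IK solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


# Reachable target: the IK solution reproduces it with no residual error.
s, e, err = two_link_ik(0.3, 0.2, 0.3, 0.3)
print(forward(s, e, 0.3, 0.3), err)

# Target beyond the arm's 0.6 m reach: the arm fully extends and an error remains.
s, e, err = two_link_ik(1.0, 0.0, 0.3, 0.3)
print(forward(s, e, 0.3, 0.3), err)
```

A human hand pose outside the robot's joint limits behaves like the second call: the retargeted pose is the closest feasible one, and the residual becomes label noise in the training data.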

Reproducibility

Code: project page at https://human-as-robot.github.io/ (code availability not confirmed)
Datasets: PH2D dataset (50k+ frames); standard bimanual manipulation benchmarks for evaluation
Compute: inference and training details are not specified in the clipping; training likely needs only standard GPU compute, since the visual backbone is a frozen DINOv2 ViT-S

Insights

The most important result is the asymmetry: human data helps OOD generalization far more than in-distribution performance (+71% vs +9% relative). This makes intuitive sense: the value of human data lies in its diversity (different backgrounds, lighting, and viewpoints from natural collection), not in adding more samples from the exact target distribution. Human data is therefore probably best thought of as a regularizer against overfitting to robot lab conditions, rather than a substitute for high-quality robot demonstrations.

Connections

Raw Excerpt

Human data improves OOD generalization (novel backgrounds, object appearances, spatial arrangements) far more than in-distribution performance. Treating humans and robots as different embodiments in a unified framework is sufficient — no separate affordance representations needed.
