EgoMimic: Scaling Imitation Learning via Egocentric Video

This note was generated by AI analysis.

Created: 2026-04-05 Source: https://arxiv.org/abs/2410.24221

Summary

EgoMimic scales imitation learning by treating human egocentric video paired with 3D hand tracking as demonstration data on equal footing with robot demonstrations. Human data is collected with Project Aria glasses, and a co-training architecture learns jointly from human and robot data. The paper reports that adding 1 hour of hand data is significantly more valuable than adding 1 hour of robot data, that the method improves over state-of-the-art imitation learning baselines on diverse long-horizon manipulation tasks, and that it generalizes to new scenes.

Prerequisites

Egocentric vision — the observation space is a first-person view from glasses-mounted cameras; standard robotics third-person-view assumptions don’t apply
Cross-domain co-training — training on human and robot data jointly requires careful domain alignment; batch mixing strategy and architecture sharing choices matter
3D hand pose estimation — Project Aria provides accurate 3D joint tracking; understanding the quality limits of this tracking helps assess where the approach breaks down
Embodiment gap — human hand joint kinematics differ from robot gripper kinematics; the alignment procedure is the technical crux
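
The batch mixing mentioned under cross-domain co-training can be sketched as follows. This is a minimal illustration, not the paper's data loader: the function name `mixed_batches`, the `human_fraction` parameter, and the `("human" | "robot", sample)` tagging scheme are all assumptions made for this example.

```python
import random


def mixed_batches(human, robot, batch_size, human_fraction, steps, seed=0):
    """Yield batches that mix human and robot samples at a fixed ratio.

    Hypothetical helper: `human` and `robot` are lists of samples. Each
    yielded batch is a list of (domain, sample) pairs, so a model could
    route samples to domain-specific heads or losses.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_human = round(batch_size * human_fraction)
    n_robot = batch_size - n_human
    for _ in range(steps):
        # Sample with replacement so the smaller dataset is never exhausted.
        batch = [("human", rng.choice(human)) for _ in range(n_human)]
        batch += [("robot", rng.choice(robot)) for _ in range(n_robot)]
        rng.shuffle(batch)
        yield batch


if __name__ == "__main__":
    human_clips = ["h1", "h2", "h3"]
    robot_demos = ["r1", "r2"]
    for batch in mixed_batches(human_clips, robot_demos, 8, 0.75, steps=2):
        print(batch)
```

Sampling with replacement keeps the ratio fixed even when one domain has far fewer samples than the other, which is the usual situation when human data is cheaper to collect than robot data.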

Core Idea

The core insight is that the action space for manipulation (moving an end-effector through 3D space to manipulate objects) is fundamentally the same for humans and robots, even if the exact kinematics differ. By aligning the observation spaces (egocentric + depth), retargeting hand joints to robot end-effector poses, and co-training on both modalities, the policy can use the much larger and easier-to-collect human data to learn manipulation priors, then refine with robot data for precise motor control. The finding that an hour of human data helps more than an hour of robot data suggests that human demonstrations capture richer task structure (semantics, sequencing, object interactions) that robot-only IL struggles to learn efficiently.
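
To make the retargeting idea concrete, here is a rough sketch under stated assumptions. It is not the alignment procedure from the paper. It assumes hand keypoints arrive as 3D points in the camera frame (in meters), that a camera-to-robot rigid transform `T_robot_from_cam` is already known from calibration, and that gripper opening can be approximated by the thumb-to-index distance. The key names, `max_aperture`, and the per-domain z-score normalizer are all hypothetical choices for illustration.

```python
import math


def apply_rigid_transform(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))


def retarget(hand, T_robot_from_cam, max_aperture=0.10):
    """Map hand keypoints (camera frame) to a hypothetical gripper target.

    Returns (position_in_robot_frame, gripper_open), gripper_open in [0, 1].
    """
    position = apply_rigid_transform(T_robot_from_cam, hand["wrist"])
    # Assumption: thumb-index distance stands in for gripper opening.
    aperture = math.dist(hand["thumb_tip"], hand["index_tip"])
    gripper_open = min(max(aperture / max_aperture, 0.0), 1.0)
    return position, gripper_open


def zscore_normalizer(actions):
    """Fit a per-dimension z-score on one domain's actions and return it.

    Fitting one normalizer per domain (human, robot) is one simple way to
    bring differently scaled action distributions onto a common range.
    """
    n, dims = len(actions), len(actions[0])
    mean = [sum(a[d] for a in actions) / n for d in range(dims)]
    std = [
        math.sqrt(sum((a[d] - mean[d]) ** 2 for a in actions) / n) or 1.0
        for d in range(dims)
    ]
    return lambda a: tuple((a[d] - mean[d]) / std[d] for d in range(dims))


if __name__ == "__main__":
    # Pure translation of 1 m along x, standing in for a calibrated extrinsic.
    T = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    hand = {
        "wrist": (0.0, 0.2, 0.5),
        "thumb_tip": (0.0, 0.0, 0.0),
        "index_tip": (0.05, 0.0, 0.0),
    }
    print(retarget(hand, T))
```

A real pipeline would also need to handle orientation, tracking noise, and the kinematic differences listed under Limitations; this sketch covers only position and a scalar gripper command.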

Results

Metric	EgoMimic	Prior IL SOTA	Notes
Long-horizon manipulation tasks	Significantly better	Baseline	Multiple tasks
Data efficiency: 1hr human vs 1hr robot	Human data more valuable	—	Key finding
New scene generalization	Achieved	Limited	Qualitative improvement

Note: exact numbers depend on the specific task; the paper reports improvements across a diverse task suite.

Limitations

Author-stated: kinematic gap not fully eliminated; fine-grained contact tasks still challenging
Author-stated: requires calibration between Aria glasses and robot camera at deployment time
Unstated: Project Aria is a research device not commercially available to consumers — practical accessibility limited
Unstated: the “low-cost bimanual manipulator” is custom-designed; adapting EgoMimic to off-the-shelf robots requires redesigning the kinematic alignment
Unstated: human-data scaling benefit likely diminishes for tasks requiring precise robot-specific skills (e.g., contact force control)

Reproducibility

Code: available (Georgia Tech)
Hardware: Project Aria glasses (research device), custom bimanual manipulator
Compute: co-training requires GPU; comparable to standard IL training

Insights

EgoMimic inverts the data collection cost structure: instead of relying on expensive robot teleoperation, researchers can collect data by performing tasks naturally while wearing glasses. The scaling finding, that an hour of human data is worth more than an hour of robot data, has significant implications. If it holds generally, the field should invest more in human data collection infrastructure (better headsets, easier calibration) than in making robot teleoperation faster.

The approach connects to the broader trend of treating humans as a general-purpose data source for embodied AI: Open X-Embodiment treats robot data from diverse robots as equivalent; EgoMimic takes this one step further by treating human data as equivalent. The natural end state of this trend is VLA training on internet-scale human video.

Connections

Raw Excerpt

“Adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data.”
