Data Scaling Laws in Imitation Learning for Robotic Manipulation

This note was generated by AI analysis.

Created: 2026-04-05 Source: https://arxiv.org/abs/2410.18647

Summary

This paper (Tsinghua / Shanghai Qi Zhi, ICLR 2025 Oral) conducts the first systematic empirical study of data scaling laws in robot imitation learning. By collecting 40,000+ UMI demonstrations across diverse real environments and objects, the authors show that generalization follows a power law in environment and object diversity, not in raw demonstration count. The key finding: the diversity of environments and objects matters far more than the number of demonstrations per setting.

Prerequisites

Imitation Learning / Behavior Cloning (BC) — the paper trains BC policies (Diffusion Policy); understanding the training-deployment distribution shift explains why environment diversity matters
Scaling laws (NLP background) — the paper explicitly extends the power-law framework of Kaplan et al. from LLMs to robotics; the analogy helps interpret the formulation
Universal Manipulation Interface (UMI) — all data were collected with the UMI hand-held gripper; understanding UMI’s portability and SLAM-based action recording explains the 90% valid demo rate
Diffusion Policy — used as the policy learning backbone throughout; understanding its generalization properties contextualizes the scaling results

Core Idea

The central question is whether data scaling laws analogous to those for LLMs exist in robot imitation learning. The key insight is that “diversity is all you need”: once a minimum threshold of demonstrations per environment (~50) is reached, adding more demonstrations in the same environment yields diminishing returns. Instead, expanding to more environments and more object instances produces consistent power-law gains in generalization. This reframes the data collection problem: instead of asking “how many demonstrations per task?”, the question becomes “how many environments and objects can we cover?”. UMI’s portability makes this feasible because collectors can move across locations with minimal setup overhead.
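The reframing above can be made concrete with a small budgeting sketch. It assumes the ~50 demonstrations-per-environment threshold quoted in this note and a fixed total budget; the function name `plan_collection` and the 1,600-demonstration budget are hypothetical illustrations, not anything taken from the paper.

```python
def plan_collection(total_demos, demos_per_env=50):
    """Split a fixed demonstration budget across as many environments as possible.

    demos_per_env defaults to the ~50 threshold quoted in this note (an
    assumption for this sketch); demonstrations beyond it are treated as
    better spent on a new environment than on an existing one.
    """
    if total_demos < 0 or demos_per_env <= 0:
        raise ValueError("total_demos must be >= 0 and demos_per_env must be > 0")
    n_envs, leftover = divmod(total_demos, demos_per_env)
    return {
        "environments": n_envs,        # environments that reach the threshold
        "demos_per_env": demos_per_env,
        "leftover_demos": leftover,    # too few to fill another environment
    }


# Hypothetical budget, chosen only for illustration.
plan = plan_collection(1600)
print(plan)
```

With this hypothetical budget the plan covers 32 environments at 50 demonstrations each, rather than concentrating the same 1,600 demonstrations in a handful of settings.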

Results

Finding	Result
Relationship type	Power-law: generalization score ∝ N_environments^α, N_objects^β
Demonstrations threshold	50 demos/environment is sufficient; adding more has minimal effect
Practical data collection	4 collectors × 1 afternoon → ~90% zero-shot success on new envs+objects
Visual encoder scaling	ViT-S → ViT-B → ViT-L: 0 → 0.81 → 0.90 (consistent improvement)
Diffusion model scaling	No improvement: small/base/large U-Net all ~0.88–0.90
Tasks validated	Pour Water, Mouse Arrangement (derivation); Fold Towels, Unplug Charger (validation)
Total experiments	40,000+ demos, 15,000+ real-world rollouts
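
A power-law relationship like the one in the first row is usually checked by fitting a straight line in log-log space. The sketch below shows that generic procedure in plain Python; it is not the authors' fitting code, the helper `fit_power_law` is a hypothetical name, and the data points are synthetic placeholders generated from a known exponent, not results from the paper.

```python
import math


def fit_power_law(n_values, y_values):
    """Fit y ≈ c * n**alpha by ordinary least squares on (log n, log y)."""
    if len(n_values) != len(y_values) or len(n_values) < 2:
        raise ValueError("need at least two (n, y) pairs of equal length")
    xs = [math.log(n) for n in n_values]  # math.log rejects non-positive input
    ys = [math.log(y) for y in y_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("n_values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                         # slope in log-log space
    c = math.exp(mean_y - alpha * mean_x)     # intercept mapped back to linear scale
    return alpha, c


# Synthetic placeholder data (NOT from the paper): y = 0.2 * n**0.4.
n_envs = [1, 2, 4, 8, 16, 32]
scores = [0.2 * n ** 0.4 for n in n_envs]

alpha, c = fit_power_law(n_envs, scores)
print(f"alpha={alpha:.3f}, c={c:.3f}")
```

On real rollout data the points will not lie exactly on a line, so the quality of the fit (for example, the correlation in log-log space) matters as much as the fitted exponent.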

Limitations

Author-stated: only studies single-task generalization; task-level generalization requires thousands of tasks (out of scope)
Author-stated: only studies IL; RL may enhance capabilities further
Author-stated: UMI introduces inherent small trajectory errors; unclear how data quality affects scaling laws
Author-stated: validated on only 4 tasks; broader validation needed
Unstated: UMI restricted to parallel-jaw grippers — scaling laws may differ for dexterous hand or bimanual tasks requiring different hardware
Unstated: SLAM failure rate ~10% means noisy demonstrations exist in training data; unclear if cleaning would shift the scaling curves
Unstated: results from in-the-wild real environments — simulation-only collection may not exhibit the same scaling behavior due to lower visual diversity

Reproducibility

Code/data/models: released at data-scaling-laws.github.io
Hardware: UMI (hand-held parallel-jaw gripper + GoPro) + Diffusion Policy inference
Compute: ViT-L backbone requires significant GPU for full fine-tuning; exact training hours not reported

Insights

The paper delivers a practically actionable conclusion: don’t obsess over demo count per environment; obsess over covering more environments. This inverts the common intuition that more demonstrations per task = better policy.

The model scaling asymmetry is striking: scaling the visual encoder helps (ViT-S → ViT-L gives a clear boost), but scaling the action diffusion network does not. This suggests the bottleneck in robot IL is perceptual representation, not action modeling capacity. Full fine-tuning of DINOv2 is essential; frozen features and LoRA both fail.

The 90% success rate achievable in a single afternoon with 4 collectors is a remarkable practical result. It suggests that with proper data strategy, small teams can build deployable policies without months of data collection.

Connection to the UMI vs VR vs MoCap question: This paper doesn’t compare data collection methods, but it uses UMI exclusively and achieves strong results. The implicit message is that for tasks compatible with parallel-jaw grippers in diverse environments, UMI’s portability is the enabling factor for diversity-driven scaling — a VR setup requiring a fixed robot would make collecting 32+ environments prohibitively expensive.

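Diversity matters more than quantity. This conclusion points in the same direction as the findings of Re-Mix (maximizing domain diversity) and MimicLabs (camera-pose and spatial diversity) discussed in 2510.10903 §7.1.2; from different angles, they support the same argument: the bottleneck in robot learning is coverage, not repetition.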

Connections

Raw Excerpt

“The diversity of environments and objects is far more important than the absolute number of demonstrations; once the number of demonstrations per environment or object reaches a certain threshold, additional demonstrations have minimal effect.”
