Summary

This paper (Tsinghua University / Shanghai Qi Zhi Institute, ICLR 2025 Oral) conducts the first systematic empirical study of data scaling laws in robot imitation learning. By collecting 40,000+ UMI demonstrations across diverse real environments and objects, the authors show that generalization follows a power law in the number of training environments and objects. The key finding: the diversity of environments and objects matters far more than the number of demonstrations per environment.

Prerequisites

  • Imitation Learning / Behavior Cloning (BC): the paper trains BC policies (Diffusion Policy); understanding the training-deployment distribution shift explains why environment diversity matters
  • Scaling laws (NLP background): the paper explicitly extends the power-law framework of Kaplan et al. from LLMs to robotics; the analogy helps interpret the formulation
  • Universal Manipulation Interface (UMI): all data were collected with the UMI hand-held gripper; understanding UMI’s portability and SLAM-based action recording explains the 90% valid demo rate
  • Diffusion Policy: used as the policy learning backbone throughout; understanding its generalization properties contextualizes the scaling results

Core Idea

The central question is whether data scaling laws analogous to those for LLMs exist in robot imitation learning. The key insight is that “diversity is all you need”: once a minimum threshold of demonstrations per environment (~50) is reached, adding more demonstrations in the same environment yields diminishing returns. Instead, expanding to more environments and more object instances produces consistent power-law gains in generalization. This reframes the data collection problem: instead of asking “how many demonstrations per task?”, the question becomes “how many environments and objects can we cover?” UMI’s portability makes this feasible because collectors can move across locations with minimal setup overhead.
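The reframing above can be made concrete with a small budgeting sketch. It is only an illustration: the function names `envs_covered` and `plan_collection` are hypothetical, the 50-demo threshold and the ~90% valid rate are taken from this summary, and the 1,600-demo budget is an arbitrary example value rather than a figure from the paper.

```python
import math


def envs_covered(total_valid_demos, demos_per_env):
    """Number of environments a fixed demo budget can cover."""
    if demos_per_env <= 0:
        raise ValueError("demos_per_env must be positive")
    return total_valid_demos // demos_per_env


def plan_collection(target_envs, demos_per_env=50, valid_rate=0.9):
    """Estimate how many raw demos to record for a diversity-first plan.

    demos_per_env: valid demos wanted per environment (~50 in this summary).
    valid_rate: assumed fraction of recordings that remain usable (~90%).
    """
    if not 0 < valid_rate <= 1:
        raise ValueError("valid_rate must be in (0, 1]")
    # Record extra demos per environment to absorb discarded recordings.
    raw_per_env = math.ceil(demos_per_env / valid_rate)
    return {
        "raw_per_env": raw_per_env,
        "raw_total": raw_per_env * target_envs,
        "valid_total": demos_per_env * target_envs,
    }


if __name__ == "__main__":
    budget = 1600  # arbitrary example budget of valid demos
    print("wide plan (50 per env):", envs_covered(budget, 50), "environments")
    print("deep plan (200 per env):", envs_covered(budget, 200), "environments")
    print(plan_collection(target_envs=32))
```

With the same budget, the wide plan reaches four times as many environments as the deep plan, which is the trade-off the paper argues should be resolved in favor of coverage.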

Results

| Finding | Result |
| --- | --- |
| Relationship type | Power-law: generalization score ∝ N_environments^α, N_objects^β |
| Demonstrations threshold | 50 demos/environment is sufficient; adding more has minimal effect |
| Practical data collection | 4 collectors × 1 afternoon → ~90% zero-shot success on new envs+objects |
| Visual encoder scaling | ViT-S → ViT-B → ViT-L: 0 → 0.81 → 0.90 (consistent improvement) |
| Diffusion model scaling | No improvement: small/base/large U-Net all ~0.88–0.90 |
| Tasks validated | Pour Water, Mouse Arrangement (derivation); Fold Towels, Unplug Charger (validation) |
| Total experiments | 40,000+ demos, 15,000+ real-world rollouts |
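A power-law relationship such as the one listed in the results above is usually checked by fitting a straight line in log-log space. The sketch below shows that generic procedure using only the standard library. It is not the authors' analysis code: the function name `fit_power_law` is hypothetical, the data points are synthetic and noise-free, and which metric is fitted (for example a score, or its gap from the maximum score) is left as an assumption.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = beta * x**alpha in log-log space.

    Returns (alpha, beta, r), where r is the Pearson correlation
    between log(x) and log(y). All inputs must be positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    syy = sum((b - mean_y) ** 2 for b in log_y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    alpha = sxy / sxx                         # slope = power-law exponent
    beta = math.exp(mean_y - alpha * mean_x)  # intercept = log(beta)
    r = sxy / math.sqrt(sxx * syy)
    return alpha, beta, r


# Synthetic example: a hypothetical metric that decays as 0.8 * N**-0.5.
num_envs = [1, 2, 4, 8, 16, 32]
metric = [0.8 * n ** -0.5 for n in num_envs]
alpha, beta, r = fit_power_law(num_envs, metric)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, r={r:.3f}")
```

On real rollout data the points would be noisy, so the magnitude of `r` gives a rough sense of how well a power law describes the trend.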

Limitations

  • Author-stated: only studies single-task generalization; task-level generalization requires thousands of tasks (out of scope)
  • Author-stated: only studies IL; RL may enhance capabilities further
  • Author-stated: UMI introduces inherent small trajectory errors; unclear how data quality affects scaling laws
  • Author-stated: validated on only 4 tasks; broader validation needed
  • Unstated: UMI restricted to parallel-jaw grippers — scaling laws may differ for dexterous hand or bimanual tasks requiring different hardware
  • Unstated: SLAM failure rate ~10% means noisy demonstrations exist in training data; unclear if cleaning would shift the scaling curves
  • Unstated: results from in-the-wild real environments — simulation-only collection may not exhibit the same scaling behavior due to lower visual diversity

Reproducibility

  • Code/data/models: released at data-scaling-laws.github.io
  • Hardware: UMI (hand-held parallel-jaw gripper + GoPro) + Diffusion Policy inference
  • Compute: ViT-L backbone requires significant GPU for full fine-tuning; exact training hours not reported

Insights

The paper delivers a practically actionable conclusion: don’t obsess over demo count per environment; obsess over covering more environments. This inverts the common intuition that more demonstrations per task = better policy.

The model scaling asymmetry is striking: scaling the visual encoder helps (ViT-S → ViT-L gives a clear boost), but scaling the action diffusion network does not. This suggests the bottleneck in robot IL is perceptual representation, not action modeling capacity. Full fine-tuning of DINOv2 is essential: both frozen features and LoRA fail.

The 90% success rate achievable in a single afternoon with 4 collectors is a remarkable practical result. It suggests that with proper data strategy, small teams can build deployable policies without months of data collection.

Connection to the UMI vs VR vs MoCap question: This paper doesn’t compare data collection methods, but it uses UMI exclusively and achieves strong results. The implicit message is that for tasks compatible with parallel-jaw grippers in diverse environments, UMI’s portability is the enabling factor for diversity-driven scaling — a VR setup requiring a fixed robot would make collecting 32+ environments prohibitively expensive.

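Diversity matters more than quantity. This conclusion points in the same direction as the findings on Re-Mix (maximizing domain diversity) and MimicLabs (camera pose and spatial diversity) discussed in 2510.10903 §7.1.2, which support the same thesis from different angles: the bottleneck in robot learning is coverage, not repetition.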

Connections

Raw Excerpt

“The diversity of environments and objects is far more important than the absolute number of demonstrations; once the number of demonstrations per environment or object reaches a certain threshold, additional demonstrations have minimal effect.”