This document was generated by AI analysis.
Created: 2026-03-27 Source: https://arxiv.org/abs/2603.04356
Summary
RoboCasa365 is a large-scale kitchen simulation benchmark for training and evaluating generalist household robots. It provides 365 tasks (65 atomic + 300 composite) across 2,500 kitchen scenes, paired with 2,000+ hours of demonstration data (612h human + 1,615h synthetic via MimicGen). Accepted to ICLR 2026. Key finding: foundation model pretraining on diverse tasks yields ~3x data efficiency improvement on downstream tasks.
Prerequisites
- Mobile manipulation — many tasks (220/365) require combined base navigation + arm manipulation; understanding this distinction is key to interpreting the results split
- Behavior cloning / imitation learning — all baselines are trained via supervised learning on demonstrations; BC is the baseline against which foundation models are compared
- MimicGen — the synthetic data generation pipeline used to scale from human demos to 1,615h of training data; knowing it auto-generates trajectories from a small seed set explains the “mixed-quality” concern
- π₀ / GR00T N1.5 — the foundation model baselines; familiarity with VLA-style models helps contextualize the 20-51% performance range
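To make the behavior cloning prerequisite concrete, here is a minimal sketch of BC as supervised regression from states to expert actions. It is a toy illustration only: the 1-D linear policy, the synthetic expert rule `a = 2s + 1`, and the name `fit_bc_policy` are assumptions made for this example, not anything taken from the paper or its baselines.

```python
def fit_bc_policy(demos, lr=0.1, epochs=500):
    """Fit a linear policy a = w * s + b to (state, action) demonstrations.

    Minimizes mean squared error between predicted and expert actions
    with plain gradient descent (a toy stand-in for a neural policy).
    """
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2.0 * (w * s + b - a) * s for s, a in demos) / n
        grad_b = sum(2.0 * (w * s + b - a) for s, a in demos) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic "expert" demonstrations: states in [-1, 1], actions a = 2s + 1.
demos = [(k / 10, 2.0 * (k / 10) + 1.0) for k in range(-10, 11)]

w, b = fit_bc_policy(demos)
print(f"learned policy: a = {w:.3f} * s + {b:.3f}")
```

The policy only ever sees what the demonstrations contain, which is why the quality and coverage of the demonstration set matter so much for the results discussed later.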
Core Idea
The core contribution is a benchmark that stresses compositional generalization — combining atomic skills in unseen sequences — which is the primary failure mode of current robot policies. The 365-task design is deliberate: 65 atomic tasks that cover the skill primitives, and 300 composite tasks that chain them in training-seen and training-unseen combinations. The benchmark tests three paradigms simultaneously (multi-task, foundation model pretraining, lifelong learning), allowing apples-to-apples comparison. The sim-to-real gap is addressed by training on 2,500 diverse kitchen scenes; the real-world result (79.8% vs 61.8% for sim+real vs real-only) validates the sim data’s utility.
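The split between atomic, composite-seen, and composite-unseen tasks can be sketched as a check on skill sequences. The skill names, the training chains, and the function `classify_chain` below are hypothetical placeholders chosen for illustration; they are not the benchmark's actual task identifiers or its splitting code.

```python
# Hypothetical atomic skill names for illustration; not the benchmark's identifiers.
ATOMIC_SKILLS = {"open_drawer", "pick_mug", "place_on_counter", "close_drawer"}

# Hypothetical composite chains that appear in the training set.
TRAIN_CHAINS = {
    ("open_drawer", "pick_mug", "close_drawer"),
    ("pick_mug", "place_on_counter"),
}


def classify_chain(chain, train_chains=TRAIN_CHAINS, atomic_skills=ATOMIC_SKILLS):
    """Label a skill sequence by how it relates to the training data."""
    chain = tuple(chain)
    # A chain that uses a skill outside the atomic set is not a compositional test.
    if not all(skill in atomic_skills for skill in chain):
        return "out-of-vocabulary"
    if len(chain) == 1:
        return "atomic"
    # Same known skills, but the exact sequence may or may not have been trained on.
    return "composite-seen" if chain in train_chains else "composite-unseen"


for chain in (
    ["pick_mug"],
    ["pick_mug", "place_on_counter"],
    ["open_drawer", "pick_mug", "place_on_counter"],
):
    print(chain, "->", classify_chain(chain))
```

The composite-unseen case is the interesting one: every individual skill is familiar to the policy, yet the ordering is new, which is the compositional generalization the benchmark is designed to stress.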
Results
| Method | Atomic | Composite-Seen | Composite-Unseen | Avg |
|---|---|---|---|---|
| Diffusion Policy | 15.7% | 0.2% | 1.25% | 6.1% |
| π₀ | 36.3% | 5.2% | 0.7% | 15.0% |
| π₀.₅ | 39.6% | 7.1% | 1.2% | 16.9% |
| GR00T N1.5 | 43.0% | 9.6% | 4.4% | 20.0% |
| GR00T N1.5 + pretraining + 100% target data | — | — | — | 51.1% |
Sim-to-real: Sim+real training → 79.8% vs real-only baseline → 61.8% (+18pp)
Pretraining data efficiency: ~3x improvement with foundation model pretraining vs. training from scratch on target data only.
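As a quick sanity check on the numbers above, the following sketch recomputes the sim-to-real gain and compares each reported Avg with the unweighted mean of the three split columns. The helper names are assumptions for this example. Note that the reported Avg values come out higher than the unweighted means, so the Avg column appears to use a weighting that this summary does not state.

```python
# Success rates (%) copied from the results table:
# (atomic, composite-seen, composite-unseen, reported avg).
RESULTS = {
    "Diffusion Policy": (15.7, 0.2, 1.25, 6.1),
    "π₀": (36.3, 5.2, 0.7, 15.0),
    "π₀.₅": (39.6, 7.1, 1.2, 16.9),
    "GR00T N1.5": (43.0, 9.6, 4.4, 20.0),
}


def unweighted_mean(values):
    """Plain arithmetic mean of the split-level success rates."""
    return sum(values) / len(values)


def gain(with_sim, real_only):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = with_sim - real_only
    relative = 100.0 * absolute / real_only
    return round(absolute, 1), round(relative, 1)


for method, (atomic, seen, unseen, reported) in RESULTS.items():
    simple = unweighted_mean((atomic, seen, unseen))
    print(f"{method}: unweighted mean {simple:.2f}% vs reported avg {reported:.1f}%")

abs_pp, rel_pct = gain(79.8, 61.8)
print(f"sim+real vs real-only: +{abs_pp} pp absolute, +{rel_pct}% relative")
```

The distinction between percentage points and relative improvement is worth keeping in mind: the sim-to-real gain is 18 points in absolute terms, which is a larger-looking figure when expressed relative to the real-only baseline.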
Limitations
- Author-stated: Limited to kitchen environments; generalization findings may not transfer to other household settings
- Author-stated: Simulation cannot capture full sensory and physical complexity of real-world deployment
- Author-stated: Sim-to-real gap “remains a significant challenge”
- Unstated: Even the best configuration (GR00T N1.5 with pretraining and 100% target data) reaches only 51.1% average, and composite-unseen tasks remain largely unsolved (the highest composite-unseen score in the results table is 4.4%)
- Unstated: MimicGen synthetic data showed “mixed results” — large mixed-quality datasets don’t reliably improve generalization, suggesting data quality matters more than quantity
Reproducibility
- Code/Data: Open-sourced at https://robocasa.ai; all trained models released
- Datasets: 612h human demos (55k episodes) + 1,615h MimicGen synthetic; 2,500 kitchen scenes
- Compute: Training details not specified in available content; GR00T N1.5 is a large foundation model requiring significant GPU resources
Insights
The most important negative result: even with 2,000+ hours of data and the best available foundation models, composite-unseen task performance remains near-floor (0.7% to 4.4% in the results table). This suggests that current approaches, even at scale, have not solved compositional generalization for robot manipulation. The 3x pretraining efficiency gain is encouraging, but the absolute numbers tell a more sobering story.
The sim-to-real result (+18pp) is practically significant: it means large-scale simulation data is genuinely useful for real-world deployment, not just benchmarking. This validates investing in simulation infrastructure for labs that lack extensive real-world data collection capacity.