Summary

RoboCasa365 is a large-scale kitchen simulation benchmark for training and evaluating generalist household robots. It provides 365 tasks (65 atomic + 300 composite) across 2,500 kitchen scenes, paired with 2,000+ hours of demonstration data (612h human + 1,615h synthetic via MimicGen). Accepted to ICLR 2026. Key finding: foundation model pretraining on diverse tasks yields ~3x data efficiency improvement on downstream tasks.

Prerequisites

  • Mobile manipulation — many tasks (220/365) require combined base navigation + arm manipulation; understanding this distinction is key to interpreting the results split
  • Behavior cloning / imitation learning — all baselines are trained via supervised learning on demonstrations; BC is the baseline against which foundation models are compared
  • MimicGen — the synthetic data generation pipeline used to scale from human demos to 1,615h of training data; knowing it auto-generates trajectories from a small seed set explains the “mixed-quality” concern
  • π₀ / GR00T N1.5 — the foundation model baselines; familiarity with VLA-style models helps contextualize the 20-51% performance range
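As a refresher on the behavior cloning bullet above, here is a minimal sketch of BC as plain supervised learning. The one-dimensional state, the linear policy, and the toy expert rule are all assumptions for illustration; none of this comes from the benchmark's data or baselines.

```python
# Toy demonstrations: (state, expert_action) pairs from a hypothetical expert
# whose rule is action = 2 * state + 1. This is NOT benchmark data.
demos = [(s, 2.0 * s + 1.0) for s in [0.0, 1.0, 2.0, 3.0, 4.0]]


def fit_linear_policy(pairs):
    """Closed-form least-squares fit of action = w * state + b."""
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_a = sum(a for _, a in pairs) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in pairs)
    var = sum((s - mean_s) ** 2 for s, _ in pairs)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


w, b = fit_linear_policy(demos)


def policy(state):
    """The cloned policy: imitates the expert by regression, with no reward signal."""
    return w * state + b
```

The point of the sketch is that the learner only ever sees demonstrated state-action pairs, which is why demonstration coverage and quality matter so much for the baselines discussed here.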

Core Idea

The core contribution is a benchmark that stresses compositional generalization (combining atomic skills in unseen sequences), which is the primary failure mode of current robot policies. The 365-task design is deliberate: 65 atomic tasks cover the skill primitives, and 300 composite tasks chain them in both training-seen and training-unseen combinations. The benchmark tests three paradigms side by side (multi-task learning, foundation model pretraining, and lifelong learning), allowing apples-to-apples comparison. The sim-to-real gap is tackled by training on 2,500 diverse kitchen scenes; the real-world result (79.8% for sim+real training vs 61.8% for real-only) supports the sim data’s utility.
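To make the seen/unseen distinction concrete, here is a minimal sketch of composite tasks as ordered chains of atomic skills, with some chains held out. The skill names, the chain length of two, and the split rule are hypothetical; the benchmark's actual task definitions and split procedure are not described in these notes.

```python
from itertools import permutations

# Hypothetical skill names, for illustration only (not the benchmark's task list).
ATOMIC_SKILLS = ["open_drawer", "pick_object", "place_object", "close_drawer"]


def composite_tasks(skills, length):
    """Enumerate ordered chains of distinct atomic skills."""
    return [tuple(chain) for chain in permutations(skills, length)]


def split_seen_unseen(tasks, n_seen):
    """Deterministic split: the first n_seen chains are training-seen, the rest held out."""
    return tasks[:n_seen], tasks[n_seen:]


all_chains = composite_tasks(ATOMIC_SKILLS, 2)  # 4 * 3 ordered pairs
seen, unseen = split_seen_unseen(all_chains, 8)

# Every skill used in an unseen chain also appears in some seen chain, so a
# failure on unseen chains reflects composition, not a missing primitive.
seen_skills = {skill for chain in seen for skill in chain}
unseen_uses_known_skills = all(
    skill in seen_skills for chain in unseen for skill in chain
)
```

Under this toy setup, a policy can have practiced every primitive and still face chains it has never executed, which is the situation the Composite-Unseen column is meant to measure.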

Results

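| Method | Atomic | Composite-Seen | Composite-Unseen | Avg |
|---|---|---|---|---|
| Diffusion Policy | 15.7% | 0.2% | 1.25% | 6.1% |
| π₀ | 36.3% | 5.2% | 0.7% | 15.0% |
| π₀.₅ | 39.6% | 7.1% | 1.2% | 16.9% |
| GR00T N1.5 | 43.0% | 9.6% | 4.4% | 20.0% |
| GR00T N1.5 + pretraining + 100% target data | - | - | - | 51.1% |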
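A quick way to read the table is to compute, per method, how far composite-unseen performance trails atomic performance. The sketch below only re-encodes the four fully reported rows from the table; the dictionary keys and helper name are my own.

```python
# Success rates (%) copied from the results table; keys are illustrative names.
results = {
    "Diffusion Policy": {"atomic": 15.7, "composite_seen": 0.2, "composite_unseen": 1.25, "avg": 6.1},
    "pi0": {"atomic": 36.3, "composite_seen": 5.2, "composite_unseen": 0.7, "avg": 15.0},
    "pi0.5": {"atomic": 39.6, "composite_seen": 7.1, "composite_unseen": 1.2, "avg": 16.9},
    "GR00T N1.5": {"atomic": 43.0, "composite_seen": 9.6, "composite_unseen": 4.4, "avg": 20.0},
}


def atomic_to_unseen_gap(row):
    """Percentage-point gap between atomic and composite-unseen success."""
    return row["atomic"] - row["composite_unseen"]


gaps = {name: round(atomic_to_unseen_gap(row), 2) for name, row in results.items()}
best_unseen = max(results, key=lambda name: results[name]["composite_unseen"])

for name, gap in gaps.items():
    print(f"{name}: atomic minus composite-unseen = {gap} pp")
```

On these numbers the gap widens as atomic performance improves (about 14 pp for Diffusion Policy versus about 39 pp for GR00T N1.5), which suggests that stronger models gain atomic skill faster than they gain the ability to compose it.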

Sim-to-real: Sim+real training → 79.8% vs real-only baseline → 61.8% (+18pp)
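The +18pp figure is an absolute difference; the same two numbers can also be expressed as a relative gain or as a reduction in failures. A small worked calculation, assuming both figures are percentages on the same evaluation:

```python
sim_plus_real = 79.8  # % with sim+real training, from the notes
real_only = 61.8      # % with real-only training, from the notes

# Absolute gain in percentage points.
absolute_gain_pp = round(sim_plus_real - real_only, 1)

# Relative gain over the real-only baseline.
relative_gain = absolute_gain_pp / real_only

# Share of the baseline's failures that are eliminated.
baseline_failure = 100.0 - real_only
failure_reduction = absolute_gain_pp / baseline_failure

print(f"absolute gain: {absolute_gain_pp} pp")
print(f"relative gain: {relative_gain:.1%}")
print(f"failure reduction: {failure_reduction:.1%}")
```

So the same result reads as roughly a 29% relative improvement, or close to half of the real-only failures removed.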

Pretraining data efficiency: ~3x improvement with foundation model pretraining vs. training from scratch on target data only.
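One common reading of "k-times data efficiency" is that a pretrained model matches a from-scratch model while using 1/k of the target data. Whether the paper defines the ~3x figure exactly this way is an assumption; the helper below is a hypothetical illustration of that reading only.

```python
def equivalent_data_fraction(fraction_from_scratch: float, efficiency: float = 3.0) -> float:
    """Fraction of target data a pretrained model would need to match a
    from-scratch model trained on `fraction_from_scratch` of the data,
    under the assumed 'same performance with 1/k of the data' reading."""
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    return fraction_from_scratch / efficiency


# Under this reading, matching a from-scratch model trained on all target
# data would take about a third of that data with pretraining.
needed = equivalent_data_fraction(1.0)
print(f"{needed:.0%} of the target data")
```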

Limitations

  • Author-stated: Limited to kitchen environments; generalization findings may not transfer to other household settings
  • Author-stated: Simulation cannot capture full sensory and physical complexity of real-world deployment
  • Author-stated: Sim-to-real gap “remains a significant challenge”
  • Unstated: Even the best model (GR00T N1.5 with pretraining and 100% target data) averages only 51.1%, and composite-unseen tasks remain largely unsolved (the best reported score is 4.4%)
  • Unstated: MimicGen synthetic data showed “mixed results” — large mixed-quality datasets don’t reliably improve generalization, suggesting data quality matters more than quantity

Reproducibility

  • Code/Data: Open-sourced at https://robocasa.ai; all trained models released
  • Datasets: 612h human demos (55k episodes) + 1,615h MimicGen synthetic; 2,500 kitchen scenes
  • Compute: Training details not specified in available content; GR00T N1.5 is a large foundation model requiring significant GPU resources
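A few derived figures help when sizing storage or training budgets. The sketch below uses only the dataset numbers quoted in these notes; "55k" is treated as exactly 55,000 episodes, which is an approximation.

```python
HUMAN_HOURS = 612
SYNTHETIC_HOURS = 1615
HUMAN_EPISODES = 55_000  # "55k" in the notes, so this count is approximate

total_hours = HUMAN_HOURS + SYNTHETIC_HOURS

# Share of all demonstration hours that is MimicGen-generated.
synthetic_share = SYNTHETIC_HOURS / total_hours

# Mean length of a human demonstration episode, in seconds.
mean_episode_seconds = HUMAN_HOURS * 3600 / HUMAN_EPISODES

print(f"total: {total_hours} h")
print(f"synthetic share: {synthetic_share:.1%}")
print(f"mean human episode: {mean_episode_seconds:.1f} s")
```

Roughly 73% of the hours are synthetic, and human episodes average about 40 seconds each, which is worth keeping in mind when weighing the mixed-quality concern raised under Limitations.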

Insights

The most important negative result: even with 2,000+ hours of data and the best available foundation models, composite-unseen task performance remains near the floor (0.7% to 4.4% across the foundation model baselines). This suggests that current approaches, even at scale, have not solved compositional generalization for robot manipulation. The 3x pretraining efficiency gain is encouraging, but the absolute numbers tell a more sobering story.

The sim-to-real result (+18pp) is practically significant: it means large-scale simulation data is genuinely useful for real-world deployment, not just benchmarking. This validates investing in simulation infrastructure for labs that lack extensive real-world data collection capacity.

Connections