RoboCasa365: Large-Scale Simulation for Training and Benchmarking Generalist Robots

This article was generated by AI analysis.

Created: 2026-03-27 Source: https://arxiv.org/abs/2603.04356

Summary

RoboCasa365 is a large-scale kitchen simulation benchmark for training and evaluating generalist household robots. It provides 365 tasks (65 atomic and 300 composite) across 2,500 kitchen scenes, paired with more than 2,000 hours of demonstration data (612 h human plus 1,615 h synthetic, generated with MimicGen). The paper was accepted to ICLR 2026. Key finding: pretraining a foundation model on diverse tasks improves data efficiency on downstream tasks by roughly 3x.

Prerequisites

Mobile manipulation — many tasks (220/365) require combined base navigation + arm manipulation; understanding this distinction is key to interpreting the results split
Behavior cloning / imitation learning — all baselines are trained via supervised learning on demonstrations; BC is the baseline against which foundation models are compared
MimicGen — the synthetic data generation pipeline used to scale from human demos to 1,615h of training data; knowing it auto-generates trajectories from a small seed set explains the “mixed-quality” concern
π₀ / GR00T N1.5 — the foundation model baselines; familiarity with VLA-style models helps contextualize the 20-51% performance range

Core Idea

The core contribution is a benchmark that stresses compositional generalization (combining atomic skills in unseen sequences), which is the primary failure mode of current robot policies. The 365-task design is deliberate: 65 atomic tasks cover the skill primitives, and 300 composite tasks chain them in combinations that are either seen or unseen during training. The benchmark tests three paradigms at once (multi-task learning, foundation model pretraining, and lifelong learning), so they can be compared under the same conditions. The sim-to-real gap is addressed by training on 2,500 diverse kitchen scenes; the real-world result (79.8% success with sim+real training vs 61.8% with real-only training) supports the utility of the simulated data.
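The seen/unseen distinction for composite tasks can be made concrete with a small sketch. It assumes a composite task is represented as a tuple of atomic skill names, and that "unseen" means a new sequence built only from skills that appear in training. The skill names and the `classify` helper are hypothetical and are not taken from the RoboCasa365 code base.

```python
def atomic_skills_covered(train_tasks):
    """Return the set of atomic skills used by at least one training task."""
    return {skill for task in train_tasks for skill in task}


def classify(task, train_tasks):
    """Label a composite task relative to a training set.

    'seen'      : the exact skill sequence was trained on
    'unseen'    : a new sequence built only from trained skills
    'new-skill' : contains a skill absent from training, so it falls
                  outside the compositional-generalization setting
    """
    if task in train_tasks:
        return "seen"
    if set(task) <= atomic_skills_covered(train_tasks):
        return "unseen"
    return "new-skill"


# Hypothetical training set of composite tasks (illustrative names only).
train = {
    ("open_drawer", "pick_object", "place_object"),
    ("pick_object", "place_object", "close_drawer"),
}

print(classify(("open_drawer", "pick_object", "place_object"), train))
# seen
print(classify(("open_drawer", "pick_object", "place_object", "close_drawer"), train))
# unseen
print(classify(("pick_object", "pour_liquid"), train))
# new-skill
```

A policy that only memorizes trained sequences can score well on the "seen" group while failing on the "unseen" group, which is the gap the benchmark is designed to expose.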

Results

Method	Atomic	Composite-Seen	Composite-Unseen	Avg
Diffusion Policy	15.7%	0.2%	1.25%	6.1%
π₀	36.3%	5.2%	0.7%	15.0%
π₀.₅	39.6%	7.1%	1.2%	16.9%
GR00T N1.5	43.0%	9.6%	4.4%	20.0%
GR00T N1.5 + pretraining + 100% target data	—	—	—	51.1%
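
The Avg column is not the simple mean of the three split columns, so it is presumably weighted by the number of evaluated tasks per split. The sketch below checks this arithmetic. The per-split task counts (18 atomic, 16 composite-seen, 16 composite-unseen) are an assumption that is not stated in this note; under that assumption the weighted mean matches the reported Avg to one decimal place for all four multi-task rows.

```python
# Success rates (%) copied from the results table:
# (atomic, composite-seen, composite-unseen, reported average)
RESULTS = {
    "Diffusion Policy": (15.7, 0.2, 1.25, 6.1),
    "π₀": (36.3, 5.2, 0.7, 15.0),
    "π₀.₅": (39.6, 7.1, 1.2, 16.9),
    "GR00T N1.5": (43.0, 9.6, 4.4, 20.0),
}

# ASSUMPTION: evaluated tasks per split. These counts are not given in this
# note; they are chosen because they reproduce the reported averages.
ASSUMED_TASK_COUNTS = (18, 16, 16)


def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


for name, (atomic, seen, unseen, reported) in RESULTS.items():
    splits = (atomic, seen, unseen)
    simple = sum(splits) / len(splits)
    weighted = weighted_mean(splits, ASSUMED_TASK_COUNTS)
    print(f"{name}: simple={simple:.1f} weighted={weighted:.1f} reported={reported:.1f}")
```

Because atomic tasks score far higher than composite ones, even a modest extra weight on the atomic split raises the average noticeably, which is worth keeping in mind when comparing the Avg column across methods.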

Sim-to-real: Sim+real training → 79.8% vs real-only baseline → 61.8% (+18pp)
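
The "+18pp" figure is an absolute difference in percentage points, which is easy to confuse with a relative improvement. A minimal sketch of both quantities, using only the two success rates quoted above (the `improvement` helper is an illustrative name):

```python
def improvement(baseline, treatment):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = treatment - baseline
    relative_pct = 100.0 * absolute_pp / baseline
    return absolute_pp, relative_pct


# Real-only baseline vs sim+real training, success rates in percent.
abs_pp, rel_pct = improvement(61.8, 79.8)
print(f"+{abs_pp:.1f} pp absolute, +{rel_pct:.1f}% relative")
# +18.0 pp absolute, +29.1% relative
```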

Pretraining data efficiency: ~3x improvement with foundation model pretraining vs. training from scratch on target data only.
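
One common reading of a "3x data efficiency" claim is that the pretrained model needs about one third of the target data to match what the from-scratch model achieves with all of it. The sketch below computes that ratio from two learning curves by linear interpolation. The curve values are illustrative placeholders chosen so the ratio comes out near 3; they are not results from the paper, and the paper's exact definition of data efficiency may differ.

```python
def fraction_needed(curve, target):
    """Smallest data fraction at which a piecewise-linear curve reaches target.

    curve: list of (data_fraction, success_rate) pairs sorted by data_fraction.
    Returns None if the curve never reaches the target.
    """
    prev_x, prev_y = curve[0]
    if prev_y >= target:
        return prev_x
    for x, y in curve[1:]:
        if y >= target:
            # Interpolate linearly between the two bracketing points.
            return prev_x + (target - prev_y) * (x - prev_x) / (y - prev_y)
        prev_x, prev_y = x, y
    return None


# PLACEHOLDER curves (data fraction, success %), for illustration only.
scratch = [(0.1, 10.0), (0.3, 20.0), (1.0, 29.0)]
pretrained = [(0.1, 20.0), (0.3, 28.0), (0.5, 34.0), (1.0, 45.0)]

target = scratch[-1][1]  # success of the from-scratch model at 100% data
needed = fraction_needed(pretrained, target)
ratio = 1.0 / needed
print(f"pretrained model needs {needed:.2f} of the data -> {ratio:.1f}x efficiency")
```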

Limitations

Author-stated: Limited to kitchen environments; generalization findings may not transfer to other household settings
Author-stated: Simulation cannot capture full sensory and physical complexity of real-world deployment
Author-stated: Sim-to-real gap “remains a significant challenge”
Unstated: Even the best configuration (GR00T N1.5 with pretraining and 100% of the target data) averages only 51.1%, and in the multi-task comparison composite-unseen tasks remain largely unsolved (4.4% at best)
Unstated: MimicGen synthetic data showed “mixed results” — large mixed-quality datasets don’t reliably improve generalization, suggesting data quality matters more than quantity

Reproducibility

Code/Data: Open-sourced at https://robocasa.ai; all trained models released
Datasets: 612h human demos (55k episodes) + 1,615h MimicGen synthetic; 2,500 kitchen scenes
Compute: Training details not specified in available content; GR00T N1.5 is a large foundation model requiring significant GPU resources

Insights

The most important negative result: even with 2,000+ hours of data and the best available foundation models, composite-unseen task performance remains near floor (0.7% to 4.4% in the results table). This suggests that current approaches, even at scale, have not solved compositional generalization for robot manipulation. The 3x pretraining efficiency gain is encouraging, but the absolute numbers tell a more sobering story.

The sim-to-real result (+18pp) is practically significant: it means large-scale simulation data is genuinely useful for real-world deployment, not just benchmarking. This validates investing in simulation infrastructure for labs that lack extensive real-world data collection capacity.
