This note was generated by AI analysis
Created: 2026-03-23 Source: https://x.com/tongzhou_mu/status/2035044458811859303
Summary
Tongzhou Mu’s detailed taxonomy of how video world models are being integrated into robot control, written amid the GTC 2026 buzz around “world models.” Two dominant paradigms emerge: using video models as simulators (for data synthesis, planning, and evaluation) versus using them directly as policies (generating actions jointly with, or from, video). Each paradigm has sub-approaches with distinct tradeoffs in compute cost, hallucination risk, and real-time feasibility. The thread closes by highlighting DVA, which achieves real-time generation with full video denoising, as a particularly noteworthy breakthrough.
Key Points
- Paradigm 1 — Video as Simulator: model predicts future states to synthesize training data (DreamGen, GR00T N1), plan at inference time (V-JEPA 2, Cosmos Policy), or evaluate policies before hardware deployment (Veo Robotics)
- Paradigm 2 — Video as Policy: model directly produces control signals via joint action decoding (GR-1, GR-2), visual representation extraction (VPDD, UWM), or open/closed-loop video-to-action translation (UniPi, DVA)
- Closed-loop generation solves the hallucination problem of open-loop by replacing generated frames with real observations at each step; DVA achieves this with full denoising at real-time speed
- Both paradigms ultimately aim to give robots “physical common sense” from internet-scale video pretraining rather than hand-coded physics
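The open-loop versus closed-loop distinction in the points above can be pictured with a toy sketch. This is not DVA's or UniPi's implementation: `video_model`, `inverse_dynamics`, and `env_step` are hypothetical stand-ins, a "frame" is reduced to a single scalar position, and the 80% actuation mismatch is an assumption chosen only to make the drift visible.

```python
GOAL = 1.0

def video_model(frame, goal=GOAL):
    """Hypothetical stand-in for a video generator: predict the next 'frame'.
    Here a frame is a scalar position that moves halfway toward the goal."""
    return frame + 0.5 * (goal - frame)

def inverse_dynamics(frame, next_frame):
    """Hypothetical video-to-action translator: the action the model
    believes turns `frame` into `next_frame`."""
    return next_frame - frame

def env_step(state, action):
    """Toy 'real world' (assumption): only 80% of each commanded action
    takes effect, a mismatch the video model does not know about."""
    return state + 0.8 * action

def run_open_loop(start, steps):
    # Generate the whole "video" first, feeding each generated frame back in.
    frames = [start]
    for _ in range(steps):
        frames.append(video_model(frames[-1]))
    # Then execute the extracted actions without looking at the real state.
    state = start
    for cur, nxt in zip(frames, frames[1:]):
        state = env_step(state, inverse_dynamics(cur, nxt))
    return state

def run_closed_loop(start, steps):
    state = start
    for _ in range(steps):
        # Condition on the real observation, not on the previous generated frame.
        next_frame = video_model(state)
        state = env_step(state, inverse_dynamics(state, next_frame))
    return state

open_err = abs(GOAL - run_open_loop(0.0, steps=10))
closed_err = abs(GOAL - run_closed_loop(0.0, steps=10))
print(f"open-loop error:   {open_err:.4f}")
print(f"closed-loop error: {closed_err:.4f}")
```

In this toy setting the open-loop run keeps following an imagined trajectory that reality never matched, while the closed-loop run re-anchors on the real state at every step and ends much closer to the goal. The price is one model call per control step, which is why real-time generation speed matters for the closed-loop variant.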
Insights
- The closed-loop video-to-action approach (Paradigm 2.4 in the thread's numbering) is the most architecturally novel: it reframes robot control entirely as a real-time video generation problem, letting the robot directly benefit from video foundation model scaling without adapting the architecture
- There is a tension throughout: Paradigm 1 (simulator) preserves a familiar control loop but adds complexity; Paradigm 2 (policy) is more radical but faces harder engineering constraints for real-time use
- The 36 cited papers across both paradigms indicate the field is expanding rapidly; this thread is an unusually useful map written by someone close to the research front
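The "familiar control loop plus added complexity" of the simulator paradigm can be sketched as sampling-based planning against a learned model. The following is a minimal illustration under stated assumptions, not the method of V-JEPA 2, Cosmos Policy, or any other cited system: `world_model` is a hypothetical one-line stand-in for a video predictor, states and actions are scalars, and random shooting is swapped in as the simplest possible planner.

```python
import random

def world_model(state, action):
    """Hypothetical learned simulator: predict the next state.
    A real system would roll out video frames or latents here."""
    return state + action

def plan_action(state, goal, horizon=5, num_candidates=64, rng=None):
    """Random-shooting planner: imagine many action sequences inside the
    world model and return the first action of the best one, plus the
    number of model calls spent on this single decision."""
    if rng is None:
        rng = random.Random(0)
    best_cost, best_first = float("inf"), 0.0
    model_calls = 0
    for _ in range(num_candidates):
        actions = [rng.uniform(-0.2, 0.2) for _ in range(horizon)]
        imagined = state
        for a in actions:
            imagined = world_model(imagined, a)
            model_calls += 1
        cost = abs(goal - imagined)  # distance to goal at end of rollout
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first, model_calls

first_action, calls = plan_action(state=0.0, goal=1.0)
print(f"chosen first action: {first_action:+.3f}, model calls: {calls}")
```

Each control step here costs `horizon * num_candidates` model evaluations (320 with the defaults). With a large video model in place of the toy function, that multiplier is one concrete source of the compute cost the simulator paradigm takes on.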
Connections
- Clippings-thread-by-tongzhou-mu — the earlier thread by the same author covering the high-level two-paradigm split
- Clippings-gr-dexter-bimanual-dexterous-vla — GR series VLA models mentioned in Paradigm 2.1
- Clippings-state-of-vla-research-iclr-2026 — broader VLA landscape at ICLR 2026
- Clippings-pi-research-rlt — complementary approach to VLA refinement via on-robot RL
Raw Excerpt
“By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.”