Video World Models for Robotics: Two Paradigms — Detailed Breakdown (Tongzhou Mu)

This article was generated by AI analysis.

Created: 2026-03-23 Source: https://x.com/tongzhou_mu/status/2035044458811859303

Summary

Tongzhou Mu presents a detailed taxonomy of how video world models are being integrated into robot control, written amid the GTC 2026 buzz around “world models.” Two dominant paradigms emerge: using video models as simulators (for data synthesis, planning, and evaluation) and using them directly as policies (generating actions jointly with video or from it). Each paradigm has sub-approaches with distinct tradeoffs in compute cost, hallucination risk, and real-time feasibility. The thread closes by singling out the DVA model's real-time generation with full video denoising as a particularly noteworthy breakthrough.

Key Points

Paradigm 1 — Video as Simulator: model predicts future states to synthesize training data (DreamGen, GR00T N1), plan at inference time (V-JEPA 2, Cosmos Policy), or evaluate policies before hardware deployment (Veo Robotics)
Paradigm 2 — Video as Policy: model directly produces control signals via joint action decoding (GR-1, GR-2), visual representation extraction (VPDD, UWM), or open/closed-loop video-to-action translation (UniPi, DVA)
Closed-loop generation solves the hallucination problem of open-loop generation by replacing generated frames with real observations at each step; DVA achieves this with full denoising at real-time speed
Both paradigms ultimately aim to give robots “physical common sense” from internet-scale video pretraining rather than hand-coded physics
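
The open-loop versus closed-loop distinction in the points above can be illustrated with a deliberately tiny sketch. It is a toy under stated assumptions, not the method of UniPi, DVA, or any cited paper: a single number stands in for a video frame, and `predict_next_frame`, `frame_pair_to_action`, and `env_step` are hypothetical names invented for this illustration. The only idea it tries to show is that conditioning each prediction on a real observation keeps the imagined frames from drifting away from reality.

```python
from typing import List

GOAL = 1.0  # toy target "frame"; one number stands in for an image


def predict_next_frame(frame: float) -> float:
    """Hypothetical video model: imagines a frame halfway closer to GOAL."""
    return frame + 0.5 * (GOAL - frame)


def frame_pair_to_action(frame: float, next_frame: float) -> float:
    """Hypothetical video-to-action step: the action is the imagined change."""
    return next_frame - frame


def env_step(state: float, action: float) -> float:
    """Toy real world (assumption): actions are only 80% effective,
    so reality falls behind what the model imagined."""
    return state + 0.8 * action


def run_open_loop(start: float, steps: int) -> float:
    # Generate the whole "video" first, feeding each generated frame back in.
    frames: List[float] = [start]
    for _ in range(steps):
        frames.append(predict_next_frame(frames[-1]))
    # Execute the actions decoded from the imagined video without
    # ever checking them against the real state.
    state = start
    for prev, nxt in zip(frames, frames[1:]):
        state = env_step(state, frame_pair_to_action(prev, nxt))
    return state


def run_closed_loop(start: float, steps: int) -> float:
    state = start
    for _ in range(steps):
        observed = state  # the real observation replaces the generated frame
        imagined = predict_next_frame(observed)
        state = env_step(state, frame_pair_to_action(observed, imagined))
    return state


open_error = abs(GOAL - run_open_loop(0.0, 10))
closed_error = abs(GOAL - run_closed_loop(0.0, 10))
print(f"open-loop error:   {open_error:.3f}")
print(f"closed-loop error: {closed_error:.3f}")
```

In this toy the open-loop run keeps acting on frames that no longer match the real state, so it stops short of the goal, while the closed-loop run corrects itself at every step and ends much closer. The price in a real system is that the video model must run inside the control loop, which is why real-time generation speed matters for this sub-approach.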

Insights

The closed-loop video-to-action approach (Paradigm 2.4) is the most architecturally novel: it reframes robot control entirely as a real-time video generation problem, letting the robot directly benefit from video foundation model scaling without adapting the architecture
There is a tension throughout: Paradigm 1 (simulator) preserves a familiar control loop but adds complexity; Paradigm 2 (policy) is more radical but faces harder engineering constraints for real-time use
The 36 papers cited across both paradigms show how rapidly the field is expanding. The thread is an unusually useful map written by someone close to the research front

Connections

Clippings-thread-by-tongzhou-mu — the earlier thread by the same author covering the high-level two-paradigm split
Clippings-gr-dexter-bimanual-dexterous-vla — GR series VLA models mentioned in Paradigm 2.1
Clippings-state-of-vla-research-iclr-2026 — broader VLA landscape at ICLR 2026
Clippings-pi-research-rlt — complementary approach to VLA refinement via on-robot RL

Raw Excerpt

“By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.”
