Video World Models for Robotics: Two Paradigms (Tongzhou Mu Thread)

This note was generated by AI analysis

Created: 2026-03-22 Source: https://x.com/tongzhou_mu/status/2035044458811859303

Summary

Tongzhou Mu’s taxonomy of how video world models are being integrated into robot control, posted amid the GTC 2026 buzz. The thread describes two paradigms: (1) using video models as simulators, for data synthesis, inference-time planning, and policy evaluation; and (2) using video models directly as policies, through joint video-action generation, visual representation extraction, or video-to-action translation pipelines. It identifies closed-loop video generation at real-time speed (DVA, LingBot-VA) as the most promising frontier.

Key Points

Paradigm 1 — Video Model as Simulator: data synthesis (DreamGen, GR00T N1), inference-time planning (V-JEPA 2, Cosmos Policy), policy evaluation (Veo Robotics)
Paradigm 2 — Video Model as Policy: joint generation (GR-1/2, Cosmos Policy), visual representations (VPDD, UWM), open-loop video + inverse dynamics (UniPi, TesserAct), closed-loop video generation (DVA, mimic-video, LingBot-VA)
Closed-loop generation fixes the open-loop hallucination problem: it conditions on real observations at each step and replaces generated frames with real ones
DVA achieves real-time full denoising (pure noise → clean video at control frequency), which the thread considers a significant breakthrough
Core thesis: reframing robot control as real-time video generation lets it benefit directly from internet-scale video pretraining
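
The closed-loop pattern in the points above can be sketched as a small control loop. This is a toy illustration only, not the method of DVA, LingBot-VA, or any other system named in the thread: frames are single floats, and `toy_video_model`, `toy_inverse_dynamics`, `toy_robot_step`, and `GOAL` are hypothetical stand-ins invented for this example. The point it shows is structural: each step conditions on real observations, and the generated frame is discarded once the real one arrives.

```python
from typing import Callable, List, Tuple

Frame = float  # toy stand-in for an image; a real system would use pixel arrays

GOAL = 1.0  # hypothetical target "frame" for the toy task


def toy_video_model(history: List[Frame]) -> Frame:
    """Hypothetical video model: imagines the next frame halfway to the goal."""
    current = history[-1]
    return current + 0.5 * (GOAL - current)


def toy_inverse_dynamics(current: Frame, predicted: Frame) -> float:
    """Hypothetical action decoder: the action that would produce the predicted frame."""
    return predicted - current


def toy_robot_step(current: Frame, action: float) -> Frame:
    """Hypothetical robot: only 90% of the commanded motion is realized."""
    return current + 0.9 * action


def closed_loop_rollout(
    predict_next: Callable[[List[Frame]], Frame],
    frames_to_action: Callable[[Frame, Frame], float],
    env_step: Callable[[Frame, float], Frame],
    first_obs: Frame,
    steps: int,
) -> Tuple[List[Frame], List[float]]:
    history = [first_obs]  # holds real observations only
    actions = []
    for _ in range(steps):
        predicted = predict_next(history)                  # generate a future frame
        action = frames_to_action(history[-1], predicted)  # turn it into an action
        real_obs = env_step(history[-1], action)           # execute on the robot
        history.append(real_obs)                           # keep the real frame, drop the generated one
        actions.append(action)
    return history, actions


history, actions = closed_loop_rollout(
    toy_video_model, toy_inverse_dynamics, toy_robot_step, first_obs=0.0, steps=10
)
print(f"final observation: {history[-1]:.4f}")
```

Because the imagined frame never enters `history`, the mismatch between what the toy model predicts and what the toy robot actually does cannot feed back into later predictions.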

Insights

The “video model as policy” paradigm is a natural extension of the VLA paradigm: VLAs use language+vision to predict actions; video-policy models use language+vision to predict future video, then extract actions from that video. The key insight is that video generation is what these models were pretrained to do, so using them for direct generation rather than policy distillation is better aligned with their pretraining.
Closed-loop video generation (replacing generated frames with real observations) addresses a fundamental problem in model-based RL: a world model’s prediction errors compound over long horizons. Resetting with real observations at each step bounds the error.
The computational challenge (running video diffusion at control frequencies) is the same bottleneck as in VLA inference. Discrete diffusion and KV caching (mentioned for LingBot-VA) are both being applied as solutions, suggesting the approaches are converging.
“Implicit physics from billions of web videos” vs. explicit physics simulation (IsaacSim, MuJoCo) is a deeper epistemological split: the former bets that enough video data contains sufficient physics grounding; the latter maintains explicit physical models. Real-world deployment will likely require both.
This thread connects directly to the ICLR 2026 VLA survey (Trend #6: VLA + Video Prediction) and GR-Dexter (which uses a similar “video-conditioned” architecture philosophy)
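
The error-compounding point in the list above can be made concrete with a deliberately simple numeric sketch. Everything here is an assumption chosen for illustration: the "world" is a scalar that advances by 1.0 per step, and the hypothetical learned model `model_step` is biased to predict 1.1 per step. Real video models have far messier error behavior, so this shows only the mechanism, not realistic magnitudes.

```python
from typing import List


def true_step(x: float) -> float:
    """Toy ground-truth dynamics: the state advances by exactly 1.0 per step."""
    return x + 1.0


def model_step(x: float) -> float:
    """Hypothetical imperfect world model: overshoots by 0.1 per step."""
    return x + 1.1


def open_loop_errors(x0: float, horizon: int) -> List[float]:
    """The model is fed its own previous prediction, so the bias accumulates."""
    real, imagined = x0, x0
    errors = []
    for _ in range(horizon):
        real = true_step(real)
        imagined = model_step(imagined)
        errors.append(abs(imagined - real))
    return errors


def closed_loop_errors(x0: float, horizon: int) -> List[float]:
    """The model is re-conditioned on the real state every step, so only one step of bias remains."""
    real = x0
    errors = []
    for _ in range(horizon):
        predicted = model_step(real)
        real = true_step(real)
        errors.append(abs(predicted - real))
    return errors


open_err = open_loop_errors(0.0, horizon=20)
closed_err = closed_loop_errors(0.0, horizon=20)
print(f"open-loop error after 20 steps:  {open_err[-1]:.2f}")
print(f"closed-loop worst one-step error: {max(closed_err):.2f}")
```

In this toy setting the open-loop error grows with the horizon while the closed-loop error stays at the one-step bias, which is the sense in which resetting on real observations bounds the error.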

Connections

Raw Excerpt

By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.
