This note was generated by AI analysis
Created: 2026-03-22 Source: https://x.com/tongzhou_mu/status/2035044458811859303
Summary
Tongzhou Mu's taxonomy of how video world models are being integrated into robot control, posted after the GTC 2026 buzz. It describes two paradigms: (1) video models as simulators, used for data synthesis, inference-time planning, and policy evaluation; and (2) video models as policies, via joint video-action generation, visual representation extraction, or video-to-action translation pipelines. The thread identifies closed-loop video generation at real-time speed (DVA, LingBot-VA) as the most promising frontier.
Key Points
- Paradigm 1 — Video Model as Simulator: data synthesis (DreamGen, GR00T N1), inference-time planning (V-JEPA 2, Cosmos Policy), policy evaluation (Veo Robotics)
- Paradigm 2 — Video Model as Policy: joint generation (GR-1/2, Cosmos Policy), visual representations (VPDD, UWM), open-loop video + inverse dynamics (UniPi, TesserAct), closed-loop video generation (DVA, mimic-video, LingBot-VA)
- Closed-loop generation fixes the open-loop hallucination problem: the model conditions on real observations at each step, and generated frames are replaced with real ones
- DVA achieves real-time full denoising (pure noise → clean video at control frequency), which the thread considers a significant breakthrough
- Core thesis: robot control reframed as real-time video generation → directly benefits from internet-scale video pretraining
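The difference between open-loop and closed-loop video generation described above can be sketched with a toy example. This is a minimal illustration under loud assumptions, not any of the cited systems: a "frame" is a single float, and `predict_frames`, `inverse_dynamics`, and `env_step` are hypothetical stand-ins for a video model, an inverse dynamics model, and a real robot whose actuator under-delivers on commanded motion.

```python
from typing import List

Frame = float  # hypothetical stand-in for an image observation


def predict_frames(obs: Frame, goal: Frame, n: int = 4) -> List[Frame]:
    """Toy 'video model': imagine n future frames moving linearly toward the goal."""
    step = (goal - obs) / n
    return [obs + step * (i + 1) for i in range(n)]


def inverse_dynamics(frame: Frame, next_frame: Frame) -> float:
    """Toy inverse dynamics: the action is the displacement between two frames."""
    return next_frame - frame


def env_step(state: Frame, action: float) -> Frame:
    """Toy environment (assumption): only 80% of the commanded motion is realized."""
    return state + 0.8 * action


def run_open_loop(start: Frame, goal: Frame, steps: int) -> Frame:
    """Generate the whole video once, then execute it without looking at reality."""
    frames = predict_frames(start, goal, n=steps)
    state, prev = start, start
    for frame in frames:
        action = inverse_dynamics(prev, frame)
        state = env_step(state, action)
        prev = frame  # the generated frame is treated as if it had really happened
    return state


def run_closed_loop(start: Frame, goal: Frame, steps: int) -> Frame:
    """Regenerate from the real observation at every control step."""
    obs = start
    for _ in range(steps):
        frames = predict_frames(obs, goal)
        action = inverse_dynamics(obs, frames[0])
        obs = env_step(obs, action)  # the real observation replaces generated frames
    return obs


if __name__ == "__main__":
    print("open-loop final state:  ", run_open_loop(0.0, 1.0, 20))
    print("closed-loop final state:", run_closed_loop(0.0, 1.0, 20))
```

In this toy setup the open-loop rollout never notices that the robot is falling behind the imagined video, while the closed-loop rollout re-plans from where the robot actually is and ends closer to the goal. The cost is one video-model call per control step, which is why generation speed matters for this approach.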
Insights
- The “video model as policy” paradigm is a natural extension of the VLA paradigm: VLAs use language+vision to predict actions; video-policy models use language+vision to predict future video, then extract actions from that video. The key insight is that video generation is what these models were pre-trained to do — so using them for direct generation rather than policy distillation is more aligned with their pretraining
- Closed-loop video generation (replacing generated frames with real observations) addresses a fundamental problem in model-based RL: a world model's prediction errors compound over long horizons. Re-conditioning on real observations at each step keeps each prediction one step away from reality, so errors do not accumulate across steps
- The computational challenge (running video diffusion at control frequencies) is the same bottleneck as in VLA inference — discrete diffusion and KV caching (mentioned for LingBot-VA) are both being applied as solutions, suggesting convergence of the approaches
- “Implicit physics from billions of web videos” vs. explicit physics simulation (IsaacSim, MuJoCo) is a deeper epistemological split: the former bets that enough video data contains sufficient physics grounding; the latter maintains explicit physical models. Real-world deployment will likely require both
- This thread connects directly to the ICLR 2026 VLA survey (Trend #6: VLA + Video Prediction) and GR-Dexter (which uses a similar “video-conditioned” architecture philosophy)
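The compounding-error point above can be made concrete with a toy calculation. The sketch below is illustrative only: the linear dynamics, the slightly wrong learned model, and the function names `true_step`, `model_step`, and `rollout_errors` are all assumptions chosen for clarity, not drawn from the thread or from any cited system.

```python
from typing import List, Tuple


def true_step(x: float) -> float:
    """Assumed ground-truth dynamics (converges toward x = 10)."""
    return 0.99 * x + 0.1


def model_step(x: float) -> float:
    """Assumed learned world model with a small multiplicative error."""
    return 1.02 * x + 0.1


def rollout_errors(x0: float, horizon: int) -> Tuple[List[float], List[float]]:
    """Return per-step prediction errors for open-loop and closed-loop rollouts."""
    x_true = x0
    x_open = x0  # open loop: the model keeps feeding on its own predictions
    open_loop: List[float] = []
    closed_loop: List[float] = []
    for _ in range(horizon):
        x_closed = model_step(x_true)  # closed loop: predict from the real state
        x_open = model_step(x_open)
        x_true = true_step(x_true)
        open_loop.append(abs(x_open - x_true))
        closed_loop.append(abs(x_closed - x_true))
    return open_loop, closed_loop


if __name__ == "__main__":
    open_err, closed_err = rollout_errors(x0=1.0, horizon=50)
    print("open-loop error at final step:", open_err[-1])
    print("largest closed-loop error:    ", max(closed_err))
```

In this toy system the closed-loop error at each step is just the one-step model error, which stays small because the true state stays bounded, while the open-loop error keeps growing with the horizon. Real video models and real robots are far less tidy, so this shows the mechanism rather than a guarantee.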
Connections
- State of VLA Research at ICLR 2026
- GR-Dexter: VLA for Bimanual Dexterous Manipulation
- Vision-Language-Action Models
- Diffusion Models
- Embodied AI
- World Models
- Reinforcement Learning
Raw Excerpt
By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.