Summary

Tongzhou Mu (PhD student, CMU/UCSD) maps the post-GTC 2026 landscape of video world models for robotics into two dominant paradigms: (1) Video Model as Simulator, where predicted futures are used for data synthesis, planning, or policy evaluation; and (2) Video Model as Policy, where the model generates video and actions jointly, supplies visual representations for action generation, or drives open-loop or closed-loop video-to-action translation.

Key Points

  • Paradigm 1 — Simulator: (1.1) synthesize data for policy training (DreamGen, GR00T N1); (1.2) imagination-based inference-time planning (V-JEPA 2, CLASP, Cosmos Policy); (1.3) policy evaluation before hardware contact (Veo Robotics)
  • Paradigm 2 — Policy: (2.1) joint video+action generation (DreamZero, GR-1/GR-2, PAD); (2.2) extract visual representations to guide action generation (VPDD, UWM, Video Policy); (2.3) open-loop video → inverse dynamics → actions (UniPi, TesserAct); (2.4) closed-loop real-time generation (DVA, mimic-video, LingBot-VA)
  • Simulator limitation: usefulness is capped by prediction accuracy. There is no free lunch: physics hallucination in the predicted futures is the hard problem
  • DVA as a turning point: it runs full video denoising in real time, going from noise to a clean video at every step rather than denoising only partially, which reformulates control as real-time video generation
  • Key insight: both paradigms rely on implicit physics learned from web-scale video pretraining, in place of manually coded physical laws
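
As a rough illustration of imagination-based inference-time planning (1.2), the sketch below samples candidate action sequences, rolls each one forward through a world model, and keeps the sequence whose imagined outcome lands closest to the goal. Everything here is an assumption made for illustration: `toy_world_model` is a hypothetical one-line stand-in for a learned video model, states are scalars rather than frames, and random shooting is just one simple planner, not the method used by any of the systems named above.

```python
import random


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned video world model.

    A real model would predict future frames; here the "frame" is a scalar.
    """
    return state + action


def plan_by_imagination(state, goal, horizon=5, num_candidates=200, seed=0):
    """Random-shooting planner: imagine many futures, keep the best one."""
    rng = random.Random(seed)
    best_cost, best_plan = float("inf"), None
    for _ in range(num_candidates):
        # Sample one candidate action sequence.
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        # Roll it out entirely inside the model, with no hardware contact.
        imagined = state
        for action in plan:
            imagined = toy_world_model(imagined, action)
        # Score the imagined final state against the goal.
        cost = abs(goal - imagined)
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan, best_cost


plan, cost = plan_by_imagination(state=0.0, goal=2.0)
```

The quality of the chosen plan is only as good as the model's predictions, which is the accuracy ceiling noted in the simulator limitation above.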

Insights

Closed-loop video generation with video-to-action translation (2.4) is the most technically ambitious direction: conditioning on real observations at every step keeps hallucination in check, while the approach still fully leverages video pretraining. DVA reaching real-time speed with complete denoising is the key recent milestone. If this scales, robot control could benefit from the same data flywheel as video generation, with every uploaded video becoming implicit robot training data.
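
A toy sketch of why re-conditioning on real observations matters, contrasting open-loop (2.3) with closed-loop (2.4) video-to-action translation. All names are hypothetical and the setup is an assumption for illustration only: "frames" are scalars, `predict_next_frame` stands in for a video model, `inverse_dynamics` stands in for an inverse dynamics model, and `real_step` adds a model-reality mismatch (the actuator delivers only 80% of each commanded action). It is not DVA's method or any named system's implementation.

```python
def predict_next_frame(frame, goal, max_step=0.5):
    """Hypothetical video model: imagine the next frame, one bounded step toward the goal."""
    delta = max(-max_step, min(max_step, goal - frame))
    return frame + delta


def inverse_dynamics(frame, next_frame):
    """Hypothetical inverse dynamics model: the action it believes links two frames."""
    return next_frame - frame


def real_step(state, action, gain=0.8):
    """Real robot: under-delivers each action, a mismatch the models do not know about."""
    return state + gain * action


def run_open_loop(start, goal, steps):
    # Imagine the whole video once from the first observation, then execute blindly.
    frames = [start]
    for _ in range(steps):
        frames.append(predict_next_frame(frames[-1], goal))
    actions = [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]
    state = start
    for action in actions:
        state = real_step(state, action)
    return state


def run_closed_loop(start, goal, steps):
    # Re-generate the next frame from the real observation at every step.
    state = start
    for _ in range(steps):
        target = predict_next_frame(state, goal)
        state = real_step(state, inverse_dynamics(state, target))
    return state


open_error = abs(2.0 - run_open_loop(0.0, 2.0, steps=8))
closed_error = abs(2.0 - run_closed_loop(0.0, 2.0, steps=8))
```

In this toy setup the open-loop run stalls short of the goal because its imagined video never sees the shortfall, while the closed-loop run keeps correcting from what it actually observes. The price is that the generator must run inside the control loop, which is why real-time generation speed is the bottleneck.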

Raw Excerpt

By reformulating robot control as a challenge of real-time video generation, we may be on the verge of a new scaling law for embodied intelligence.