This document was generated by AI analysis
Created: 2026-03-28 Source: https://knightnemo.github.io/blog/posts/wm_2025/
Summary
Siqiao Huang (Nemo), an embodied AI researcher, shares a grounded assessment of world model research in late 2025 — after Genie 3’s viral success triggered a wave of competing releases. He argues that pixel-space video generation will dominate over 3D mesh approaches, that world models are more likely to be byproducts of video generation than standalone foundations, and that five specific research directions remain genuinely open.
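In short: after Genie 3 went viral in 2025, embodied AI researcher Siqiao Huang takes a sober, technical look at the state of world models. Pixel-space methods beat 3D mesh approaches, world models are more likely a byproduct of video generation, and five research directions are worth pursuing in depth.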
Key Points
- Pixel vs 3D: video-based world models scale better than 3D mesh due to data abundance; 3D remains dominant only in contact-rich or depth-dependent scenarios
- Not the next LLM: two blockers — (1) action-labeled video data is far scarcer than raw video; (2) heterogeneous action spaces prevent a unified foundation model across embodiments
- The real “Next Big Thing”: multimodal video generation with world models as a subsidiary product, not the primary goal
- Algorithmic convergence: Diffusion Forcing + DiT/UNet architectures + AdaLN action injection is already the dominant pattern — the architecture race is largely over
- Five open directions: (1) deploying WMs for embodied policy learning; (2) long-sequence temporal consistency (minutes of memory); (3) multimodal signal integration (language + sensorimotor); (4) real-time inference; (5) multi-agent world models
- JEPA assessment: the latent-space learning idea is sound and influential, but JEPA-style architectures may not be the final form — DINO-based encoders + predictors are a more practical current direction
- Embodied AI outlook: world models will supplement but not replace real-world imitation learning; the next era should revolve around generalist policy models (VLA-style), not world models
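The "AdaLN action injection" item above can be made concrete with a small sketch. The idea of adaptive layer norm conditioning is that the conditioning signal (here, an action vector) predicts a per-channel scale and shift applied to normalized features. The class name `AdaLNActionInjection`, the dimensions, and the random initialization below are illustrative assumptions, not code from the blog post or from any specific world model.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, x):
    """Plain dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class AdaLNActionInjection:
    """Hypothetical module: the action vector predicts a per-channel scale and shift."""

    def __init__(self, feature_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.feature_dim = feature_dim
        # A single projection outputs [scale | shift], hence 2 * feature_dim rows.
        # Small random weights are an assumption made for this sketch.
        self.weights = [
            [rng.gauss(0.0, 0.02) for _ in range(action_dim)]
            for _ in range(2 * feature_dim)
        ]
        self.bias = [0.0] * (2 * feature_dim)

    def __call__(self, features, action):
        modulation = linear(self.weights, self.bias, action)
        scale = modulation[: self.feature_dim]
        shift = modulation[self.feature_dim:]
        normed = layer_norm(features)
        # The action modulates the normalized features channel by channel.
        return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


if __name__ == "__main__":
    layer = AdaLNActionInjection(feature_dim=4, action_dim=2)
    frame_features = [1.0, 2.0, 3.0, 4.0]
    print(layer(frame_features, [0.0, 0.0]))   # neutral action: plain layer norm
    print(layer(frame_features, [1.0, -1.0]))  # a different action shifts the features
```

In this sketch a zero action leaves the normalized features unchanged, so the action only ever enters through the scale and shift terms. A real model would apply this inside each transformer or UNet block and learn the projection.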
Insights
The action space heterogeneity problem is underappreciated. LLMs succeeded because tokens are a universal representation — language, code, and structured data all map cleanly to tokens. World models lack an equivalent unifying abstraction for actions: a robot arm’s joint angles, a game controller’s button presses, and a car’s steering wheel are incommensurable in ways that token IDs are not. Any “foundation world model” claim needs to address this directly.
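A rough sketch of the contrast drawn above: any text reduces to IDs from one shared vocabulary, while actions from different embodiments differ in both kind and size. The three action classes and the use of UTF-8 bytes as stand-in token IDs are hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple, Union


@dataclass(frozen=True)
class ArmAction:
    """Hypothetical arm command: continuous joint angles in radians."""
    joint_angles: Tuple[float, ...]


@dataclass(frozen=True)
class GamepadAction:
    """Hypothetical controller input: an unordered set of pressed buttons."""
    pressed: FrozenSet[str]


@dataclass(frozen=True)
class DrivingAction:
    """Hypothetical car command: steering angle plus throttle."""
    steering: float
    throttle: float


Action = Union[ArmAction, GamepadAction, DrivingAction]


def action_signature(action: Action) -> Tuple[str, int]:
    """Return (kind, size) describing the raw shape of an action."""
    if isinstance(action, ArmAction):
        return ("continuous", len(action.joint_angles))
    if isinstance(action, GamepadAction):
        return ("discrete-set", len(action.pressed))
    if isinstance(action, DrivingAction):
        return ("continuous", 2)
    raise TypeError(f"unknown action type: {type(action).__name__}")


def encode_text(text: str) -> List[int]:
    """Stand-in tokenizer: UTF-8 bytes, so every ID lies in one 256-entry vocabulary."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    actions = [
        ArmAction(joint_angles=(0.0,) * 7),
        GamepadAction(pressed=frozenset({"A", "LEFT"})),
        DrivingAction(steering=0.1, throttle=0.5),
    ]
    for a in actions:
        print(type(a).__name__, action_signature(a))
    for sample in ["hello world", "def f(x): return x", '{"key": 1}']:
        ids = encode_text(sample)
        print(len(ids), min(ids), max(ids))
```

Prose, code, and JSON all land in the same ID range here, whereas the three actions yield three different signatures. A cross-embodiment world model would need some learned or designed mapping between them.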
The author invokes the Bitter Lesson to dismiss physics-informed world models: “learning general models through physics-informed methods is completely wrong.” This is a strong claim that conflates generalization with performance: physics priors help in specific domains (contact-rich manipulation) even if they hurt in others.
The multi-agent world model gap is genuinely underexplored. Current WMs are single-agent by design; multiplayer environments require modeling other agents’ policies, which scales combinatorially with the number of players.
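One way to see the scaling concern is to count joint actions. This is a back-of-the-envelope sketch under a simplifying assumption: every agent chooses independently from the same discrete action set. Under that assumption the number of joint actions is the per-agent count raised to the number of agents. The function names are illustrative.

```python
from itertools import product


def joint_action_count(actions_per_agent: int, num_agents: int) -> int:
    """Size of the joint action space when each agent picks from the same discrete set."""
    return actions_per_agent ** num_agents


def enumerate_joint_actions(actions, num_agents):
    """Explicitly list every joint action; only feasible for tiny settings."""
    return list(product(actions, repeat=num_agents))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, joint_action_count(8, n))
    print(enumerate_joint_actions(["left", "right"], 2))
```

With eight actions per agent, four agents already give 4096 joint actions. A world model that conditions on the full joint action therefore faces a much sparser coverage problem than a single-agent one.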
Connections
Raw Excerpt
The Heterogeneity of Action Spaces: action spaces across different embodiments inherently lack homogeneity. A world model without a unified action space cannot become a ready-to-use foundation model, and more research breakthroughs are needed before realizing a foundational world model across embodiments.