Robotics World Model Reading Club 01 — Explicit 3D Backbone, Unified World Models

This note was generated by AI analysis.

Created: 2026-03-30 Source: https://x.com/junfanzhu98/status/2038153945219305812

Summary

A detailed recap of the first San Francisco Robotics World Model Reading Club (March 28, 2026), authored by Junfan Zhu. The thread synthesizes the paradigm shift from policy learning (VLA: observation → action) to world model learning (WAM: latent world → future trajectories → actions). The central bottleneck it identifies is the absence of a unified latent interface that aligns perception, geometry, physics, and action, which leaves the field fragmented across these dimensions.

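Core argument: robotics is moving from learning policies to learning world models, but it still lacks a unified latent interface to align perception, geometry, physics, and action. The main bottleneck is not model scale but fundamental gaps in representation, data, and physics simulation.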

Key Points

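VLA → WAM paradigm shift: VLA is a direct observation → action mapping; WAM (World Action Model) is latent world → future trajectories → controllable actions, emphasizing controllable simulation over reactive policy. NVIDIA Gr00t N2 is currently the strongest end-to-end example.
Explicit 3D as the backbone: pixel representations are highly redundant and not geometry-aware; the new direction is point clouds/meshes + object-centric sub-object representations + geometry-aware selective tracking (contact points, affordances, articulated parts).
D4RT unified latent space: a dynamic 4D reconstruction and tracking system that jointly infers depth, spatio-temporal correspondence, and camera parameters through a single feed-forward Transformer architecture, achieving a 300× speedup and making it usable for real-time robotics applications.
The Sim2Real physics gap: the main obstacle is not vision but physics, namely discontinuous contact dynamics, the high degrees of freedom of deformable objects, and non-differentiable friction. “Simulation-ready” data lacks a unified definition.
FastWAM inference optimization: keep video co-training during training and skip future prediction (rollout) at test time. “Control only needs to select a feasible trajectory; it does not need to model the full future distribution,” which greatly reduces latency without hurting performance.
Data bottleneck: robotics has no internet-like data flywheel and faces a “100,000-year data gap”; cross-embodiment learning requires aligning coordinate frames, action spaces, and kinematic constraints.
Core conclusion: the bottleneck is not model scale but unified representation, the data flywheel, the mismatch between inference and control, unsolved physics problems, and fragmented embodiments.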
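The contrast in the first key point can be made concrete with a toy sketch. The code below is an illustration only, not any system discussed in the thread: the 1D state, the `toy_world_model` dynamics, and the names `reactive_policy` and `plan_with_world_model` are all hypothetical stand-ins. It shows a reactive mapping from observation to action next to a planner that imagines a trajectory for each candidate action sequence and then picks an action.

```python
from typing import Callable, List, Sequence

State = float   # toy 1D position standing in for a latent world state (assumption)
Action = float  # toy 1D displacement command (assumption)


def reactive_policy(obs: State, goal: State) -> Action:
    """VLA-style: map the current observation straight to a clipped action."""
    return max(-1.0, min(1.0, goal - obs))


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for learned dynamics: position += action."""
    return state + action


def rollout(state: State, actions: Sequence[Action],
            model: Callable[[State, Action], State]) -> List[State]:
    """Imagine the future trajectory produced by an action sequence."""
    trajectory = []
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


def plan_with_world_model(state: State, goal: State,
                          candidates: Sequence[Sequence[Action]],
                          model: Callable[[State, Action], State]) -> Action:
    """WAM-style: score each imagined trajectory, return the best first action."""
    def cost(actions: Sequence[Action]) -> float:
        return abs(rollout(state, actions, model)[-1] - goal)

    best = min(candidates, key=cost)
    return best[0]


if __name__ == "__main__":
    candidates = [[1.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]
    print(reactive_policy(0.0, 1.0))
    print(plan_with_world_model(0.0, 1.0, candidates, toy_world_model))
```

The planner's extra cost is the rollout over every candidate, which is the latency the FastWAM point in the list is concerned with.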

Insights

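“Reality cannot be scraped like the internet. It must be sensed, interacted with, and simulated.” This sentence precisely captures the fundamental difference between robot learning and LLM scaling: LLM training data already exists on the internet, whereas robot training data must be actively generated.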

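FastWAM's design philosophy is worth noting: inference ≠ control. A WAM needs the world model during training, but at control time it only needs to sample feasible trajectories from the learned latent space. This principle, that the training-time task and the inference-time task can differ, resembles the logic behind techniques such as speculative decoding.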
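A minimal sketch of that train/inference asymmetry, under loud assumptions: this is not FastWAM's architecture or API. The class `ToyWorldActionModel`, its arithmetic "heads", and the `video_weight` parameter are hypothetical, chosen only to show a shared encoder whose future-prediction head contributes to the training loss but is never called on the control path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyWorldActionModel:
    """Hypothetical toy model: shared encoder, action head, video head."""
    calls: List[str] = field(default_factory=list)  # records which parts ran

    def encode(self, obs: List[float]) -> List[float]:
        self.calls.append("encode")
        return [2.0 * x for x in obs]  # stand-in for a learned encoder

    def action_head(self, latent: List[float]) -> float:
        self.calls.append("action_head")
        return sum(latent) / len(latent)  # stand-in for action decoding

    def video_head(self, latent: List[float]) -> List[float]:
        self.calls.append("video_head")
        return [z + 1.0 for z in latent]  # stand-in for future prediction

    def training_loss(self, obs: List[float], target_action: float,
                      target_future: List[float],
                      video_weight: float = 0.5) -> float:
        """Training: action loss plus a weighted future-prediction loss."""
        latent = self.encode(obs)
        action_loss = (self.action_head(latent) - target_action) ** 2
        pred = self.video_head(latent)
        video_loss = sum((p - t) ** 2
                         for p, t in zip(pred, target_future)) / len(pred)
        return action_loss + video_weight * video_loss

    def act(self, obs: List[float]) -> float:
        """Control: encoder and action head only; no future rollout."""
        return self.action_head(self.encode(obs))


if __name__ == "__main__":
    model = ToyWorldActionModel()
    model.training_loss([1.0, 2.0], 3.0, [3.0, 5.0])
    print(model.calls)   # training touches all three components
    model.calls.clear()
    model.act([1.0, 2.0])
    print(model.calls)   # control skips the video head
```

In this sketch the video head shapes the shared latent only through the training loss, so dropping it at control time removes its compute without changing what `act` returns.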

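Convex decomposition as the bridge from geometric reconstruction to physics simulation is an understated but important engineering insight: it decomposes open-world 3D shapes into a union of convex bodies, which makes collision proxies efficient and speeds up collision detection in the simulator by roughly 5×.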
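Why convex pieces make good collision proxies can be shown with a small 2D sketch. Everything here is an assumption for illustration: the L-shaped object, its hand-picked two-rectangle decomposition, and the helper names are hypothetical, and the separating axis test is a standard technique for convex shapes rather than the method used by any tool mentioned in the thread. A single convex hull over the L fills in the notch and reports a false contact, while the union of convex parts does not.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # convex polygon, vertices listed in order


def _project(poly: Polygon, axis: Point) -> Tuple[float, float]:
    """Project all vertices onto an axis and return the (min, max) interval."""
    dots = [x * axis[0] + y * axis[1] for x, y in poly]
    return min(dots), max(dots)


def convex_overlap(a: Polygon, b: Polygon) -> bool:
    """Separating axis test; valid only when both polygons are convex."""
    for poly in (a, b):
        n = len(poly)
        for i in range(n):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % n]
            axis = (y1 - y2, x2 - x1)  # normal of the edge (x1,y1)->(x2,y2)
            min_a, max_a = _project(a, axis)
            min_b, max_b = _project(b, axis)
            if max_a < min_b or max_b < min_a:
                return False  # found a separating axis, so no contact
    return True  # no separating axis exists (touching counts as overlap)


def collides(parts_a: List[Polygon], parts_b: List[Polygon]) -> bool:
    """Two decomposed objects collide if any pair of convex parts overlaps."""
    return any(convex_overlap(p, q) for p in parts_a for q in parts_b)


def rect(x0: float, y0: float, x1: float, y1: float) -> Polygon:
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


# Hand-picked decomposition of an L shape into two convex rectangles.
L_SHAPE = [rect(0, 0, 3, 1), rect(0, 1, 1, 3)]
# Single convex hull of the same L shape, a coarser proxy.
L_HULL = [(0, 0), (3, 0), (3, 1), (1, 3), (0, 3)]

if __name__ == "__main__":
    box_in_notch = rect(1.5, 1.5, 2.5, 2.5)
    print(collides(L_SHAPE, [box_in_notch]))      # decomposed proxy
    print(convex_overlap(L_HULL, box_in_notch))   # single-hull proxy
```

The design trade-off is the number of pieces: more convex parts track the shape more closely but add pairwise checks, so a decomposition aims for few parts that still respect concavities such as the notch.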

Connections

Clippings-an-anatomy-of-vision-language-action-models-from-modules-to-milestones-and-chall: a systematic review of VLA architectures, offering a contrast to this note's argument for the VLA → WAM shift
world-models
sim2real
embodied-ai
representation-learning

Raw Excerpt

“Reality cannot be scraped like the internet. It must be sensed, interacted with, and simulated.”
