This note was generated by AI analysis.
Created: 2025-08-25 Source: https://arxiv.org/abs/2508.17600
Summary
GWM (ICCV 2025) is the first world model to use 3D Gaussian Splatting as its internal scene representation. A DiT in latent Gaussian space learns to predict how Gaussian primitives evolve under robot actions. This gives the world model an explicit 3D representation: unlike video diffusion models, GWM’s predicted future states are renderable from any viewpoint and, in principle, support analytic geometry queries on the predicted scene.
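GWM (ICCV 2025) uses 3D Gaussian Splatting as the world model's internal representation and a Diffusion Transformer to learn action-conditioned scene prediction in the Gaussian latent space. On a real Franka robot it outperforms Diffusion Policy by 30 percentage points (65% vs 35%), making it the most direct application of 3DGS to robot manipulation prediction to date.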
Prerequisites
- 3D Gaussian Splatting (3DGS) — explicit scene representation using learnable Gaussian primitives
- Splatt3R — feed-forward model converting unposed images to 3D Gaussians without calibration
- Diffusion Transformer (DiT) — generative backbone operating in latent space
- EDM preconditioning — training stabilization technique for diffusion models
- MBPO (Model-Based Policy Optimization) — RL framework that GWM integrates with
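For readers unfamiliar with EDM preconditioning, the sketch below computes the four scaling coefficients defined in the EDM paper (Karras et al., 2022) and wraps an arbitrary network with them. The default `sigma_data = 0.5` is the EDM paper's default, not a value confirmed for GWM, and `network` is a hypothetical stand-in for the DiT.

```python
import math


def edm_coefficients(sigma: float, sigma_data: float = 0.5):
    """Return (c_skip, c_out, c_in, c_noise) for noise level sigma.

    sigma_data is the assumed standard deviation of the clean data;
    0.5 is the EDM default, not a value taken from the GWM paper.
    """
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total               # weight on the noisy input
    c_out = sigma * sigma_data / math.sqrt(total)  # scale of the network output
    c_in = 1.0 / math.sqrt(total)                  # normalizes the network input
    c_noise = 0.25 * math.log(sigma)               # noise-level conditioning signal
    return c_skip, c_out, c_in, c_noise


def denoise(network, x_noisy, sigma, sigma_data=0.5):
    """Preconditioned denoiser: D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma, sigma_data)
    return c_skip * x_noisy + c_out * network(c_in * x_noisy, c_noise)
```

The point of the scaling is that the network sees roughly unit-variance inputs and predicts a roughly unit-variance target at every noise level, which is what stabilizes training.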
Core Idea
Three-stage pipeline:
1. Scene → 3D Gaussians: Splatt3R converts RGB images to 3D Gaussian sets without camera calibration.
2. Gaussians → Latent: 3D Gaussian VAE compresses variable-size Gaussian sets to fixed-length latent vectors via cross-attention encoder; reconstructs via Transformer decoder. Trained with Chamfer loss + rendering loss.
3. Latent dynamics: DiT learns p(future Gaussian latent | history Gaussians, actions). Action conditioning via cross-attention. Generates future 3D Gaussian states that can be decoded and rendered.
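Stage 2 above can be illustrated with a minimal NumPy sketch of its two ingredients: cross-attention pooling with a fixed number of query vectors, which maps a Gaussian set of any size to a fixed-length latent, and a symmetric Chamfer distance between Gaussian centers. This is not the authors' implementation. The 14-dimensional Gaussian feature layout (position 3, scale 3, rotation quaternion 4, opacity 1, RGB 3), the latent sizes, and the random weights are all assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_pool(gaussians, queries, w_k, w_v):
    """Compress N Gaussians (N, F) into M latent tokens (M, D).

    queries: (M, D) learned query vectors (random here, for illustration).
    w_k, w_v: (F, D) key and value projections.
    The output shape depends only on the queries, not on N.
    """
    keys = gaussians @ w_k                                  # (N, D)
    values = gaussians @ w_v                                # (N, D)
    scores = queries @ keys.T / np.sqrt(queries.shape[1])   # (M, N)
    return softmax(scores, axis=1) @ values                 # (M, D)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance (squared Euclidean) between point sets (N, 3) and (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) pairwise distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_dim, latent_dim, n_tokens = 14, 32, 8  # hypothetical sizes
    queries = rng.normal(size=(n_tokens, latent_dim))
    w_k = rng.normal(size=(feat_dim, latent_dim))
    w_v = rng.normal(size=(feat_dim, latent_dim))
    for n_gaussians in (100, 250):
        scene = rng.normal(size=(n_gaussians, feat_dim))
        latent = cross_attention_pool(scene, queries, w_k, w_v)
        print(n_gaussians, "Gaussians ->", latent.shape)
```

Both scenes produce a latent of the same shape, which is what lets a DiT with a fixed token count operate on scenes with different numbers of Gaussians.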
Policy coupling:
- Imitation Learning: features from the first DiT denoising step serve as a rich state encoding for the policy
- Model-based RL: GWM integrated as learned simulator within MBPO
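The model-based RL coupling can be pictured with a toy sketch of MBPO's central mechanism: short synthetic rollouts that branch from real states and are generated by the learned model. Here `model_step` and `policy` are hypothetical scalar stand-ins. In GWM the state would be a Gaussian latent and the model step would be a DiT sampling pass, assumed here to also return a reward.

```python
import random


def branched_rollouts(real_states, model_step, policy, horizon, n_starts, rng):
    """Generate short model rollouts that branch from real states (the MBPO idea).

    Returns a list of (state, action, reward, next_state) transitions
    that a policy learner would consume alongside real data.
    """
    synthetic = []
    for _ in range(n_starts):
        state = rng.choice(real_states)      # branch from a state seen in the real environment
        for _ in range(horizon):             # keep rollouts short to limit model error
            action = policy(state)
            next_state, reward = model_step(state, action)
            synthetic.append((state, action, reward, next_state))
            state = next_state
    return synthetic


def toy_model_step(state, action):
    """Hypothetical learned dynamics: additive motion, reward for staying near zero."""
    next_state = state + action
    return next_state, -abs(next_state)


def toy_policy(state):
    """Hypothetical policy: move halfway toward zero."""
    return -0.5 * state


if __name__ == "__main__":
    rng = random.Random(0)
    data = branched_rollouts([1.0, -2.0, 0.5], toy_model_step, toy_policy,
                             horizon=3, n_starts=4, rng=rng)
    print(len(data), "synthetic transitions")
```

Short horizons are the design choice that matters: model error compounds with rollout length, so MBPO trades rollout depth for many branches from real states.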
Results
| Metric | GWM | Baseline | Notes |
|---|---|---|---|
| IL avg (50 demos) | +10.5% | — | 24 RoboCasa tasks |
| IL avg (3k demos) | +7.6% | — | 24 RoboCasa tasks |
| MBRL convergence | 2× faster | iVideoGPT | Meta-World |
| Real-world success | 65% | 35% (DP) | Franka FR3 |
| Novel distractor | 60% | 0% | Generalization |
| 3DGS only ablation | 18% | — | Validates Gaussian repr. |
Evaluated on: Meta-World (50 tasks), RoboCasa (24 kitchen tasks), Franka FR3 (real).
Limitations
Author-stated:
- Depends on Splatt3R reconstruction quality; dynamic or transparent objects may fail
- Real-world validation is limited to a single pick-and-place task variant
- Compute cost vs. baselines not benchmarked
Reviewer-identified:
- No explicit collision query interface built on top of the predicted Gaussian scene — safety evaluation still requires downstream geometry processing
- Gaussian representation is view-based and may not fully capture internal object geometry needed for contact-rich manipulation
- Unposed image assumption (Splatt3R) creates implicit calibration dependency in multi-camera setups
Reproducibility
- Code: gaussian-world-model.github.io (project page, code availability stated)
- Data: Meta-World (public), RoboCasa (public), Franka real data (custom)
- Compute: no figures given in this note; the pipeline requires Splatt3R inference plus VAE and DiT training
Insights
GWM is architecturally the most aligned with a 3D/4D reconstruction-based safety pipeline. Because future states are represented as 3D Gaussian sets, a safety evaluator could in principle:
- Run GWM to predict future Gaussian states after a candidate action
- Decode the predicted Gaussians and extract geometry (mesh or point cloud)
- Run analytic collision detection on predicted geometry
- Accept/reject the action before execution
This is the pipeline that the 2509.07996 survey identifies as a gap (Action Interpreter → safety constraint checker). GWM provides the Action Interpreter piece; the safety constraint checker is still the missing link.
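The accept/reject loop listed above could look roughly like the following sketch. It treats predicted Gaussian centers as a point cloud and tests them against keep-out spheres, ignoring each Gaussian's covariance extent. GWM itself provides no such interface: `predict_fn`, the keep-out spheres, and the 2 cm margin are hypothetical placeholders.

```python
import numpy as np


def min_clearance(gaussian_centers, sphere_centers, sphere_radii):
    """Smallest signed distance from any Gaussian center (N, 3) to any sphere surface.

    sphere_centers: (K, 3), sphere_radii: (K,). Negative means penetration.
    """
    diff = gaussian_centers[:, None, :] - sphere_centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                  # (N, K) center-to-center distances
    return float((dist - sphere_radii[None, :]).min())


def accept_action(predict_fn, scene, action, sphere_centers, sphere_radii, margin=0.02):
    """Accept the action only if the predicted scene stays clear of every keep-out sphere.

    predict_fn(scene, action) is a hypothetical wrapper that returns predicted
    Gaussian centers; in a real system it would call the world model and decoder.
    """
    predicted = predict_fn(scene, action)
    return min_clearance(predicted, sphere_centers, sphere_radii) >= margin


if __name__ == "__main__":
    # Toy stand-in for the world model: the action rigidly translates the scene.
    translate = lambda scene, action: scene + action
    scene = np.array([[0.0, 0.0, 0.0]])
    keep_out_centers = np.array([[0.0, 0.0, 1.0]])
    keep_out_radii = np.array([0.3])
    for dz in (0.6, 0.8):
        ok = accept_action(translate, scene, np.array([0.0, 0.0, dz]),
                           keep_out_centers, keep_out_radii)
        print("move up by", dz, "accepted" if ok else "rejected")
```

A production checker would need proper geometry extraction and mesh-level collision tests; the sketch only shows where an explicit 3D prediction makes such a check possible at all.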
The ablation in which 3DGS alone reaches 18% success (vs. 65% with full GWM) suggests that the generative dynamics model contributes roughly 47 percentage points on top of pure reconstruction.
Connections
- tesseract-learning-4d-embodied-world-models — complementary: TesserAct uses depth/normal prediction; GWM uses explicit Gaussian primitives
- 3d-4d-world-modeling-survey-2509.07996 — GWM is a prime example of the “Neural Simulator” functional category
- semantic-metric-bayesian-risk-fields-vlm-robot-safety — safety evaluator that could sit downstream of GWM’s predicted scenes
- safevla-safety-alignment-vla-constrained-learning — safety constraint side complementing GWM’s prediction capability
- world-models-robot-safety — vault synthesis connecting world model prediction to safety
- world-models
- gaussian-splatting
- 3d-reconstruction