GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

This note was generated by AI analysis.

Created: 2025-08-25 Source: https://arxiv.org/abs/2508.17600

Summary

GWM (ICCV 2025) is the first world model to use 3D Gaussian Splatting as its internal scene representation. A DiT in latent Gaussian space learns to predict how Gaussian primitives evolve under robot actions. This gives the world model an explicit 3D representation: unlike video diffusion models, GWM’s predicted future states are renderable from any viewpoint and, in principle, support analytic geometry queries on the predicted scene.

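GWM (ICCV 2025) uses 3D Gaussian Splatting as the world model's internal representation and a Diffusion Transformer to learn action-conditioned scene prediction in Gaussian latent space. On a real Franka robot it beats Diffusion Policy by 30 percentage points (65% vs. 35%), making it the most direct application of 3DGS to robot manipulation prediction to date.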

Prerequisites

3D Gaussian Splatting (3DGS) — explicit scene representation using learnable Gaussian primitives
Splatt3R — feed-forward model converting unposed images to 3D Gaussians without calibration
Diffusion Transformer (DiT) — generative backbone operating in latent space
EDM preconditioning — training stabilization technique for diffusion models
MBPO (Model-Based Policy Optimization) — RL framework that GWM integrates with
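
For readers new to EDM preconditioning, the sketch below shows the standard coefficient formulas from the EDM formulation (Karras et al., 2022). It is a minimal illustration, assuming sigma_data = 0.5 (EDM's default); this note does not state which value or noise schedule GWM uses, and `raw_network` is a hypothetical placeholder for the DiT.

```python
import math

SIGMA_DATA = 0.5  # assumption: EDM's default data standard deviation


def edm_coefficients(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for noise level sigma > 0."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total               # weight on the noisy input
    c_out = sigma * sigma_data / math.sqrt(total)  # scale of the network output
    c_in = 1.0 / math.sqrt(total)                  # scale of the network input
    c_noise = math.log(sigma) / 4.0                # noise-level conditioning
    return c_skip, c_out, c_in, c_noise


def denoise(raw_network, x_noisy, sigma):
    """Preconditioned denoiser: D(x) = c_skip * x + c_out * F(c_in * x, c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma)
    scaled = [c_in * v for v in x_noisy]
    residual = raw_network(scaled, c_noise)  # hypothetical stand-in for the DiT
    return [c_skip * x + c_out * r for x, r in zip(x_noisy, residual)]
```

The point of the scaling is that the network sees roughly unit-variance inputs and predicts roughly unit-variance targets at every noise level, which is what stabilizes training.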

Core Idea

Three-stage pipeline:

1. Scene → 3D Gaussians: Splatt3R converts RGB images to 3D Gaussian sets without camera calibration.

2. Gaussians → Latent: 3D Gaussian VAE compresses variable-size Gaussian sets to fixed-length latent vectors via cross-attention encoder; reconstructs via Transformer decoder. Trained with Chamfer loss + rendering loss.

3. Latent dynamics: DiT learns p(future Gaussian latent | history Gaussians, actions). Action conditioning via cross-attention. Generates future 3D Gaussian states that can be decoded and rendered.
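
Two ingredients of stage 2 can be illustrated in a few lines: a Chamfer distance between two sets of Gaussian centers, and cross-attention pooling that maps a variable-size set to a fixed number of latent vectors. This is a simplified sketch, not the paper's implementation: it uses squared distances in the Chamfer term, omits the learned key/value projections a real encoder would have, and the function names are hypothetical.

```python
import math


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance (squared-distance variant) between two
    non-empty lists of 3D points, e.g. Gaussian centers."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # Average distance from each source point to its nearest target point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(set_a, set_b) + one_way(set_b, set_a)


def cross_attention_pool(queries, tokens):
    """Pool any number of tokens into len(queries) vectors.

    Simplification: tokens serve as both keys and values, with no
    learned projections. In a trained encoder, `queries` would be
    learned parameters.
    """
    dim = len(queries[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
                  for t in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        pooled.append([sum(w * t[k] for w, t in zip(weights, tokens)) / norm
                       for k in range(dim)])
    return pooled
```

Because the output length depends only on the number of queries, scenes with different Gaussian counts all map to latents of the same shape, which is what lets a DiT operate on them.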

Policy coupling:

Imitation Learning: first DiT denoising step features used as a rich state encoder
Model-based RL: GWM integrated as learned simulator within MBPO
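
The model-based RL coupling can be pictured through MBPO's core loop: short rollouts branched from real states, generated by the learned model, that augment the data used for policy updates. The sketch below is a toy illustration under stated assumptions: `model_step` is a hypothetical stand-in for GWM (here a 1D toy dynamics function), and the policy update itself (SAC in the original MBPO) is omitted.

```python
import random


def branched_rollouts(real_states, model_step, policy, horizon, num_branches, rng):
    """Generate short model rollouts that each start from a real state."""
    synthetic = []
    for _ in range(num_branches):
        state = rng.choice(real_states)  # branch from previously observed data
        for _ in range(horizon):
            action = policy(state)
            next_state, reward = model_step(state, action)  # learned simulator
            synthetic.append((state, action, next_state, reward))
            state = next_state
    return synthetic


# Toy stand-ins (assumptions for illustration only).
def toy_model_step(state, action):
    next_state = state + action
    return next_state, -abs(next_state)  # reward: stay close to zero


def toy_policy(state):
    return -0.5 * state


transitions = branched_rollouts(
    real_states=[1.0, -2.0],
    model_step=toy_model_step,
    policy=toy_policy,
    horizon=3,
    num_branches=4,
    rng=random.Random(0),
)
```

Keeping the horizon short limits how far model errors can compound, which is the main design argument behind branching from real states rather than rolling out whole episodes.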

Results

Metric	GWM	Baseline	Notes
IL avg (50 demos)	+10.5%	—	24 RoboCasa tasks
IL avg (3k demos)	+7.6%	—	24 RoboCasa tasks
MBRL convergence	2× faster	iVideoGPT	Meta-World
Real-world success	65%	35% (DP)	Franka FR3
Novel distractor	60%	0%	Generalization
3DGS only ablation	18%	—	Validates Gaussian repr.

Evaluated on: Meta-World (50 tasks), RoboCasa (24 kitchen tasks), Franka FR3 (real).

Limitations

Author-stated:

Depends on Splatt3R reconstruction quality; dynamic or transparent objects may fail
Real-world validation limited to single pick-and-place task variant
Compute cost vs. baselines not benchmarked

Reviewer-identified:

No explicit collision query interface built on top of the predicted Gaussian scene — safety evaluation still requires downstream geometry processing
Gaussian representation is view-based and may not fully capture internal object geometry needed for contact-rich manipulation
Unposed image assumption (Splatt3R) creates implicit calibration dependency in multi-camera setups

Reproducibility

Code: gaussian-world-model.github.io (project page, code availability stated)
Data: Meta-World (public), RoboCasa (public), Franka real data (custom)
Compute: not quantified in this note; the pipeline requires Splatt3R reconstruction plus DiT training

Insights

GWM is architecturally the most aligned with a 3D/4D reconstruction-based safety pipeline. Because future states are represented as 3D Gaussian sets, a safety evaluator could in principle:

Run GWM to predict future Gaussian states after a candidate action
Render predicted Gaussians to extract geometry (mesh/point cloud)
Run analytic collision detection on predicted geometry
Accept/reject the action before execution

This is the pipeline the 2509.07996 survey identifies as a gap (Action Interpreter → safety constraint checker). GWM provides the Action Interpreter piece; the safety constraint checker is still the missing link.
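
The accept/reject loop above can be sketched as a simple clearance check. This is a hypothetical illustration, not something GWM provides: `predict_points` stands in for running GWM and extracting a point cloud from the predicted Gaussians, the gripper is approximated by a sphere, and `margin` is an assumed safety threshold.

```python
import math


def min_clearance(points, sphere_center, sphere_radius):
    """Smallest distance from any scene point to the sphere's surface."""
    return min(math.dist(p, sphere_center) for p in points) - sphere_radius


def filter_action(action, predict_points, gripper_center_after,
                  gripper_radius, margin):
    """Return (accepted, clearance) for one candidate action."""
    points = predict_points(action)        # hypothetical: GWM + geometry extraction
    center = gripper_center_after(action)  # hypothetical: forward kinematics
    clearance = min_clearance(points, center, gripper_radius)
    return clearance >= margin, clearance


# Toy stand-ins: a static two-point scene; the action is the gripper height.
def toy_predict_points(action):
    return [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]


def toy_gripper_center(action):
    return (0.0, 0.0, action)


safe, _ = filter_action(0.5, toy_predict_points, toy_gripper_center, 0.1, 0.05)
unsafe, _ = filter_action(0.12, toy_predict_points, toy_gripper_center, 0.1, 0.05)
```

A real checker would need mesh or signed-distance queries and some way to account for uncertainty in the predicted scene; the sphere test only shows where such a check would sit in the loop.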

The ablation in which 3DGS alone achieves 18% success (vs. 65% with full GWM) indicates how much the generative dynamics model adds on top of pure reconstruction.

Connections

tesseract-learning-4d-embodied-world-models — complementary: TesserAct uses depth/normal prediction; GWM uses explicit Gaussian primitives
3d-4d-world-modeling-survey-2509.07996 — GWM is a prime example of the “Neural Simulator” functional category
semantic-metric-bayesian-risk-fields-vlm-robot-safety — safety evaluator that could sit downstream of GWM’s predicted scenes
safevla-safety-alignment-vla-constrained-learning — safety constraint side complementing GWM’s prediction capability
world-models-robot-safety — vault synthesis connecting world model prediction to safety
world-models
gaussian-splatting
3d-reconstruction
