3D and 4D World Modeling: A Survey

This article was generated by AI analysis.

Created: 2025-09-04 Source: https://arxiv.org/abs/2509.07996

Summary

A 50-page survey from NUS, Zhejiang University, and Horizon Robotics (Sept 2025, v3 Dec 2025) that fills a gap in prior survey coverage: while existing surveys focus on 2D image/video generative methods, this work specifically addresses native 3D/4D world modeling using occupancy grids, LiDAR point clouds, and neural representations (NeRF, 3DGS). The authors propose a taxonomy spanning VideoGen, OccGen, and LiDARGen paradigms, with datasets and evaluation metrics tailored to geometric modalities. Applications cover robotics, autonomous driving, XR, and digital twins.

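A systematic survey from September 2025 that addresses the neglect of native 3D/4D signals (occupancy grids, LiDAR, NeRF, 3DGS) in existing world-model research. It proposes a taxonomy of three paradigms (VideoGen / OccGen / LiDARGen) and four functional types (data engines, action interpreters, neural simulators, scene reconstructors), with applications spanning robotics, autonomous driving, XR, and digital twins.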

Prerequisites

Neural Radiance Fields (NeRF) — implicit neural representation of 3D scenes from posed images
3D Gaussian Splatting (3DGS) — explicit 3D representation using learnable Gaussian primitives
Occupancy grids — voxel-space binary/probabilistic scene encoding
Diffusion models — generative backbone used across VideoGen, OccGen, LiDARGen
Action-conditioned generation — conditioning on robot/agent actions to predict future states

Core Idea

The central argument: native 3D/4D signals “encode metric geometry, visibility, and motion in the coordinates where physics acts,” making them fundamentally superior to 2D projections for safety-critical applications. The survey organizes world models along two axes:

Representation axis (what is modeled):

Video streams — temporal RGB sequences with geometric coherence
Occupancy grids — voxelized scene occupation
LiDAR point clouds — direct metric geometry
Neural representations — NeRF and Gaussian splatting implicit/explicit encodings
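To make the relationship between two of these representations concrete, the following is a minimal sketch of turning a LiDAR-style point cloud into a binary occupancy grid. It is not taken from the survey; the function name `voxelize` and the parameters `origin`, `voxel_size`, and `shape` are illustrative assumptions, and real pipelines typically add ray casting for free-space and per-voxel semantic labels.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: Iterable[Point], origin: Point,
             voxel_size: float, shape: Tuple[int, int, int]) -> Set[Voxel]:
    """Return the set of occupied voxel indices (hypothetical helper).

    A voxel is marked occupied if at least one point falls inside it.
    Points outside the grid bounds are ignored.
    """
    occupied: Set[Voxel] = set()
    for p in points:
        # Map metric coordinates to integer grid indices.
        idx = tuple(math.floor((c - o) / voxel_size) for c, o in zip(p, origin))
        if all(0 <= i < n for i, n in zip(idx, shape)):
            occupied.add(idx)
    return occupied


cloud = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (1.2, 0.0, 0.0), (-5.0, 0.0, 0.0)]
grid = voxelize(cloud, origin=(0.0, 0.0, 0.0), voxel_size=0.5, shape=(4, 4, 4))
```

Storing occupancy as a sparse set of indices is a design choice made here for brevity; dense arrays are more common when the grid feeds a neural network.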

Conditioning axis (what drives generation):

𝒞_geo: camera poses, depth maps, HD maps
𝒞_act: trajectories, control commands, navigation goals
𝒞_sem: text prompts, scene graphs, object labels
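One way to picture the three conditioning groups is as a typed container passed to a generator. The sketch below is an assumption made for illustration only: the class names and field names are hypothetical and do not come from the survey or from any library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GeometricCondition:
    """C_geo: camera poses, depth maps, HD maps (field names are hypothetical)."""
    camera_pose: Optional[List[List[float]]] = None  # e.g. a 4x4 extrinsic matrix
    depth_map: Optional[List[List[float]]] = None
    hd_map_id: Optional[str] = None


@dataclass
class ActionCondition:
    """C_act: trajectories, control commands, navigation goals."""
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)
    control_commands: List[Tuple[float, float]] = field(default_factory=list)
    navigation_goal: Optional[Tuple[float, float, float]] = None


@dataclass
class SemanticCondition:
    """C_sem: text prompts, scene graphs, object labels."""
    text_prompt: Optional[str] = None
    scene_graph: List[Tuple[str, str, str]] = field(default_factory=list)
    object_labels: List[str] = field(default_factory=list)


@dataclass
class Conditioning:
    """Bundle of optional condition groups handed to a world model."""
    geo: Optional[GeometricCondition] = None
    act: Optional[ActionCondition] = None
    sem: Optional[SemanticCondition] = None

    def active_groups(self) -> List[str]:
        # Report which groups are actually supplied for this sample.
        return [name for name in ("geo", "act", "sem")
                if getattr(self, name) is not None]


cond = Conditioning(
    act=ActionCondition(trajectory=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    sem=SemanticCondition(text_prompt="rainy night"),
)
```

Keeping each group optional reflects that a given method may condition on only a subset of the three.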

Functional taxonomy (what the model does):

Data Engines — synthesize diverse 3D/4D scenarios for augmentation
Action Interpreters — forecast future states under action conditions
Neural Simulators — closed-loop agent-environment interactions
Scene Reconstructors — recover complete scenes from partial observations

Results

Method Category	Representative Work	Key Capability
VideoGen	MagicDrive, DriveDreamer	BEV/geometry-conditioned video synthesis
VideoGen (closed-loop)	DriveArena	Traffic synthesis + autoregressive generation
VideoGen (reconstruction)	StreetGaussian	NeRF/3DGS scene reconstruction
OccGen (3D)	SSD, XCube	Latent diffusion for 3D occupancy
OccGen (4D)	FF4D, OccWorld	Temporal occupancy forecasting
OccGen (hybrid)	WoVoGen	Occupancy + video joint generation
OccGen (LLM-scale)	OccLLaMA, OccSora	Transformer-based large-scale occupancy
LiDARGen	—	Point cloud generation and forecasting

Evaluation dimensions: generation quality (fidelity/diversity), forecasting quality (prediction accuracy), planning-centric quality (downstream task performance), reconstruction-centric quality (geometric accuracy).
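As a rough illustration of what geometric accuracy can mean in practice, the sketch below implements two metrics that are widely used for occupancy and point-cloud outputs: voxel IoU and a symmetric Chamfer distance. Whether the survey adopts exactly these definitions is not stated in this note, so treat the choice of metrics and the function names as assumptions. The Chamfer variant here uses unsquared Euclidean distances; squared variants are also common.

```python
import math
from typing import Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxel_iou(pred: Set[Voxel], target: Set[Voxel]) -> float:
    """Intersection over union of two sets of occupied voxel indices."""
    union = pred | target
    if not union:
        return 1.0  # both grids empty: treated as perfect agreement
    return len(pred & target) / len(union)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean nearest-neighbor distance from a to b, plus the same from b to a."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Brute-force nearest neighbor; O(len(src) * len(dst)).
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


iou = voxel_iou({(0, 0, 0), (1, 0, 0)}, {(1, 0, 0), (2, 0, 0)})
cd = chamfer_distance([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
```

The brute-force nearest-neighbor search is only suitable for small point sets; larger clouds would normally use a spatial index.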

Limitations

Author-stated:

Standardized benchmarking across modalities is lacking
Long-horizon generation quality degrades
Physical fidelity and controllability remain unsolved
Real-time computational efficiency is a bottleneck
Cross-modal generation coherence is difficult to maintain

Reviewer-identified:

Survey scope skews heavily toward autonomous driving; coverage of robotic manipulation is thinner than the title implies
LiDAR section appears less developed than VideoGen/OccGen coverage
Geometry-to-safety pipeline (from reconstruction to safety constraint enforcement) is framed as future work, not a solved problem

Reproducibility

Code: A GitHub repository with a systematic literature summary is available (linked in the paper)
Datasets: Catalogued in Section 4 with standardized benchmark descriptions
Compute: Survey only; no training experiments reported

Insights

The key framing for safety research: the survey’s core claim that 3D/4D signals encode physics-space geometry is the theoretical justification for using 3D/4D reconstruction (rather than latent video) as a world model for safety evaluation. If a robot policy can be evaluated in a model that operates in metric geometry coordinates, collision detection and safety constraints can be computed analytically rather than inferred from VLM interpretation of 2D rollout videos.
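The idea of computing a safety constraint analytically in metric coordinates can be sketched as follows. This is a simplified illustration, not a method from the survey: the robot is approximated by a sphere of radius `robot_radius` at each predicted position, obstacles are given as points in the same metric frame, and the function names and the `margin` parameter are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def min_clearance(robot_centers: Sequence[Point], robot_radius: float,
                  obstacle_points: Sequence[Point]) -> float:
    """Smallest gap (in metres) between the robot sphere and any obstacle point.

    A negative value means the sphere overlaps an obstacle point.
    """
    if not robot_centers or not obstacle_points:
        raise ValueError("need at least one robot position and one obstacle point")
    nearest = min(math.dist(c, p) for c in robot_centers for p in obstacle_points)
    return nearest - robot_radius


def is_safe(robot_centers: Sequence[Point], robot_radius: float,
            obstacle_points: Sequence[Point], margin: float = 0.1) -> bool:
    """Constraint: clearance must stay at or above the required margin."""
    return min_clearance(robot_centers, robot_radius, obstacle_points) >= margin


rollout = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # predicted robot positions
obstacles = [(1.0, 1.0, 0.0), (5.0, 5.0, 5.0)]  # reconstructed scene points
clearance = min_clearance(rollout, 0.3, obstacles)
```

Because every quantity is in metres, the result is a direct distance with a threshold, which is the contrast with interpreting 2D rollout videos that the paragraph above draws.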

The four-way functional taxonomy (Data Engine / Action Interpreter / Neural Simulator / Scene Reconstructor) provides clean vocabulary for positioning new work. A safety-focused pre-execution evaluator would be an “Action Interpreter” that feeds into a safety constraint checker, which is currently a gap in the surveyed literature.

The OccGen category (occupancy-based 4D prediction) is the most directly relevant to collision safety: voxel-space occupancy maps allow direct geometric collision queries, exact up to the voxel resolution, without neural rendering overhead.
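A collision query against a 4D occupancy forecast might look like the sketch below. It assumes the forecast is a list of per-timestep sets of occupied voxel indices and that the robot is approximated by an axis-aligned cube of half-width `half_extent`; these conventions and the function names are illustrative assumptions, not an interface from any surveyed method.

```python
import math
from typing import List, Optional, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def footprint_voxels(center: Point, half_extent: float,
                     origin: Point, voxel_size: float) -> Set[Voxel]:
    """All voxels touched by an axis-aligned cube around `center` (conservative)."""
    lo = [math.floor((c - half_extent - o) / voxel_size) for c, o in zip(center, origin)]
    hi = [math.floor((c + half_extent - o) / voxel_size) for c, o in zip(center, origin)]
    return {(i, j, k)
            for i in range(lo[0], hi[0] + 1)
            for j in range(lo[1], hi[1] + 1)
            for k in range(lo[2], hi[2] + 1)}


def first_collision(trajectory: Sequence[Point],
                    occupancy_forecast: Sequence[Set[Voxel]],
                    half_extent: float, origin: Point,
                    voxel_size: float) -> Optional[int]:
    """Index of the first timestep where the robot overlaps predicted occupancy."""
    for t, (center, occupied) in enumerate(zip(trajectory, occupancy_forecast)):
        if footprint_voxels(center, half_extent, origin, voxel_size) & occupied:
            return t
    return None  # no overlap at any forecast step


plan = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)]
static_obstacle: List[Set[Voxel]] = [{(2, 0, 0)}, {(2, 0, 0)}, {(2, 0, 0)}]
moving_obstacle: List[Set[Voxel]] = [{(2, 0, 0)}, {(3, 0, 0)}, {(4, 0, 0)}]
hit_static = first_collision(plan, static_obstacle, 0.4, (0.0, 0.0, 0.0), 1.0)
hit_moving = first_collision(plan, moving_obstacle, 0.4, (0.0, 0.0, 0.0), 1.0)
```

The query is a set intersection per timestep, so its cost depends on the footprint size rather than on any rendering step. Its accuracy is bounded by the voxel size and by the quality of the forecast itself.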

Connections

geometry-aware-4d-video-generation-robot-manipulation — 4D video generation that enforces geometric consistency, represents the VideoGen approach
particleformer-3d-point-cloud-world-model-robot-manipulation — point-cloud-based 3D world model, maps to LiDARGen/Scene Reconstructor category
world-models-robot-safety — vault synthesis on safety applications, this survey fills the 3D/4D side
safevla-safety-alignment-vla-constrained-learning — safety constraint side that could consume 3D/4D world model outputs
semantic-metric-bayesian-risk-fields-vlm-robot-safety — closest existing work to using metric 3D for safety risk computation
world-models
3d-reconstruction
safe-rl
