Summary

A 50-page survey from NUS, Zhejiang University, and Horizon Robotics (Sept 2025, v3 Dec 2025) that fills a gap in prior survey coverage: while existing surveys focus on 2D image/video generative methods, this work specifically addresses native 3D/4D world modeling using occupancy grids, LiDAR point clouds, and neural representations (NeRF, 3DGS). The authors propose a taxonomy spanning VideoGen, OccGen, and LiDARGen paradigms, with datasets and evaluation metrics tailored to geometric modalities. Applications cover robotics, autonomous driving, XR, and digital twins.

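A systematic survey from September 2025 that specifically addresses the neglect of native 3D/4D signals (occupancy grids, LiDAR, NeRF, 3DGS) in existing world-model research. It proposes a taxonomy of three paradigms (VideoGen / OccGen / LiDARGen), covers four functional types (data engines, action interpreters, neural simulators, scene reconstructors), and spans applications in robotics, autonomous driving, XR, and digital twins.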

Prerequisites

  • Neural Radiance Fields (NeRF) — implicit neural representation of 3D scenes from posed images
  • 3D Gaussian Splatting (3DGS) — explicit 3D representation using learnable Gaussian primitives
  • Occupancy grids — voxel-space binary/probabilistic scene encoding
  • Diffusion models — generative backbone used across VideoGen, OccGen, LiDARGen
  • Action-conditioned generation — conditioning on robot/agent actions to predict future states
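As a concrete reference for the occupancy-grid prerequisite, here is a minimal sketch of how a point cloud might be binned into a binary voxel encoding. The function name `voxelize`, its parameters, and the sparse set-of-indices layout are illustrative assumptions, not something taken from the survey; real pipelines typically use dense tensors and may store probabilities or semantic labels instead of a binary flag.

```python
import math

def voxelize(points, origin, voxel_size, grid_shape):
    """Return the set of (i, j, k) voxel indices that contain at least one point.

    Hypothetical helper for illustration: a sparse binary occupancy encoding.
    """
    occupied = set()
    for point in points:
        # Convert metric coordinates to integer voxel indices along each axis.
        index = tuple(
            math.floor((p - o) / voxel_size) for p, o in zip(point, origin)
        )
        # Discard points that fall outside the grid bounds.
        if all(0 <= i < n for i, n in zip(index, grid_shape)):
            occupied.add(index)
    return occupied

# Four points in metres; the last one lies outside the 2 x 2 x 1 grid.
points = [(0.1, 0.1, 0.1), (0.9, 0.2, 0.1), (1.5, 1.5, 0.5), (9.0, 9.0, 9.0)]
occupied = voxelize(points, origin=(0.0, 0.0, 0.0), voxel_size=1.0, grid_shape=(2, 2, 1))
print(sorted(occupied))  # [(0, 0, 0), (1, 1, 0)]
```

The first two points share a voxel, so the grid ends up with two occupied cells. The choice of `voxel_size` sets the resolution at which geometry is preserved.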

Core Idea

The central argument: native 3D/4D signals “encode metric geometry, visibility, and motion in the coordinates where physics acts,” which the authors argue makes them better suited than 2D projections to safety-critical applications. The survey organizes world models along two axes (representation and conditioning), complemented by a functional taxonomy:

Representation axis (what is modeled):

  • Video streams — temporal RGB sequences with geometric coherence
  • Occupancy grids — voxelized scene occupancy
  • LiDAR point clouds — direct metric geometry
  • Neural representations — NeRF (implicit) and 3D Gaussian splatting (explicit) encodings

Conditioning axis (what drives generation):

  • 𝒞_geo: camera poses, depth maps, HD maps
  • 𝒞_act: trajectories, control commands, navigation goals
  • 𝒞_sem: text prompts, scene graphs, object labels

Functional taxonomy (what the model does):

  1. Data Engines — synthesize diverse 3D/4D scenarios for augmentation
  2. Action Interpreters — forecast future states under action conditions
  3. Neural Simulators — closed-loop agent-environment interactions
  4. Scene Reconstructors — recover complete scenes from partial observations
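To make the vocabulary concrete, the sketch below encodes the three condition groups and the four functional roles as plain Python types, then implements a toy Action Interpreter. Everything here is a hypothetical illustration: the class names, the `forecast` method, and the constant-velocity update are assumptions standing in for a learned generative model, not an interface defined by the survey.

```python
from dataclasses import dataclass, field
from enum import Enum

class Function(Enum):
    """The four functional roles named in the taxonomy."""
    DATA_ENGINE = "data engine"
    ACTION_INTERPRETER = "action interpreter"
    NEURAL_SIMULATOR = "neural simulator"
    SCENE_RECONSTRUCTOR = "scene reconstructor"

@dataclass
class Conditions:
    """Hypothetical container for the three condition groups."""
    geo: dict = field(default_factory=dict)  # C_geo: camera poses, depth maps, HD maps
    act: list = field(default_factory=list)  # C_act: trajectories, control commands, goals
    sem: dict = field(default_factory=dict)  # C_sem: text prompts, scene graphs, labels

class ToyActionInterpreter:
    """Stand-in for a learned model: integrates 2D velocity commands."""
    function = Function.ACTION_INTERPRETER

    def __init__(self, dt):
        self.dt = dt  # seconds per forecast step (assumed)

    def forecast(self, state, conditions):
        """Roll the (x, y) state forward once per velocity command in conditions.act."""
        x, y = state
        rollout = []
        for vx, vy in conditions.act:
            x, y = x + vx * self.dt, y + vy * self.dt
            rollout.append((x, y))
        return rollout

model = ToyActionInterpreter(dt=0.5)
commands = [(2.0, 0.0), (2.0, 0.0), (0.0, 4.0)]  # velocity commands in m/s
future = model.forecast((0.0, 0.0), Conditions(act=commands))
print(future)  # [(1.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
```

In these terms, a Neural Simulator would differ mainly in that the agent chooses each next command after observing the previous forecast, closing the loop, whereas the interpreter above consumes a fixed action sequence.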

Results

Method Category           | Representative Work      | Key Capability
VideoGen                  | MagicDrive, DriveDreamer | BEV/geometry-conditioned video synthesis
VideoGen (closed-loop)    | DriveArena               | Traffic synthesis + autoregressive generation
VideoGen (reconstruction) | StreetGaussian           | NeRF/3DGS scene reconstruction
OccGen (3D)               | SSD, XCube               | Latent diffusion for 3D occupancy
OccGen (4D)               | FF4D, OccWorld           | Temporal occupancy forecasting
OccGen (hybrid)           | WoVoGen                  | Occupancy + video joint generation
OccGen (LLM-scale)        | OccLLaMA, OccSora        | Transformer-based large-scale occupancy
LiDARGen                  | (not listed)             | Point cloud generation and forecasting

Evaluation dimensions: generation quality (fidelity/diversity), forecasting quality (prediction accuracy), planning-centric quality (downstream task performance), reconstruction-centric quality (geometric accuracy).
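As one illustration of what a forecasting-quality score on a geometric modality can look like, the sketch below computes intersection-over-union (IoU) between a predicted and a ground-truth set of occupied voxels. IoU is a widely used overlap measure for occupancy prediction, but its use here is an assumption for illustration; these notes do not reproduce the survey's exact metric definitions, and the function name `occupancy_iou` is hypothetical.

```python
def occupancy_iou(predicted, ground_truth):
    """Intersection-over-union between two collections of occupied voxel indices."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    union = predicted | ground_truth
    if not union:
        # Both grids empty: treated as perfect agreement (a convention chosen here).
        return 1.0
    return len(predicted & ground_truth) / len(union)

# Toy forecast: two of three predicted voxels match the ground truth.
pred = {(0, 0, 0), (1, 0, 0), (1, 1, 0)}
truth = {(1, 0, 0), (1, 1, 0), (2, 1, 0)}
iou = occupancy_iou(pred, truth)
print(iou)  # 0.5
```

Two voxels are shared and four are in the union, giving 0.5. For semantic occupancy the same computation would usually be repeated per class and averaged.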

Limitations

Author-stated:

  • Standardized benchmarking across modalities is lacking
  • Long-horizon generation quality degrades
  • Physical fidelity and controllability remain unsolved
  • Real-time computational efficiency is a bottleneck
  • Cross-modal generation coherence is difficult to maintain

Reviewer-identified:

  • Survey scope is weighted heavily toward autonomous driving; robotics manipulation coverage is thinner than the title implies
  • LiDAR section appears less developed than VideoGen/OccGen coverage
  • Geometry-to-safety pipeline (from reconstruction to safety constraint enforcement) is framed as future work, not a solved problem

Reproducibility

  • Code: A GitHub repository with the systematic literature summary is available (linked in the paper)
  • Datasets: Catalogued in Section 4 with standardized benchmark descriptions
  • Compute: Survey only; no training experiments reported

Insights

The key framing for safety research: the survey’s core claim that 3D/4D signals encode physics-space geometry is the theoretical justification for using 3D/4D reconstruction (rather than latent video) as a world model for safety evaluation. If a robot policy can be evaluated in a model that operates in metric geometry coordinates, collision detection and safety constraints can be computed analytically rather than inferred from VLM interpretation of 2D rollout videos.

The four-way functional taxonomy (Data Engine / Action Interpreter / Neural Simulator / Scene Reconstructor) provides clean vocabulary for positioning new work. A safety-focused pre-execution evaluator would be an “Action Interpreter” that feeds into a safety constraint checker, which is currently a gap in the surveyed literature.

The OccGen category (occupancy-based 4D prediction) is the most directly relevant to collision safety: voxel-space occupancy maps allow direct geometric collision queries, exact up to the voxel resolution, without neural rendering overhead.
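A minimal sketch of such a collision query follows, assuming the robot is approximated by an axis-aligned box and the planned path is given as discrete waypoints. The helper names, the box approximation, and the sparse set of occupied voxel indices are all assumptions made for illustration; nothing here is an interface from the survey or from any surveyed method.

```python
import math
from itertools import product

def voxels_overlapping_box(center, half_extent, origin, voxel_size):
    """Yield every voxel index touched by an axis-aligned box (boundaries inclusive)."""
    ranges = []
    for c, h, o in zip(center, half_extent, origin):
        lo = math.floor((c - h - o) / voxel_size)
        hi = math.floor((c + h - o) / voxel_size)
        ranges.append(range(lo, hi + 1))
    return product(*ranges)

def first_collision(waypoints, half_extent, occupied, origin, voxel_size):
    """Return (step, voxel) for the first waypoint whose box hits an occupied voxel.

    Returns None if no sampled waypoint collides. Only the sampled waypoints are
    checked, so the motion between them is not swept.
    """
    for step, center in enumerate(waypoints):
        for voxel in voxels_overlapping_box(center, half_extent, origin, voxel_size):
            if voxel in occupied:
                return step, voxel
    return None

occupied = {(3, 0, 0)}               # e.g. one forecast obstacle voxel
robot_half_extent = (0.4, 0.4, 0.4)  # metres, assumed box approximation
path = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
hit = first_collision(path, robot_half_extent, occupied, (0.0, 0.0, 0.0), 1.0)
print(hit)  # (3, (3, 0, 0))
```

The check reduces to integer arithmetic and set lookups, which is the sense in which the query is analytic rather than inferred from rendered frames. Its precision is bounded by the voxel size and by how densely the path is sampled.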

Connections