This note was generated by AI analysis.
Created: 2025-04-29 Source: https://arxiv.org/abs/2504.20995
Summary
TesserAct (UMass/HKUST/Harvard, ICCV 2025) learns 4D embodied world models by extending video generation to predict RGB+Depth+Normal jointly — turning a 2D video diffusion model into a geometric world model without requiring explicit 3D optimization. The resulting model can reconstruct coherent 4D scenes from predicted frames and synthesize novel viewpoints at ~1 min inference time (vs. ~2 hours for NeRF approaches).
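By having a video generation model jointly predict RGB, depth, and surface normals (RGB-DN), TesserAct obtains a geometry-aware 4D world model at low computational cost. It outperforms video planning baselines on nine RLBench manipulation tasks and supports novel view synthesis in under 1 minute.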
Prerequisites
- Video diffusion models (CogVideoX) — base architecture fine-tuned in this work
- Depth estimation from monocular video — used to augment training data
- Surface normal estimation — second geometric channel; combined with depth for 3D coherence
- RLBench manipulation benchmark — primary evaluation environment
Core Idea
The insight is that depth and surface normal information, when predicted alongside RGB frames, gives the world model implicit 3D grounding without the overhead of explicit point cloud or Gaussian generation. Two novel loss functions enforce temporal-spatial coherence:
- Consistency loss — ensures predicted depth/normals are temporally stable across frames
- Regularization loss — prevents depth/normal collapse in low-texture regions
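As a rough illustration of the two terms listed above, the following NumPy sketch shows one way such penalties could be written. It is not the paper's formulation: the function names, the plain frame-difference form of the consistency term (which ignores camera and object motion), the choice to anchor the regularizer to a reference depth map, and the `reg_weight` value are all assumptions made for this example.

```python
import numpy as np

def temporal_consistency_penalty(depth_seq):
    """Mean absolute depth change between consecutive frames.

    depth_seq: array of shape (T, H, W) with T >= 2.
    Assumption for this sketch: no motion compensation is applied,
    so any real scene or camera motion is also penalized.
    """
    depth_seq = np.asarray(depth_seq, dtype=np.float64)
    frame_deltas = np.diff(depth_seq, axis=0)  # shape (T-1, H, W)
    return float(np.abs(frame_deltas).mean())

def anchor_regularizer(depth, reference_depth):
    """Mean absolute deviation from a reference depth estimate.

    Assumption for this sketch: keeping the solution near a reference
    discourages degenerate (collapsed) depth where image cues are weak.
    """
    depth = np.asarray(depth, dtype=np.float64)
    reference_depth = np.asarray(reference_depth, dtype=np.float64)
    return float(np.abs(depth - reference_depth).mean())

def total_penalty(depth_seq, reference_seq, reg_weight=0.1):
    """Weighted sum of the two hypothetical terms above."""
    consistency = temporal_consistency_penalty(depth_seq)
    regularization = anchor_regularizer(depth_seq, reference_seq)
    return consistency + reg_weight * regularization
```

The same structure would apply to normal maps by swapping the absolute difference for an angular or cosine distance between unit vectors.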
Dataset strategy: Rather than collecting new robot data, the authors augment existing datasets (RT1 Fractal, Bridge, SomethingSomethingV2) with estimated depth and normals at scale, creating an RGB-DN dataset of ~285k videos. This data-centric approach is the primary scalability mechanism.
Results
| Metric | TesserAct | Baselines compared | Notes |
|---|---|---|---|
| RLBench success rate | 41-88% per task | Image BC / video planning | Best overall |
| 4D Chamfer distance | Lowest | OpenSora, CogVideoX, 4D-PointE | Geometric accuracy (lower is better) |
| Novel view CLIP Score | 83.02 | n/a | ~1 min inference |
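For readers unfamiliar with the Chamfer distance row in the table above, here is a minimal NumPy sketch of a symmetric Chamfer distance between two point clouds. Conventions differ across papers (squared vs. unsquared distances, sum vs. mean), and this note does not state which variant the authors report, so the squared-distance, mean-reduced form below is an assumption.

```python
import numpy as np

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3).

    Uses squared Euclidean distances averaged over each cloud (one common
    convention; assumed here, not taken from the paper).
    """
    a = np.asarray(cloud_a, dtype=np.float64)
    b = np.asarray(cloud_b, dtype=np.float64)
    # Pairwise squared distances, shape (N, M). Brute force: O(N * M) memory.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    a_to_b = sq_dists.min(axis=1).mean()  # each point in a to its nearest in b
    b_to_a = sq_dists.min(axis=0).mean()  # each point in b to its nearest in a
    return float(a_to_b + b_to_a)
```

A value of zero means every point in each cloud coincides with some point in the other; larger values indicate greater geometric mismatch.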
Limitations
Author-stated:
- Single-surface reconstruction only — occluded geometry not modeled
- Multi-view RGB-DN generation identified as future work
Reviewer-identified:
- Depth/normal estimation quality degrades for transparent or reflective objects
- The model predicts geometry indirectly (via DN channels) rather than tracking 3D object states; collision queries still require an additional geometric processing step on top of the predicted depth maps
- No explicit action space grounding — actions condition generation but don’t constrain predicted geometry to be physically consistent with the action’s kinematic effects
Reproducibility
- Code: GitHub repository available (UMass-Embodied-AGI/TesserAct)
- Dataset: Assembled from public sources (RT1, Bridge, SSv2) + RLBench synthetic
- Compute: CogVideoX fine-tuning; scale not specified in abstract
Insights
TesserAct offers a practical path toward 4D world models for researchers without large compute budgets: it leverages existing video diffusion checkpoints and existing robot datasets augmented with estimated geometry. The RGB-DN representation is a useful middle ground: geometric enough for better spatial reasoning, cheap enough to train at scale.
For safety evaluation, the key limitation is that TesserAct predicts depth maps (per-pixel geometry), not an explicit object-level 3D representation. A collision check on TesserAct output would require: predict RGB-DN sequence → convert depth maps to point clouds → run collision geometry query on point cloud trajectory. This pipeline is possible but adds latency and error propagation compared to GWM’s Gaussian primitive approach.
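The pipeline described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not part of TesserAct: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), metric depth, a query point (for example an end-effector position) already expressed in the camera frame, and a simple sphere-of-radius clearance test. All function names are hypothetical.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to camera-frame points (pinhole model)."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels without valid depth

def min_clearance(points, query):
    """Distance from a 3D query point to the nearest point in the cloud."""
    if len(points) == 0:
        return float("inf")
    offsets = points - np.asarray(query, dtype=np.float64)
    return float(np.linalg.norm(offsets, axis=1).min())

def first_collision_frame(depth_seq, query_traj, intrinsics, radius):
    """Index of the first frame where the query comes within `radius` of the
    predicted surface, or None if no frame does."""
    fx, fy, cx, cy = intrinsics
    for t, (depth, query) in enumerate(zip(depth_seq, query_traj)):
        points = depth_to_points(depth, fx, fy, cx, cy)
        if min_clearance(points, query) < radius:
            return t
    return None
```

The nearest-point search here is brute force; a spatial index such as a k-d tree would be the usual choice for full-resolution depth maps. Because depth maps only capture visible surfaces, this check inherits the single-surface limitation noted earlier.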
Connections
- gwm-gaussian-world-models-robotic-manipulation — complementary work: GWM uses explicit 3DGS primitives instead of predicted depth maps
- geometry-aware-4d-video-generation-robot-manipulation — related ICLR 2026 work with cross-view pointmap alignment
- 3d-4d-world-modeling-survey-2509.07996 — survey that contextualizes TesserAct in the OccGen/VideoGen taxonomy
- world-models-robot-safety — safety evaluation pipeline that could consume TesserAct’s RGB-DN predictions
- world-models
- 4d-reconstruction