TesserAct: Learning 4D Embodied World Models

This article was generated by AI analysis.

Created: 2025-04-29 Source: https://arxiv.org/abs/2504.20995

Summary

TesserAct (UMass/HKUST/Harvard, ICCV 2025) learns 4D embodied world models by extending video generation to predict RGB+Depth+Normal jointly — turning a 2D video diffusion model into a geometric world model without requiring explicit 3D optimization. The resulting model can reconstruct coherent 4D scenes from predicted frames and synthesize novel viewpoints at ~1 min inference time (vs. ~2 hours for NeRF approaches).

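By having a video generation model jointly predict RGB, depth, and surface normals (RGB-DN), TesserAct obtains a geometry-aware 4D world model at low computational cost. It outperforms video-planning baselines on nine RLBench manipulation tasks and supports novel view synthesis in under one minute.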

Prerequisites

Video diffusion models (CogVideoX) — base architecture fine-tuned in this work
Depth estimation from monocular video — used to augment training data
Surface normal estimation — second geometric channel; combined with depth for 3D coherence
RLBench manipulation benchmark — primary evaluation environment

Core Idea

The insight is that depth and surface normal information, when predicted alongside RGB frames, gives the world model implicit 3D grounding without the overhead of explicit point cloud or Gaussian generation. Two novel loss functions enforce temporal-spatial coherence:

Consistency loss — ensures predicted depth/normals are temporally stable across frames
Regularization loss — prevents depth/normal collapse in low-texture regions
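To make the idea of a temporal consistency term concrete, here is a minimal sketch that penalizes frame-to-frame changes in predicted depth. It is a generic stand-in rather than the paper's formulation, which this note does not spell out. The function name, the nested-list frame layout, and the assumption that pixels stay aligned across frames (no motion compensation) are all illustrative choices.

```python
from typing import List

Frame = List[List[float]]  # one depth map, indexed [row][col]


def temporal_consistency_loss(depth_frames: List[Frame]) -> float:
    """Mean squared depth change between consecutive frames.

    Illustrative only: assumes each pixel observes the same surface point
    in every frame, which ignores camera and object motion.
    """
    if len(depth_frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(depth_frames, depth_frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for d_prev, d_curr in zip(row_prev, row_curr):
                total += (d_curr - d_prev) ** 2
                count += 1
    return total / count if count else 0.0


# Usage: a static scene gives zero loss; flickering depth is penalized.
stable = [[[1.0, 2.0], [3.0, 4.0]]] * 3
flicker = [[[1.0, 1.0]], [[2.0, 2.0]]]
print(temporal_consistency_loss(stable), temporal_consistency_loss(flicker))
```

A practical version would likely compare pixels after warping one frame onto the next, so that legitimate motion is not penalized. That step is omitted here to keep the sketch short.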

Dataset strategy: Rather than collecting new robot data, the authors augment existing datasets (RT1 Fractal, Bridge, SomethingSomethingV2) with estimated depth/normals at scale, creating a ~285k video RGB-DN dataset. This data-centric approach is the primary scalability mechanism.

Results

Metric	TesserAct	Baselines compared	Notes
RLBench avg success	41-88% per task	Image BC / video planning	Best overall
4D Chamfer distance	Lowest	OpenSora, CogVideoX, 4D-PointE	Geometric accuracy
Novel view CLIP Score	83.02	—	~1 min inference
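For readers unfamiliar with the Chamfer distance used for the geometric comparison, the sketch below computes one common symmetric variant between two point sets: the mean nearest-neighbour distance in each direction, summed. Other variants use squared distances or different normalization, and how the paper aggregates the metric over time is not stated in this note. The function name and the brute-force search are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty 3D point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Average distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Usage: identical sets score 0; lower values mean closer geometry.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
print(chamfer_distance(predicted, reference))
```

The brute-force search is quadratic in the number of points. For real point clouds a spatial index such as a k-d tree would be the usual choice.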

Limitations

Author-stated:

Single-surface reconstruction only — occluded geometry not modeled
Multi-view RGB-DN generation identified as future work

Reviewer-identified:

Depth/normal estimation quality degrades for transparent or reflective objects
The model predicts geometry indirectly (via DN channels) rather than tracking 3D object states; collision queries still require an additional geometric processing step on top of the predicted depth maps
No explicit action space grounding — actions condition generation but don’t constrain predicted geometry to be physically consistent with the action’s kinematic effects

Reproducibility

Code: GitHub repository available (UMass-Embodied-AGI/TesserAct)
Dataset: Assembled from public sources (RT1, Bridge, SSv2) + RLBench synthetic
Compute: CogVideoX fine-tuning; scale not specified in abstract

Insights

TesserAct offers a practical path toward 4D world models for researchers without large compute budgets: it leverages existing video diffusion checkpoints and existing robot datasets augmented with estimated geometry. The RGB-DN representation is a useful middle ground: geometric enough for better spatial reasoning, cheap enough to train at scale.

For safety evaluation, the key limitation is that TesserAct predicts per-pixel geometry (depth maps), not an explicit object-level 3D representation. A collision check on TesserAct output would require three steps: predict the RGB-DN sequence → convert the depth maps to point clouds → run a collision query on the point-cloud trajectory. This pipeline is feasible, but it adds latency and compounds errors relative to the explicit Gaussian primitives used by GWM (Gaussian World Models).
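The last two steps of that pipeline can be sketched as follows. This is a minimal illustration under several assumptions not stated in the source: the predicted depth is metric, pinhole intrinsics (fx, fy, cx, cy) are known, and the region to protect is an axis-aligned box in the camera frame. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
DepthMap = Sequence[Sequence[float]]  # indexed [v][u], assumed to be in metres


def depth_to_points(depth: DepthMap, fx: float, fy: float,
                    cx: float, cy: float) -> List[Point]:
    """Unproject a depth map into camera-frame 3D points (pinhole model)."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip invalid or missing depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def collides(points: Sequence[Point], box_min: Point, box_max: Point) -> bool:
    """True if any point lies inside the axis-aligned box."""
    return any(
        all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))
        for p in points
    )


def first_collision_frame(depth_frames: Sequence[DepthMap],
                          intrinsics: Tuple[float, float, float, float],
                          box_min: Point, box_max: Point) -> Optional[int]:
    """Index of the first predicted frame whose points enter the box, else None."""
    fx, fy, cx, cy = intrinsics
    for t, depth in enumerate(depth_frames):
        if collides(depth_to_points(depth, fx, fy, cx, cy), box_min, box_max):
            return t
    return None


# Usage: a single-pixel "scene" whose surface moves into the protected box.
frames = [[[1.0]], [[2.0]]]
print(first_collision_frame(frames, (1.0, 1.0, 0.0, 0.0),
                            (-0.1, -0.1, 1.9), (0.1, 0.1, 2.1)))
```

Each stage inherits the errors of the one before it. A depth bias shifts every unprojected point, which in turn changes the outcome of the box test.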

Connections

gwm-gaussian-world-models-robotic-manipulation — complementary work: GWM uses explicit 3DGS primitives instead of predicted depth maps
geometry-aware-4d-video-generation-robot-manipulation — related ICLR 2026 work with cross-view pointmap alignment
3d-4d-world-modeling-survey-2509.07996 — survey that contextualizes TesserAct in the OccGen/VideoGen taxonomy
world-models-robot-safety — safety evaluation pipeline that could consume TesserAct’s RGB-DN predictions
world-models
4d-reconstruction
