Geometry-aware 4D Video Generation for Robot Manipulation

This note was generated by AI analysis.

Created: 2026-04-06 Source: https://arxiv.org/abs/2507.01099

Summary

This paper proposes a 4D video generation framework that produces spatially and temporally consistent multi-view RGB-D sequences for robot manipulation tasks. The key innovation is supervising a video diffusion model (built on Stable Video Diffusion) with cross-view pointmap alignment during training — forcing the model to learn a shared 3D representation of the scene across camera views. At inference, it takes a single RGB-D image per view and generates future frames without requiring camera pose input; the outputs then feed into an off-the-shelf 6DoF pose tracker (FoundationPose) to extract end-effector trajectories.

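The paper supervises a video diffusion model with cross-view pointmap alignment so that it generates 4D-consistent multi-view RGB-D sequences and generalizes to novel viewpoints without camera poses as input. An off-the-shelf 6DoF pose tracker extracts the robot end-effector trajectory from the generated videos, reaching a 64% success rate on simulated manipulation tasks, well above the baselines (Dreamitate 9%, Diffusion Policy 12%).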

Prerequisites

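Stable Video Diffusion (SVD): the video generation backbone used in the paper. Understanding its latent diffusion mechanism helps clarify how geometric supervision is inserted into the training pipeline.
DUSt3R / pointmap representation: the paper's geometric-consistency method is directly inspired by DUSt3R; the pointmap (a 3D coordinate per pixel) is the core data structure.
6DoF pose estimation: FoundationPose recovers the end-effector trajectory from the generated videos, so understanding rigid-body pose tracking is essential for judging whether the method is feasible.
Imitation learning from video: the robot policy comes from observing generated videos and then extracting actions, which places it within video-based imitation learning.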
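
To make the pointmap representation concrete, here is a minimal sketch that unprojects a depth image into per-pixel 3D coordinates, assuming a pinhole camera model. The function name `depth_to_pointmap`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the 2x2 depth values are illustrative assumptions, not taken from the paper or from DUSt3R.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_pointmap(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[List[Point]]:
    """Unproject a depth image into a pointmap in the camera frame.

    Pinhole model (assumed): pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = z.
    """
    pointmap = []
    for v, row in enumerate(depth):
        pointmap.append([((u - cx) * z / fx, (v - cy) * z / fy, z)
                         for u, z in enumerate(row)])
    return pointmap


# Hypothetical 2x2 depth image (metres) and intrinsics, for illustration only.
depth = [[1.0, 1.0],
         [2.0, 2.0]]
pointmap = depth_to_pointmap(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pointmap[0][0])  # 3D point for the top-left pixel
print(pointmap[1][1])  # 3D point for the bottom-right pixel
```

The result has the same height and width as the depth image, with one 3D point per pixel, which is what makes pointmaps from different views directly comparable once they are expressed in a common frame.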

Core Idea

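Conventional pixel-level video generation lacks cross-view 3D geometric constraints, so the same scene looks inconsistent when observed from different camera angles. The paper's core insight is to add an explicit geometric supervision signal at training time (a cross-view pointmap alignment loss), so that the model learns RGB appearance and 3D scene structure jointly during the diffusion process. Concretely, the decoders of the two camera views exchange information through cross-attention, and each view's predicted pointmap is required to align in the same world coordinate frame. A model trained this way needs no camera poses as inference input yet still generates 4D sequences that are consistent across views, which lets an off-the-shelf pose tracker reliably extract the robot end-effector trajectory from the generated videos.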
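
The alignment idea can be sketched as follows, under stated assumptions: ground-truth points from a second view are moved into a shared frame with extrinsics that are known only at training time, and the predicted pointmap is penalised by its mean squared distance to that target. The function names, the plain mean-squared-error form, and the toy values are hypothetical; this note does not specify the exact loss the paper uses.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def transform_points(points: Sequence[Point],
                     rotation: Sequence[Sequence[float]],
                     translation: Sequence[float]) -> List[Point]:
    """Apply the rigid transform p' = R p + t to each 3D point."""
    out = []
    for p in points:
        out.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)))
    return out


def pointmap_alignment_loss(pred: Sequence[Point],
                            target: Sequence[Point]) -> float:
    """Mean squared Euclidean distance between matched 3D points."""
    assert len(pred) == len(target)
    total = 0.0
    for p, q in zip(pred, target):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(pred)


# Hypothetical example: two points observed in view B's camera frame.
points_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
# Assumed training-time extrinsics mapping view B into the shared frame.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]
target = transform_points(points_b, R, t)

# Stand-ins for model outputs: one aligned, one with a depth error.
perfect_pred = [(1.0, 0.0, 1.0), (2.0, 0.0, 2.0)]
shifted_pred = [(1.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
loss_perfect = pointmap_alignment_loss(perfect_pred, target)
loss_shifted = pointmap_alignment_loss(shifted_pred, target)
print(loss_perfect, loss_shifted)
```

In this sketch the extrinsics appear only when building the supervision target, which mirrors the note's point that camera poses are needed during training but not as an inference input.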

Results

Task / Benchmark	This work	Dreamitate	Diffusion Policy
Average success rate (StoreCerealBoxUnderShelf, PutSpatulaOnTable, PlaceAppleFromBowlIntoBin)	~64%	9%	12%
Cross-view mIoU (w/ cross-attn)	0.70	—	—
Cross-view mIoU (w/o cross-attn)	0.41	—	—

Average success rate 64% across 3 simulated tasks; +55pp vs Dreamitate, +52pp vs Diffusion Policy.
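
For readers unfamiliar with the mIoU figures in the table, the following is a minimal sketch of mean intersection-over-union between pairs of binary masks. It assumes the metric compares a mask from one view against the corresponding mask in another view; how the paper obtains and pairs those masks is not described in this note, and the function names and toy masks are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            inter += 1 if (x and y) else 0
            union += 1 if (x or y) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average IoU over a list of (mask_a, mask_b) pairs."""
    ious: List[float] = [mask_iou(a, b) for a, b in pairs]
    return sum(ious) / len(ious)


# Toy masks for illustration only.
half_overlap = ([[1, 1], [0, 0]], [[1, 0], [0, 0]])
full_overlap = ([[0, 1], [0, 1]], [[0, 1], [0, 1]])
iou_half = mask_iou(*half_overlap)
miou = mean_iou([half_overlap, full_overlap])
print(iou_half, miou)
```

Under this reading, the jump from 0.41 to 0.70 with cross-attention means the masks from the two views overlap far more closely when the decoders can exchange information.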

Limitations

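Author-stated: requires multi-view RGB-D datasets, which are hard to collect in the real world (demanding hardware and calibration requirements); inference is slow (~30 s per 10 steps), so it is not yet suitable for real-time deployment.
Unstated: evaluation covers only 3 simulated tasks and 1 real-world task, with limited task complexity; the per-task variation behind the average success rate is not detailed; the method depends on FoundationPose's accuracy, which becomes a bottleneck when objects are heavily occluded or poorly textured.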

Reproducibility

Code: available at https://github.com/lzylucy/4dgen
Datasets: self-generated in simulation (16-20 demos/task), multi-view RGB-D
Compute: not stated explicitly; since the method fine-tunes SVD, multiple GPUs (A100 class) are likely needed

Insights

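Removing camera pose as a bottleneck: conventional multi-view generation methods need accurate camera parameters. This paper learns geometry implicitly through training-time geometric alignment and drops the calibration requirement at inference, which matters a great deal for real-world deployment.
A DUSt3R-inspired bridge: bringing the pointmap representation from the 3D reconstruction community into robot video generation is a textbook case of cross-domain method transfer.
Video generation as a practical path to world models: rather than manipulating a 3D scene directly, the approach reaches 3D consistency through 2D video generation plus geometric supervision, which costs less compute and inherits the semantic priors of large pretrained models.
Accepted at ICLR 2026: signals community recognition that the path from geometry-aware video generation to robot policy is viable.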

Connections

Raw Excerpt

“We propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. This enables the model to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, without relying on camera poses as input.”
