World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

This note was generated by AI analysis.

Created: 2026-03-28 Source: https://arxiv.org/html/2602.06508v1

Summary

World-VLA-Loop proposes a closed-loop framework for jointly refining video world models and Vision-Language-Action (VLA) policies through iterative RL post-training entirely in simulation. The key innovation is a state-aware world model that predicts both future video frames and reward signals, trained on a curated SANS (Success and Near-Success) dataset. Policy failure rollouts are fed back to refine the world model, which in turn improves the next RL training round, creating a co-evolving cycle.

Prerequisites

Vision-Language-Action (VLA) models — robot manipulation models mapping language to control actions
Diffusion Transformers (DiT) — backbone for video generation used in Cosmos-Predict 2
Group Relative Policy Optimization (GRPO) — RL algorithm used for VLA post-training
Reinforcement Learning post-training — fine-tuning pretrained models using RL rewards rather than imitation
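
As a quick refresher on the GRPO prerequisite, the sketch below shows the group-relative advantage that gives the algorithm its name: several rollouts of the same task are scored, and each reward is normalized against the mean and standard deviation of its own group, so no learned value function is needed. The function name, the `eps` constant, and the example rewards are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    `rewards` holds the scalar rewards of several rollouts sampled for the
    same task. `eps` (an assumed constant) avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts of one task with binary success rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages; a group in which every rollout gets the same reward yields all-zero advantages and therefore no learning signal.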

Core Idea

Existing video-diffusion world models for robotics follow actions imprecisely: they hallucinate successful outcomes even when given incorrect actions. World-VLA-Loop addresses this by (1) training on SANS data that includes near-success trajectories, which forces the model to distinguish fine-grained action consequences; (2) adding a reward prediction head trained jointly with the video generator, so that rewards are intrinsically aligned with visual outcomes; and (3) establishing an iterative loop in which VLA policy failures are fed back to refine the world model, which in turn makes the next RL round more accurate. This removes the need for costly physical robot interaction during RL training.
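
The data flow of the three steps above can be sketched as a short loop. This is a minimal structural sketch, not the authors' implementation: `closed_loop`, `fit_world_model`, `rl_post_train`, `collect_rollouts`, and the `"success"` key are all hypothetical names, and the stand-in lambdas at the bottom exist only to make the example runnable.

```python
def closed_loop(world_model, policy, dataset, rounds,
                fit_world_model, rl_post_train, collect_rollouts):
    """Alternate world-model refinement and policy RL for `rounds` rounds.

    All three callables are hypothetical placeholders:
      fit_world_model(world_model, dataset) -> refined world model
      rl_post_train(policy, world_model)    -> policy trained inside the model
      collect_rollouts(policy)              -> list of dicts with a "success" flag
    """
    for _ in range(rounds):
        # 1) Fit the world model on the current data
        #    (successes, near-successes, and any fed-back failures).
        world_model = fit_world_model(world_model, dataset)
        # 2) RL post-train the policy against the world model's
        #    generated frames and predicted rewards.
        policy = rl_post_train(policy, world_model)
        # 3) Roll out the updated policy and keep only the failures
        #    as extra training data for the next round.
        rollouts = collect_rollouts(policy)
        dataset = dataset + [r for r in rollouts if not r["success"]]
    return world_model, policy, dataset


# Toy stand-ins: integers count how many times each component was updated.
wm, pi, data = closed_loop(
    world_model=0, policy=0, dataset=[{"success": True}], rounds=2,
    fit_world_model=lambda wm, d: wm + 1,
    rl_post_train=lambda p, wm: p + 1,
    collect_rollouts=lambda p: [{"success": False}, {"success": True}],
)
```

The point of the sketch is the ordering: the world model is refit before each RL round, so the failures gathered from one round's policy shape the simulator that the next round trains in.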

Results

Task / Benchmark	This work	Baseline (SFT)	Delta (percentage points)
LIBERO-Object Task 1	97.9%	73.9%	+24.0
LIBERO-Object Task 2	91.9%	73.9%	+18.0
LIBERO-Goal Task 1	100%	91.9%	+8.1
LIBERO-Goal Task 2	96.2%	86.1%	+10.1
LIBERO-Spatial Task 1	93.9%	83.9%	+10.0
LIBERO-Spatial Task 2	94.0%	87.9%	+6.1
Real-World (avg)	36.7%	13.3%	+23.4
Iterative refinement (real-world)	50.0%	13.3%	+36.7
World model visual alignment	87.9% avg	n/a	n/a
World model reward alignment	86.4% avg	n/a	n/a

Limitations

Author-stated: LIBERO-100 (long-horizon tasks, 200+ video frames) excluded due to quality drift in autoregressive generation
Author-stated: Real-world RL success rates are evaluated within the simulator, with final physical results reported separately
Unstated: The SANS dataset construction requires manual teleoperation and failure trajectory collection, which has its own cost
Unstated: The co-evolving loop’s convergence properties are not theoretically characterized; empirical results cover only 2 iterations
Unstated: All real-world experiments use a fixed camera; generalization to varied viewpoints is unvalidated

Reproducibility

Code: Project page at https://showlab.github.io/World-VLA-Loop/ (code availability not explicitly stated in the clip)
Datasets: SANS dataset introduced in this work; ManiSkill and LIBERO benchmarks are public
Compute: Built on Cosmos-Predict 2 (large pretrained model); fine-tuning requires significant GPU resources

Insights

The key insight is that near-success trajectories are more valuable than random failures for world model training: they lie in the “hard negative” zone where the model must learn fine-grained causal reasoning. Training the reward head jointly with the video generator is elegant because it pushes the generator to be causally precise rather than merely visually plausible. The comparison against an external VLM reward, in which Qwen3-VL hallucinates, suggests that general-purpose VLMs are unreliable reward models for RL.
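
One way to picture the joint training mentioned above is as a single objective over shared parameters. The weighted-sum form, the symbol $\lambda$, and the names of the two terms are assumptions made for illustration; this note does not record the exact loss used in the paper.

```latex
% Assumed form: a shared backbone \theta serves both the video generator
% and the reward head; \lambda is a hypothetical weighting coefficient.
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{video}}(\theta)
  + \lambda \, \mathcal{L}_{\text{reward}}(\theta)
```

Under this assumed form, gradients from the reward term flow into the same backbone that generates the frames, which is why a reward that disagrees with the generated outcome is penalized at the representation level rather than patched on afterwards.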

Connections

Raw Excerpt

failure rollouts generated by the VLA policy are iteratively fed back to refine the world model’s precision, which in turn enhances subsequent RL optimization
