This document was generated by AI analysis.
Created: 2026-03-28 Source: https://arxiv.org/html/2602.06508v1
Summary
World-VLA-Loop proposes a closed-loop framework for jointly refining video world models and Vision-Language-Action (VLA) policies through iterative RL post-training entirely in simulation. The key innovation is a state-aware world model that predicts both future video frames and reward signals, trained on a curated SANS (Success and Near-Success) dataset. Policy failure rollouts are fed back to refine the world model, which in turn improves the next RL training round, creating a co-evolving cycle.
Prerequisites
- Vision-Language-Action (VLA) models — robot manipulation models mapping language to control actions
- Diffusion Transformers (DiT) — backbone for video generation used in Cosmos-Predict 2
- Group Relative Policy Optimization (GRPO) — RL algorithm used for VLA post-training
- Reinforcement Learning post-training — fine-tuning pretrained models using RL rewards rather than imitation
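As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same task. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are assumptions made for this example; they are not taken from the paper, and implementations differ on these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    # Assumption: population std; some implementations use the sample std.
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Eight hypothetical rollouts of one task, scored with a binary success reward.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Successful rollouts end up with positive advantages and failed ones with negative advantages, so no separate value network is needed to provide a baseline.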
Core Idea
Existing robotic world models built on video diffusion suffer from poor action-following precision: they hallucinate successful outcomes even when given incorrect actions. World-VLA-Loop addresses this by (1) training on SANS data that includes near-success trajectories, forcing the model to distinguish fine-grained action consequences; (2) adding a reward prediction head jointly trained with the video generator, so rewards are intrinsically aligned with visual outcomes; and (3) establishing an iterative loop where VLA policy failures are fed back to refine the world model, which in turn makes the next RL round more accurate. Because RL runs inside the learned world model, no costly physical robot interaction is needed during RL training.
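The control flow of the iterative loop can be sketched as follows. This is a minimal skeleton under stated assumptions, not the authors' implementation: `Rollout`, `LoopState`, `co_evolve`, and the three callables are hypothetical names, and the toy stand-ins at the bottom exist only so the example runs.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rollout:
    actions: List[float]
    success: bool

@dataclass
class LoopState:
    dataset: List[Rollout]  # starts as the success and near-success data
    failures_added: List[int] = field(default_factory=list)

def co_evolve(state, train_world_model, rl_post_train, collect_rollouts, rounds=2):
    """Alternate world-model refinement and policy RL for a fixed number of rounds."""
    world_model, policy = None, None
    for _ in range(rounds):
        # 1. Fit (or refine) the world model on all data gathered so far.
        world_model = train_world_model(state.dataset)
        # 2. Post-train the policy with RL inside the world model.
        policy = rl_post_train(policy, world_model)
        # 3. Feed the policy's failure rollouts back into the training data.
        failures = [r for r in collect_rollouts(policy) if not r.success]
        state.dataset.extend(failures)
        state.failures_added.append(len(failures))
    return world_model, policy

# Toy stand-ins (assumptions): they only track counts, not real models.
def train_world_model(dataset):
    return {"num_trajectories": len(dataset)}

def rl_post_train(policy, world_model):
    return {"round": (policy or {"round": 0})["round"] + 1}

def collect_rollouts(policy):
    return [Rollout([0.1], True), Rollout([0.9], False)]

state = LoopState(dataset=[Rollout([0.0], True), Rollout([0.2], False)])
world_model, policy = co_evolve(state, train_world_model, rl_post_train, collect_rollouts)
```

Passing the three stages in as callables keeps the loop itself independent of any particular world model or RL algorithm, which is the part of the design this sketch is meant to show.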
Results
| Task / Benchmark | This work | Baseline (SFT) | Delta (pp) |
|---|---|---|---|
| LIBERO-Object Task 1 | 97.9% | 73.9% | +24.0 |
| LIBERO-Object Task 2 | 91.9% | 73.9% | +18.0 |
| LIBERO-Goal Task 1 | 100% | 91.9% | +8.1 |
| LIBERO-Goal Task 2 | 96.2% | 86.1% | +10.1 |
| LIBERO-Spatial Task 1 | 93.9% | 83.9% | +10.0 |
| LIBERO-Spatial Task 2 | 94.0% | 87.9% | +6.1 |
| Real-World (avg) | 36.7% | 13.3% | +23.4 |
| Iterative refinement (real-world) | 50.0% | 13.3% | +36.7 |
| World model visual alignment | 87.9% avg | — | — |
| World model reward alignment | 86.4% avg | — | — |
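The Delta column is an absolute difference in percentage points, not a relative improvement. The short check below recomputes it from a few of the table's own numbers and contrasts it with the relative gain; the dictionary layout and variable names are choices made for this example.

```python
# (this work, SFT baseline) success rates in percent, copied from the table.
results = {
    "LIBERO-Object Task 1": (97.9, 73.9),
    "LIBERO-Goal Task 1": (100.0, 91.9),
    "Real-World (avg)": (36.7, 13.3),
    "Iterative refinement (real-world)": (50.0, 13.3),
}

# Absolute gain in percentage points: ours minus baseline.
deltas_pp = {task: round(ours - base, 1) for task, (ours, base) in results.items()}

# Relative gain: ours divided by baseline, a different and larger-looking number.
ratios = {task: round(ours / base, 2) for task, (ours, base) in results.items()}
```

The real-world rows show why the distinction matters: a gain of a few dozen percentage points from a low baseline corresponds to a success rate several times the baseline.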
Limitations
- Author-stated: LIBERO-100 (long-horizon tasks, 200+ video frames) excluded due to quality drift in autoregressive generation
- Author-stated: Real-world RL success rates are evaluated within the simulator; final physical results are reported separately
- Unstated: The SANS dataset construction requires manual teleoperation and failure trajectory collection, which has its own cost
- Unstated: The co-evolving loop’s convergence properties are not theoretically characterized; empirical results cover only 2 iterations
- Unstated: All real-world experiments use a fixed camera; generalization to varied viewpoints is unvalidated
Reproducibility
- Code: Project page at https://showlab.github.io/World-VLA-Loop/ (code availability not explicitly stated in the clip)
- Datasets: SANS dataset introduced in this work; ManiSkill and LIBERO benchmarks are public
- Compute: Built on Cosmos-Predict 2 (large pretrained model); fine-tuning requires significant GPU resources
Insights
The key insight is that near-success trajectories are more valuable than random failures for world model training: they lie in the “hard negative” zone where the model must learn fine-grained causal reasoning. Jointly training the reward head with the video generator is elegant because it pushes the generator to be causally precise rather than merely visually plausible. The comparison against an external VLM reward (Qwen3-VL, which hallucinates) suggests that general-purpose VLMs are unreliable reward models for RL.
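To make the joint-training point concrete, here is a toy objective that adds a reward term to a frame-reconstruction term. It is a sketch under loud assumptions: the paper's actual losses are not given in this note, so mean squared error for frames, binary cross-entropy for a success probability, and the `reward_weight` parameter are all stand-ins chosen for illustration.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally long pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted success probability."""
    prob = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def joint_loss(pred_frame, true_frame, pred_success_prob, true_success, reward_weight=1.0):
    # Assumed form: video term plus a weighted reward term.
    return mse(pred_frame, true_frame) + reward_weight * bce(pred_success_prob, true_success)

# Near-success case: the predicted frame is almost right, but the task failed.
true_frame = [0.0, 0.5, 1.0]
plausible_frame = [0.0, 0.5, 0.9]
hallucinated = joint_loss(plausible_frame, true_frame, pred_success_prob=0.9, true_success=0)
calibrated = joint_loss(plausible_frame, true_frame, pred_success_prob=0.1, true_success=0)
```

With identical frames, the prediction that wrongly reports success is penalized far more than the calibrated one. In a real model the two terms would share a backbone, so the reward term would also shape the visual features; this toy only shows the arithmetic.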
Raw Excerpt
failure rollouts generated by the VLA policy are iteratively fed back to refine the world model’s precision, which in turn enhances subsequent RL optimization