Summary

World-VLA-Loop proposes a closed-loop framework for jointly refining video world models and Vision-Language-Action (VLA) policies through iterative RL post-training entirely in simulation. The key innovation is a state-aware world model that predicts both future video frames and reward signals, trained on a curated SANS (Success and Near-Success) dataset. Policy failure rollouts are fed back to refine the world model, which in turn improves the next RL training round, creating a co-evolving cycle.

Prerequisites

  • Vision-Language-Action (VLA) models: robot manipulation models that map visual observations and language instructions to control actions
  • Diffusion Transformers (DiT): the video-generation backbone used in Cosmos-Predict 2
  • Group Relative Policy Optimization (GRPO): the RL algorithm used for VLA post-training
  • Reinforcement Learning post-training: fine-tuning pretrained models with RL rewards rather than imitation
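
As a quick refresher on the GRPO prerequisite, the sketch below shows its group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same task, so no learned value function is needed. The function name and the choice of population standard deviation are assumptions made for illustration; this note does not describe how the paper implements it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four imagined rollouts of one task: 1.0 = predicted success, 0.0 = predicted failure.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # successes get a positive advantage, failures a negative one
```

A group in which every rollout succeeds (or every rollout fails) yields all-zero advantages, so such groups contribute no learning signal.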

Core Idea

Existing robotic world models based on video diffusion have poor action-following precision: they hallucinate successful outcomes even when given incorrect actions. World-VLA-Loop addresses this by (1) training on SANS data that includes near-success trajectories, forcing the model to distinguish fine-grained action consequences; (2) adding a reward prediction head jointly trained with the video generator, so that predicted rewards stay aligned with the predicted visual outcomes; and (3) establishing an iterative loop in which VLA policy failures are fed back to refine the world model, which in turn makes the next RL round more accurate. Because the policy is optimized against the world model, RL training itself requires no costly physical robot interaction.
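
The control flow of the iterative loop in step (3) can be sketched as follows. This is a minimal illustration, not the authors' code: `co_evolve` and the three callables are hypothetical names, and where the failure rollouts are collected is left abstract because this summary does not specify it. The toy stand-ins exist only so the sketch runs end to end.

```python
from typing import Callable, List


def co_evolve(
    train_policy_in_world_model: Callable[[], None],
    collect_policy_failures: Callable[[], List[object]],
    refine_world_model: Callable[[List[object]], None],
    num_iterations: int = 2,
) -> List[int]:
    """Alternate RL inside the world model with refinement on policy failures.

    Returns the number of failure rollouts fed back in each iteration.
    """
    failures_fed_back = []
    for _ in range(num_iterations):
        train_policy_in_world_model()         # RL post-training on imagined rollouts
        failures = collect_policy_failures()  # failure rollouts from the updated policy
        refine_world_model(failures)          # fold them back into world-model training
        failures_fed_back.append(len(failures))
    return failures_fed_back


# Toy stand-ins (hypothetical); no real model is involved.
state = {"skill": 0, "wm_data": []}


def toy_train() -> None:
    state["skill"] += 1


def toy_collect() -> List[object]:
    # Pretend a more skilled policy fails less often.
    return ["failure"] * max(0, 3 - state["skill"])


def toy_refine(failures: List[object]) -> None:
    state["wm_data"].extend(failures)


print(co_evolve(toy_train, toy_collect, toy_refine))  # [2, 1]
```

Passing the three stages in as callables keeps the loop independent of any particular world model or RL algorithm.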

Results

Task / Benchmark                    This work    Baseline (SFT)    Delta (pp)
LIBERO-Object Task 1                97.9%        73.9%             +24.0
LIBERO-Object Task 2                91.9%        73.9%             +18.0
LIBERO-Goal Task 1                  100%         91.9%             +8.1
LIBERO-Goal Task 2                  96.2%        86.1%             +10.1
LIBERO-Spatial Task 1               93.9%        83.9%             +10.0
LIBERO-Spatial Task 2               94.0%        87.9%             +6.1
Real-World (avg)                    36.7%        13.3%             +23.4
Iterative refinement (real-world)   50.0%        13.3%             +36.7
World model visual alignment        87.9% avg    -                 -
World model reward alignment        86.4% avg    -                 -
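
The deltas in the table are absolute differences in success rate (percentage points), not relative improvements. The short check below recomputes them from the two success-rate columns; the numbers are copied from the table, while the helper names are illustrative.

```python
# (this work, SFT baseline, reported delta); success rates in percent.
rows = {
    "LIBERO-Object Task 1": (97.9, 73.9, 24.0),
    "LIBERO-Object Task 2": (91.9, 73.9, 18.0),
    "LIBERO-Goal Task 1": (100.0, 91.9, 8.1),
    "LIBERO-Goal Task 2": (96.2, 86.1, 10.1),
    "LIBERO-Spatial Task 1": (93.9, 83.9, 10.0),
    "LIBERO-Spatial Task 2": (94.0, 87.9, 6.1),
    "Real-World (avg)": (36.7, 13.3, 23.4),
    "Iterative refinement (real-world)": (50.0, 13.3, 36.7),
}


def delta_pp(this_work: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(this_work - baseline, 1)


def relative_gain(this_work: float, baseline: float) -> float:
    """Gain relative to the baseline, in percent, rounded to one decimal."""
    return round(100.0 * (this_work - baseline) / baseline, 1)


mismatches = [
    task for task, (ours, base, reported) in rows.items()
    if delta_pp(ours, base) != reported
]
print(mismatches)  # [] : every reported delta matches the column difference
print(relative_gain(36.7, 13.3))  # relative view of the real-world average
```

The distinction matters most for the real-world rows, where the low baseline makes the relative gain much larger than the percentage-point gain.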

Limitations

  • Author-stated: LIBERO-100 (long-horizon tasks, 200+ video frames) excluded due to quality drift in autoregressive generation
  • Author-stated: Real-world RL success rates are evaluated within the simulator; the final physical-robot results are reported separately
  • Unstated: The SANS dataset construction requires manual teleoperation and failure trajectory collection, which has its own cost
  • Unstated: The co-evolving loop’s convergence properties are not theoretically characterized; empirical results cover only 2 iterations
  • Unstated: All real-world experiments use a fixed camera; generalization to varied viewpoints is unvalidated

Reproducibility

  • Code: Project page at https://showlab.github.io/World-VLA-Loop/ (code availability not explicitly stated in the clip)
  • Datasets: SANS dataset introduced in this work; ManiSkill and LIBERO benchmarks are public
  • Compute: Built on Cosmos-Predict 2 (large pretrained model); fine-tuning requires significant GPU resources

Insights

The key insight is that near-success trajectories are more valuable than random failures for world model training: they lie in the “hard negative” zone where the model must learn fine-grained causal reasoning. Jointly training the reward head with the video generator is elegant because it pushes the generator to be causally precise rather than merely visually plausible. The comparison against an external VLM reward (Qwen3-VL hallucinates) suggests that general-purpose VLMs are unreliable reward models for RL.
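
To make the hard-negative argument concrete, here is a toy version of a joint objective: a frame-prediction term plus a reward-classification term. It is a sketch under loud assumptions. The real model is a video diffusion model, so the mean-squared frame error here only stands in for its generative loss, and the binary cross-entropy reward term, the `reward_weight` parameter, and the function name are all illustrative rather than taken from the paper.

```python
import math


def joint_loss(pred_frames, true_frames, reward_logit, success, reward_weight=1.0):
    """Toy joint objective: frame prediction error plus reward classification error."""
    # Stand-in for the video generation loss (assumption: plain MSE on flat values).
    video_loss = sum((p - t) ** 2 for p, t in zip(pred_frames, true_frames)) / len(true_frames)
    # Binary cross-entropy on the predicted success probability (assumption).
    p_success = 1.0 / (1.0 + math.exp(-reward_logit))
    reward_loss = -(success * math.log(p_success) + (1 - success) * math.log(1.0 - p_success))
    return video_loss + reward_weight * reward_loss


# A near-success trajectory: frames are predicted almost perfectly, yet the label is failure.
frames_true = [0.50, 0.52, 0.55]
frames_pred = [0.50, 0.52, 0.56]
confident_success = joint_loss(frames_pred, frames_true, reward_logit=3.0, success=0)
confident_failure = joint_loss(frames_pred, frames_true, reward_logit=-3.0, success=0)
print(confident_success > confident_failure)  # True
```

With near-identical frames, almost all of the loss comes from the reward term, which is why such trajectories act as hard negatives: the model is penalized for calling a visually plausible rollout a success.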

Raw Excerpt

failure rollouts generated by the VLA policy are iteratively fed back to refine the world model’s precision, which in turn enhances subsequent RL optimization