ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations

This article was generated by AI analysis.

Created: 2026-04-02 Source: https://rewind-reward.github.io/

Summary

ReWiND (CoRL 2025) is a framework for sample-efficient adaptation to new robot manipulation tasks from language instructions alone, with no per-task demonstrations. It pre-trains a language-conditioned reward model and policy on a small demonstration set augmented with Open-X data, then runs language-guided RL online to generalize to unseen task variations. It reports roughly 2x the success rate of baselines on unseen Meta-World tasks and a 5x improvement over the pre-trained policy on real-world bimanual tasks after 1 hour of online interaction.
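The relabeling step in this pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-step reward is simply the predicted progress for the video prefix ending at that step, and `ProgressFn`, `relabel_rewards`, and `toy_progress` are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: (frames seen so far, instruction) -> progress in [0, 1].
ProgressFn = Callable[[Sequence[float], str], float]


def relabel_rewards(
    frames: Sequence[float], instruction: str, progress_fn: ProgressFn
) -> List[float]:
    """Assign one reward per step by scoring the video prefix ending at that step."""
    return [progress_fn(frames[: t + 1], instruction) for t in range(len(frames))]


def toy_progress(prefix: Sequence[float], instruction: str) -> float:
    # Stand-in for a learned reward model: each "frame" here is already a
    # scalar reading of how far along the task is, clipped to [0, 1].
    return max(0.0, min(1.0, float(prefix[-1])))


# Relabel a four-step toy trajectory for a language instruction.
rewards = relabel_rewards([0.0, 0.25, 0.5, 1.0], "close the drawer", toy_progress)
print(rewards)
```

The same function could be applied both to offline demonstrations before policy pre-training and to fresh rollouts during online RL, since it only needs frames and an instruction.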

Prerequisites

Offline RL (IQL, Implicit Q-Learning): ReWiND pre-trains the policy with IQL on relabeled offline data, so familiarity with offline RL constraints such as distribution shift is needed.
Language-conditioned robot policies: the reward model and policy are both language-conditioned, so familiarity with CLIP-style embedding alignment helps.
Video progress estimation: the reward model scores task progress from video frames, so familiarity with temporal video representations is relevant.
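For readers new to IQL, its core ingredient is the expectile regression loss used to fit the value function. The sketch below shows that loss on a single scalar; the function name and the default `tau = 0.7` are illustrative choices here, not settings reported for ReWiND.

```python
def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """Asymmetric squared error |tau - 1(diff < 0)| * diff**2.

    In IQL, diff = Q(s, a) - V(s). With tau > 0.5, cases where V underestimates
    Q (diff > 0) are penalized more than overestimates, pushing V toward an
    upper expectile of Q over dataset actions.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


# Same magnitude of error, different penalty depending on its sign.
under = expectile_loss(1.0)   # V below Q
over = expectile_loss(-1.0)   # V above Q
print(under, over)
```

Because the loss only evaluates actions that appear in the dataset, it avoids querying the Q-function on out-of-distribution actions, which is how IQL sidesteps the distribution-shift problem mentioned above.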

Core Idea

The key innovation is the design and training of the reward model. ReWiND augments a small demo dataset with a simple trick: reversing successful demo videos yields synthetic failure trajectories at no collection cost, so the model sees falling progress as well as rising progress. The reward model learns to assign progress scores from 0 to 1 to video-language pairs, and the reversed videos supply the decreasing-reward half of the training distribution. This makes the reward model robust to near-failure states, which matters for online RL because the policy starts out performing poorly and still needs a dense, informative reward. Pre-training uses Open-X data plus the relabeled demos, with IQL for the policy; the policy is then fine-tuned online with RL on each new task.
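The rewinding augmentation described above can be sketched in a few lines. This follows the note's description (a full temporal reversal) and assumes linearly spaced progress labels, which is an illustrative simplification; `linear_progress` and `with_reversed_copy` are hypothetical helper names, and frames are represented by placeholder strings.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def linear_progress(num_frames: int) -> List[float]:
    """Progress targets rising evenly from 0.0 (first frame) to 1.0 (last frame)."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [t / (num_frames - 1) for t in range(num_frames)]


def with_reversed_copy(
    frames: Sequence[Frame],
) -> List[Tuple[List[Frame], List[float]]]:
    """Return the original demo plus its reversed copy, each with progress targets."""
    forward = list(frames)
    labels = linear_progress(len(forward))
    # The reversed video reuses the same frames, so its targets fall from 1.0 to 0.0.
    return [(forward, labels), (forward[::-1], labels[::-1])]


pairs = with_reversed_copy(["f0", "f1", "f2", "f3", "f4"])
for video, targets in pairs:
    print(video, targets)
```

Each successful demo therefore contributes two training examples for the same instruction: one where progress rises and one where it falls, without any extra data collection.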

Results

Setting	ReWiND	Best Baseline	Delta
Meta-World (8 unseen tasks, 100k steps)	79% success	~40%	~2x
Real-world bimanual (seen + unseen tasks, 1hr)	High success	Low	5x

Limitations

Author-stated: evaluated on tabletop manipulation; locomotion and contact-rich tasks not tested.
Unstated: the “video rewinding” trick assumes successful demonstrations are available for pre-training, so zero-shot transfer to truly novel tasks with no relevant demos remains unaddressed.

Reproducibility

Code: not linked in this note; the paper is available on OpenReview, and the project page is at rewind-reward.github.io.
Datasets: Meta-World simulation; Open-X subset; proprietary bimanual real-world setup.
Compute: offline pre-training (IQL) + online RL fine-tuning; moderate compute.

Insights

The video-rewinding trick is elegantly simple: it is a data augmentation strategy that requires no additional collection, only temporal reversal. It is the kind of insight that seems obvious in hindsight. The 5x real-world improvement in 1 hour is the headline result. It suggests that a reward model plus online RL can do the adaptation work that would otherwise require hours of demonstration collection per task, which directly challenges the “demonstration bottleneck” in robot learning.

Connections

Raw Excerpt

We beat baselines by 2X in simulation and improve real-world pre-trained policies by 5X in just 1 hour. We pre-train a policy and reward model from a small set of language-labeled demos. Then, we solve unseen task variations via language-guided RL — without additional demos.
