Summary
Physical Intelligence introduces RLT (RL tokens), a method that extracts a compact latent representation from Vision-Language-Action (VLA) models and uses it to train lightweight actor-critic networks directly on-robot via online RL. The approach achieves up to 3× speed improvement on precision manipulation tasks (screwdriver, zip tie, Ethernet, and charger insertion) using only minutes to hours of real-world data. Rather than replacing the VLA’s predicted action, the actor learns to refine it, preserving the VLA’s generalization while adding precision.
Key Points
- Extracts a compressed “RL token” from VLA embeddings via an encoder-decoder bottleneck
- Lightweight actor-critic networks train on-device at hundreds of updates per second
- Actor edits the VLA’s predicted action rather than replacing it — keeps baseline behavior intact
- Regularization constrains exploration near baseline, deviating only when beneficial
- Results: screwdriver 1.7 → 14 successes per 10 min; Ethernet 147 → 400; charger 136 → 600
- 50% of Ethernet insertion trials exceeded all human teleoperation speeds
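The bottleneck and “edit, don’t replace” ideas above can be sketched in a few lines. This is a minimal illustration, not the post’s implementation: the linear encoder/decoder, the residual actor, the dimensions, and the L2 deviation penalty are all assumed for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, TOKEN_DIM, ACT_DIM = 64, 8, 7  # illustrative sizes only

# Encoder-decoder bottleneck: compress the frozen VLA embedding
# into a small "RL token" that serves as the RL state.
W_enc = rng.normal(0, 0.1, (TOKEN_DIM, EMB_DIM))
W_dec = rng.normal(0, 0.1, (EMB_DIM, TOKEN_DIM))  # decoder used only for training the bottleneck

def rl_token(vla_embedding):
    """Compressed latent extracted from the VLA embedding."""
    return W_enc @ vla_embedding

# Lightweight actor: receives the RL token AND the VLA's predicted
# action, and outputs a small bounded edit (residual) to that action.
W_actor = rng.normal(0, 0.01, (ACT_DIM, TOKEN_DIM + ACT_DIM))

def actor(token, vla_action, scale=0.05):
    x = np.concatenate([token, vla_action])
    delta = np.tanh(W_actor @ x)       # bounded edit in (-1, 1)
    return vla_action + scale * delta  # stays close to the baseline action

def regularized_objective(reward, token, vla_action, beta=1.0):
    """Reward minus an (assumed) L2 penalty on deviating from the VLA action,
    so the policy only strays from the baseline when it pays off."""
    edit = actor(token, vla_action) - vla_action
    return reward - beta * float(edit @ edit)

emb = rng.normal(size=EMB_DIM)
base_action = rng.normal(size=ACT_DIM)
edited = actor(rl_token(emb), base_action)
print(float(np.abs(edited - base_action).max()))  # always below the 0.05 bound
```

Because the actor outputs only a bounded residual, the system degrades gracefully: with zero weights it reproduces the VLA’s behavior exactly, which matches the note’s point about keeping the baseline intact.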
Insights
- The “edit, don’t replace” framing is architecturally elegant: it sidesteps catastrophic forgetting by keeping the VLA frozen and layering adaptation on top
- Sample efficiency is the crux — prior RL-for-robotics work often required thousands of environment steps; minutes-to-hours of real data is a step change
- On-device training at hundreds of updates per second suggests the critic and actor are extremely lightweight, likely far smaller than the VLA itself
- The contact-rich, sub-millimeter precision regime is exactly where imitation learning from human demos hits a ceiling (human demos are noisy at that scale), so RL’s ability to optimize directly for success is load-bearing here
- This points toward a deployment paradigm where robots ship with a capable-but-imprecise base policy and self-improve in the field
Connections
- Vision-Language-Action Models
- Reinforcement Learning
- Physical Intelligence
- Robot Manipulation
- Online Learning
- Lessons from Building Claude Code: How We Use Skills
Raw Excerpt
“the actor receives the VLA’s predicted action as input, so it learns to edit the VLA action rather than replace it entirely.”