Summary

Physical Intelligence introduces RLT (RL tokens), a method that extracts a compact latent representation from Vision-Language-Action (VLA) models and uses it to train lightweight actor-critic networks directly on-robot via online RL. The approach achieves up to a 3× speedup on precision manipulation tasks (screwdriver, zip tie, Ethernet, charger insertion) using only minutes to hours of real-world data. Rather than replacing the VLA’s predicted action, the actor learns to refine it, preserving generalization while adding precision.

Key Points

  • Extracts a compressed “RL token” from VLA embeddings via an encoder-decoder bottleneck
  • Lightweight actor-critic networks train on-device at hundreds of updates per second
  • Actor edits the VLA’s predicted action rather than replacing it — keeps baseline behavior intact
  • Regularization constrains exploration near baseline, deviating only when beneficial
  • Results: Screwdriver 1.7 → 14 successes/10min; Ethernet 147 → 400; Charger 136 → 600
  • 50% of Ethernet insertion trials exceeded all human teleoperation speeds
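
The first three points above can be sketched in a few lines. Everything in this sketch (the dimensions, weight shapes, and the tanh edit bound) is a hypothetical illustration; the post does not publish the actual encoder or actor architecture:

```python
import numpy as np

# Hypothetical sketch of the pipeline described above; names,
# dimensions, and weight shapes are illustrative assumptions.
rng = np.random.default_rng(0)

def encode_rl_token(vla_embedding, w_enc):
    """Bottleneck: compress a high-dim VLA embedding into a compact
    'RL token' that the lightweight actor-critic consumes."""
    return np.tanh(w_enc @ vla_embedding)

def actor_edit(rl_token, vla_action, w_actor, max_edit=0.05):
    """The actor outputs a bounded residual added to the VLA action,
    so it edits the baseline rather than replacing it; the tanh bound
    keeps exploration near the baseline policy."""
    delta = max_edit * np.tanh(w_actor @ rl_token)
    return vla_action + delta

# Toy dimensions: 1024-d embedding -> 32-d token; 7-DoF action.
D_EMB, D_TOK, D_ACT = 1024, 32, 7
w_enc = rng.normal(scale=D_EMB ** -0.5, size=(D_TOK, D_EMB))
w_actor = rng.normal(scale=D_TOK ** -0.5, size=(D_ACT, D_TOK))

embedding = rng.normal(size=D_EMB)
base_action = rng.normal(size=D_ACT)
token = encode_rl_token(embedding, w_enc)
edited = actor_edit(token, base_action, w_actor)
# The residual is bounded, so the edited action stays within
# max_edit of the baseline action in every dimension.
```

The small matrix shapes are the point: networks this size are cheap enough to update hundreds of times per second on-device, consistent with the training rates reported above.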

Insights

  • The “edit, don’t replace” framing is architecturally elegant: it sidesteps catastrophic forgetting by keeping the VLA frozen and layering adaptation on top
  • Sample efficiency is the crux: prior RL-for-robotics work often required vastly more environment interaction, so reaching this performance with minutes to hours of real data is a step change
  • On-device training at hundreds of updates per second suggests the critic and actor are extremely lightweight, likely far smaller than the VLA itself
  • The contact-rich, sub-millimeter precision regime is exactly where imitation learning from human demos hits a ceiling, since demonstrations are too noisy at that scale; RL’s ability to optimize directly for success is load-bearing here
  • This points toward a deployment paradigm where robots ship with a capable-but-imprecise base policy and self-improve in the field
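
The "edit, don't replace" plus near-baseline regularization story has a clean toy form: maximizing a quadratic critic Q(a) minus a penalty lam * ||a - a_vla||^2 gives a closed-form action that interpolates between the baseline and the critic's optimum, so the edit stays small unless the critic sees a large benefit. This is a minimal sketch with assumed toy quantities, not the method's actual objective:

```python
import numpy as np

def q_toy(action, a_star):
    """Toy critic: a quadratic bowl peaked at the true optimum a_star.
    (Purely illustrative; the post does not describe the critic.)"""
    return -np.sum((action - a_star) ** 2)

def refine(a_vla, a_star, lam):
    """Argmax of Q(a) - lam * ||a - a_vla||^2 for the toy quadratic
    critic, in closed form: a weighted average of the baseline action
    and the critic's optimum."""
    return (a_star + lam * a_vla) / (1.0 + lam)

a_vla = np.array([0.0, 0.0])      # frozen VLA's baseline action
a_star = np.array([0.4, -0.2])    # where the critic sees success

tight = refine(a_vla, a_star, lam=9.0)   # heavy regularization: tiny edit
loose = refine(a_vla, a_star, lam=0.1)   # light regularization: large edit
```

As lam grows the refined action collapses onto the baseline, which is exactly the "deviate only when beneficial" behavior: a well-calibrated critic is what licenses larger edits.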

Connections

Raw Excerpt

“the actor receives the VLA’s predicted action as input, so it learns to edit the VLA action rather than replace it entirely.”