Summary
A practitioner’s field guide to Vision-Language-Action (VLA) model research at ICLR 2026, written by Moritz Reuss. The piece covers the explosion in submissions (1 → 9 → 164 over two years, an 18x jump in the final year), how to interpret saturated benchmarks (LIBERO, CALVIN, SIMPLER), nine active research directions, and the hidden gap between frontier closed-weight models and open-source VLAs. Two critically underrepresented problems are identified: data quality and in-context learning.
Key Points
- VLA definition: a model built on an internet-scale vision-language pretrained backbone and finetuned to output control commands; the scale of pretraining is the key differentiator from ordinary multimodal policies
- LBM (Large Behavior Models): trained on large-scale robot demonstrations without vision-language pretraining; every VLA trained on massive robot data is an LBM, but not every LBM is a VLA
- ICLR growth: 1 (2024) → 9 (2025) → 164 (2026); field transitioned from niche to mainstream
- Benchmark saturation: 95%+ on LIBERO is now standard, with little discriminative power above that; CALVIN ABC scores above 4.0 are the norm; SIMPLER results are highly variable, making cross-paper comparison difficult
- 9 research trends: Discrete Diffusion, Reasoning/ECoT, New Tokenizers, Efficient VLAs, RL for VLAs, Video Prediction, Better Benchmarks, Cross-Action-Space, Memory/Composition
- Frontier gap: open-source VLAs match frontier models in simulation but degrade significantly on zero-shot open-world tasks, mirroring the open/closed-weight divide in LLMs
- Underrepresented: data quality curation methods and in-context learning for physical tasks
- Stage-aware RL (reach→grasp→transport→place) with semantic phase rewards is a notable practical approach
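To make the stage-aware reward idea concrete, here is a minimal sketch of semantic phase rewards for a pick-and-place task. The stage names follow the reach→grasp→transport→place decomposition above; the state-dict layout, thresholds, and function names are illustrative assumptions, not the method from any specific paper.

```python
# Hypothetical sketch of stage-aware reward shaping for pick-and-place.
# Thresholds and the state dictionary keys are illustrative assumptions.

STAGES = ["reach", "grasp", "transport", "place"]

def detect_stage(state):
    """Map a low-dimensional state summary to the current semantic phase."""
    if not state["gripper_closed"]:
        return "reach"
    if state["object_height"] < 0.05:   # object not yet lifted
        return "grasp"
    if state["dist_to_goal"] > 0.02:    # lifted but far from target
        return "transport"
    return "place"

def stage_reward(state):
    """Dense reward in [0, 1]: completed stages count fully,
    the current stage contributes partial progress."""
    stage = detect_stage(state)
    idx = STAGES.index(stage)
    if stage == "reach":
        progress = 1.0 - min(state["dist_to_object"] / 0.5, 1.0)
    elif stage == "transport":
        progress = 1.0 - min(state["dist_to_goal"] / 0.5, 1.0)
    else:
        progress = 1.0  # grasp/place treated as near-binary here
    return (idx + progress) / len(STAGES)
```

The design point is that reward increases monotonically across phases, so the policy gets a learning signal long before the sparse task-success bonus fires.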
Insights
- The 18x submission jump in one year is unusually fast even by ML standards — it signals that VLA is now a primary destination for CV/ML researchers seeking impact, which will compress iteration cycles but also increase noise
- Benchmark saturation creating a “hidden gap” is the same problem described in the “AI moats” article: 60% (or 95% simulation) looks like success but is meaningless when the real bar is zero-shot generalization
- “VLM backbone selection uncorrelated with standard VLM benchmarks” is a surprising finding — it means the conventional wisdom of “start with the best VLM” is empirically unfounded for robot control, suggesting robot-specific pretraining signals matter more than general VL capabilities
- The RLT paper already in this vault (pi.website/research/rlt) is a direct example of Trend #5 (RL for VLAs): residual RL on top of a frozen VLA base policy — this vault is building a connected knowledge graph on this topic
- Discrete diffusion solving the autoregressive bottleneck for ECoT is an elegant architectural insight: the reason ECoT was slow wasn’t the reasoning itself but generating action sequences token-by-token
- Data quality being underrepresented despite being “widely acknowledged as critical” is a common pattern in ML research: the boring infrastructure problems (data curation, evaluation methodology) are less publishable than novel architectures
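The autoregressive-bottleneck insight above can be sketched in terms of sequential decoding cost: an autoregressive decoder needs one model pass per action token, while a discrete-diffusion-style decoder refines the whole action chunk in a fixed number of denoising steps. The functions and numbers below are a toy illustration of step counts, not any paper's actual architecture.

```python
# Toy comparison of sequential decoding cost (model passes, not quality).
# Chunk sizes and denoise_steps are illustrative assumptions.

def autoregressive_steps(num_action_tokens):
    # One sequential forward pass per generated token.
    return num_action_tokens

def discrete_diffusion_steps(num_action_tokens, denoise_steps=4):
    # All tokens are updated in parallel at each denoising step,
    # so sequential cost is independent of chunk length.
    return denoise_steps

# e.g. a 50-step action chunk with 7 tokens per timestep = 350 tokens
chunk_tokens = 50 * 7
print(autoregressive_steps(chunk_tokens))      # 350 sequential passes
print(discrete_diffusion_steps(chunk_tokens))  # 4 sequential passes
```

This is why the reasoning tokens in ECoT were never the dominant latency cost: the token-by-token action decode scales with chunk length, and parallel refinement removes that dependence.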
Connections
- RLT: Online RL for Precise Robot Manipulation
- Vision-Language-Action Models
- Reinforcement Learning
- Robotics
- Diffusion Models
- Embodied AI
- Physical Intelligence
- Benchmarking
Raw Excerpt
Despite open-source VLAs matching frontier performance on simulation (LIBERO, CALVIN), significant gaps emerge in zero-shot open-world behavior post-pretraining. This parallels gaps in LLMs and VLMs between closed and open-weight models.