Summary
A practitioner’s field guide to Vision-Language-Action (VLA) model research at ICLR 2026, written by Moritz Reuss. The piece covers the explosion in submissions (1 → 9 → 164 over two years, an 18x jump in the final year), how to interpret saturated benchmarks (LIBERO, CALVIN, SIMPLER), nine active research directions, and the hidden gap between frontier closed-weight models and open-source VLAs. Two critically underrepresented problems are identified: data quality and in-context learning.
Key Points
- VLA definition: a model built on an internet-scale vision-language pretrained backbone and finetuned to output control commands; the scale of pretraining is the key differentiator from ordinary multimodal policies
- LBM (Large Behavior Models): trained on large-scale robot demonstrations without vision-language pretraining; every VLA trained on massive robot data is an LBM, but not every LBM is a VLA
- ICLR growth: 1 (2024) → 9 (2025) → 164 (2026); field transitioned from niche to mainstream
- Benchmark saturation: 95%+ on LIBERO is now standard, with little discriminative power above that; CALVIN ABC scores above 4.0 are the norm; SIMPLER results are highly variable, making cross-paper comparison difficult
- 9 research trends: Discrete Diffusion, Reasoning/ECoT, New Tokenizers, Efficient VLAs, RL for VLAs, Video Prediction, Better Benchmarks, Cross-Action-Space, Memory/Composition
- Frontier gap: open-source VLAs match frontier models in simulation but degrade significantly on zero-shot open-world tasks, mirroring the open/closed-weight divide in LLMs
- Underrepresented: data quality curation methods and in-context learning for physical tasks
- Stage-aware RL (reach→grasp→transport→place) with semantic phase rewards is a notable practical approach
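To make the stage-aware reward idea concrete, here is a minimal sketch of semantic phase rewards for a pick-and-place task. The stage names follow the reach→grasp→transport→place decomposition above; the state-dict layout, thresholds, and function names are illustrative assumptions, not the method from any specific paper.

```python
# Hypothetical sketch of stage-aware reward shaping for pick-and-place.
# Thresholds and the state dictionary keys are illustrative assumptions.

STAGES = ["reach", "grasp", "transport", "place"]

def detect_stage(state):
    """Map a low-dimensional state summary to the current semantic phase."""
    if not state["gripper_closed"]:
        return "reach"
    if state["object_height"] < 0.05:   # object not yet lifted
        return "grasp"
    if state["dist_to_goal"] > 0.02:    # lifted but far from target
        return "transport"
    return "place"

def stage_reward(state):
    """Dense reward in [0, 1]: completed stages count fully,
    the current stage contributes partial progress."""
    stage = detect_stage(state)
    idx = STAGES.index(stage)
    if stage == "reach":
        progress = 1.0 - min(state["dist_to_object"] / 0.5, 1.0)
    elif stage == "transport":
        progress = 1.0 - min(state["dist_to_goal"] / 0.5, 1.0)
    else:
        progress = 1.0  # grasp/place treated as near-binary here
    return (idx + progress) / len(STAGES)
```

The design point is that reward increases monotonically across phases, so the policy gets a learning signal long before the sparse task-success bonus fires.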
Insights
- The 18x submission jump in one year is unusually fast even by ML standards — it signals that VLA is now a primary destination for CV/ML researchers seeking impact, which will compress iteration cycles but also increase noise
- Benchmark saturation creating a “hidden gap” is the same problem described in the “AI moats” article: 60% (or 95% simulation) looks like success but is meaningless when the real bar is zero-shot generalization
- “VLM backbone selection uncorrelated with standard VLM benchmarks” is a surprising finding — it means the conventional wisdom of “start with the best VLM” is empirically unfounded for robot control, suggesting robot-specific pretraining signals matter more than general VL capabilities
- The RLT paper already in this vault (pi.website/research/rlt) is a direct example of Trend #5 (RL for VLAs): residual RL on top of a frozen VLA base policy — this vault is building a connected knowledge graph on this topic
- Discrete diffusion solving the autoregressive bottleneck for ECoT is an elegant architectural insight: the reason ECoT was slow wasn’t the reasoning itself but generating action sequences token-by-token
- Data quality being underrepresented despite being “widely acknowledged as critical” is a common pattern in ML research: the boring infrastructure problems (data curation, evaluation methodology) are less publishable than novel architectures
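The autoregressive-bottleneck insight above can be sketched in terms of sequential decoding cost: an autoregressive decoder needs one model pass per action token, while a discrete-diffusion-style decoder refines the whole action chunk in a fixed number of denoising steps. The functions and numbers below are a toy illustration of step counts, not any paper's actual architecture.

```python
# Toy comparison of sequential decoding cost (model passes, not quality).
# Chunk sizes and denoise_steps are illustrative assumptions.

def autoregressive_steps(num_action_tokens):
    # One sequential forward pass per generated token.
    return num_action_tokens

def discrete_diffusion_steps(num_action_tokens, denoise_steps=4):
    # All tokens are updated in parallel at each denoising step,
    # so sequential cost is independent of chunk length.
    return denoise_steps

# e.g. a 50-step action chunk with 7 tokens per timestep = 350 tokens
chunk_tokens = 50 * 7
print(autoregressive_steps(chunk_tokens))      # 350 sequential passes
print(discrete_diffusion_steps(chunk_tokens))  # 4 sequential passes
```

This is why the reasoning tokens in ECoT were never the dominant latency cost: the token-by-token action decode scales with chunk length, and parallel refinement removes that dependence.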
Connections
- RLT: Online RL for Precise Robot Manipulation
- Vision-Language-Action Models
- Reinforcement Learning
- Robotics
- Diffusion Models
- Embodied AI
- Physical Intelligence
- Benchmarking
Raw Excerpt
Despite open-source VLAs matching frontier performance on simulation (LIBERO, CALVIN), significant gaps emerge in zero-shot open-world behavior post-pretraining. This parallels gaps in LLMs and VLMs between closed and open-weight models.