VLA-0: Building State-of-the-Art VLAs with Zero Modification

This article was generated by AI analysis.

Created: 2026-03-28. Source: https://vla0.github.io/

Summary

VLA-0 (arXiv:2510.13054, NVIDIA) investigates the simplest possible approach to building Vision-Language-Action models: representing robot actions directly as plain text in a standard VLM with zero architectural modification. Despite (or because of) this simplicity, VLA-0 achieves the best average rank (1.0) among models without large-scale pretraining on the LIBERO benchmark (94.7% average success), outperforming more complex approaches with custom action heads or discretized tokens.

Prerequisites

Vision-Language Models (VLMs) — the base architecture (e.g., PaliGemma, Llama variants) that VLA-0 uses without modification; needed to understand what “zero modification” means
Action tokenization — the main alternative to VLA-0’s approach: discretizing continuous action vectors into learned tokens or extending the VLM vocabulary; understanding this enables appreciating what VLA-0 avoids
LIBERO benchmark — a standard robot manipulation evaluation suite with four task categories (Spatial, Object, Goal, Long-horizon); required to interpret the results table
Behavior cloning / imitation learning — VLA-0 is trained via supervised imitation on robot demonstrations; understanding the training paradigm explains why text-as-action can work
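
For contrast with text-as-action, the action tokenization listed above is commonly implemented as uniform binning, with each bin index then mapped to a dedicated token. The following is a minimal sketch of that alternative, not VLA-0's method; the 256-bin count, the [-1, 1] range, and the function names are assumptions chosen for illustration.

```python
def discretize(value, low, high, num_bins=256):
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The top edge (fraction == 1.0) would land in bin num_bins, so clamp it.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(index, low, high, num_bins=256):
    """Reconstruct a continuous value as the centre of the given bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# Illustrative values only: one action dimension bounded in [-1, 1].
bin_index = discretize(0.3, -1.0, 1.0)
recovered = undiscretize(bin_index, -1.0, 1.0)
```

A model using this scheme needs new vocabulary entries (or repurposed ones) for the bin indices, which is the kind of change VLA-0 avoids.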

Core Idea

VLA-0’s hypothesis is that the complexity added by most VLA approaches — custom action vocabularies, discrete action tokens, separate action heads — may be unnecessary, and that treating actions as continuous-valued numbers formatted as plain text strings leverages the VLM’s existing numerical reasoning and instruction-following capabilities. The key design insight is choosing the right text format for actions (decimal numbers with appropriate precision) and training procedure, rather than any architectural modification. This allows VLA-0 to benefit from the full VLM pretraining without any adaptation mismatch from architectural surgery.
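
The text-as-action idea can be sketched as a pair of functions that turn an action vector into a string the VLM is trained to emit, and parse the generated string back. This note does not give VLA-0's exact format, so everything concrete here is an assumption: the integer range 0 to 1000, the space separator, the per-dimension bounds, and the function names.

```python
def action_to_text(action, low, high, max_int=1000):
    """Normalise each dimension to an integer in [0, max_int] and join with spaces."""
    parts = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)
        parts.append(str(round((clipped - lo) / (hi - lo) * max_int)))
    return " ".join(parts)


def text_to_action(text, low, high, max_int=1000):
    """Parse generated text back to continuous values.

    Raises ValueError if the number of fields is wrong or a field is not an integer.
    """
    tokens = text.split()
    if len(tokens) != len(low):
        raise ValueError(f"expected {len(low)} numbers, got {len(tokens)}")
    action = []
    for token, lo, hi in zip(tokens, low, high):
        n = min(max(int(token), 0), max_int)  # clamp out-of-range model output
        action.append(lo + n / max_int * (hi - lo))
    return action


# Hypothetical 3-dimensional action with bounds [-1, 1] per dimension.
low, high = [-1.0] * 3, [1.0] * 3
target_text = action_to_text([0.0, -1.0, 1.0], low, high)
decoded = text_to_action(target_text, low, high)
```

During imitation learning the string from `action_to_text` would be the supervised target. At inference the parser has to tolerate malformed generations, which is why it validates the field count and clamps values.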

Results

Model	Type	Avg Success	Avg Rank	Notes
VLA-0 (ours)	Simple (text)	94.7%	1.0	No large-scale pretraining
π₀.₅-KI	Gen Head	93.3%	2.3	No large-scale pretraining
OpenVLA-OFT	Custom	91.9%	2.8	No large-scale pretraining
SmolVLA (2.25B)	Gen Head	88.8%	4.0	No large-scale pretraining
Diffusion Policy	N/A	72.4%	6.5	No large-scale pretraining
GR00T-N1	Gen Head	93.9%	4.5	With large-scale pretraining
π₀	Gen Head	94.2%	3.3	With large-scale pretraining
OpenVLA-OFT	Custom	97.1%	1.5	With large-scale pretraining

VLA-0 also outperforms SmolVLA on real robot tasks using the SO-100 platform.

Limitations

Author-stated: “the simplest strategy… has remained largely unexplored” — the paper is partly a documentation of a gap rather than a full analysis of why text representation works
Unstated: LIBERO is a tabletop manipulation benchmark; generalization to contact-rich, deformable, or mobile manipulation is not demonstrated
Unstated: “zero modification” still requires choosing text formatting — decimal precision, tokenization of numbers — which is itself a design choice not fully ablated
Unstated: real-robot results on SO-100 appear limited in scope; full quantitative real-robot evaluation not available from excerpt
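
The formatting point above can be made concrete: the number of digits written per action dimension bounds how accurately an action survives the trip through text. The sketch below measures the worst-case round-trip error for fixed-point decimal strings. The [-1, 1] range, the sample grid, and the function name are assumptions for illustration, not settings from the paper.

```python
def max_roundtrip_error(decimals, samples=10001, low=-1.0, high=1.0):
    """Worst absolute error after formatting values with `decimals` places and parsing back."""
    worst = 0.0
    for i in range(samples):
        value = low + (high - low) * i / (samples - 1)
        text = f"{value:.{decimals}f}"  # e.g. decimals=2 gives "0.37"
        worst = max(worst, abs(float(text) - value))
    return worst


# Each extra decimal place cuts the error bound by 10x but lengthens the output string.
error_1dp = max_roundtrip_error(1)
error_3dp = max_roundtrip_error(3)
```

The error is bounded by half of the last written digit (0.05 for one decimal place, 0.0005 for three), so precision trades off directly against the number of tokens the model must generate per action.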

Reproducibility

Code: project page at vla0.github.io; code availability not confirmed from available excerpt
Datasets: LIBERO benchmark (standard); SO-100 real robot demonstrations
Compute: not specified in available excerpt; likely standard VLM fine-tuning compute

Insights

The broader significance: VLA-0 joins a line of work suggesting that a simple method can beat more elaborate ones when the base model is strong enough. Custom action heads and discrete action tokens add machinery and risk disturbing the VLM’s pretrained representations, whereas text-as-action asks the model to do what it already does: generate text. Average rank complements average success. Assuming ranks are computed per task suite and then averaged, an average rank of 1.0 means VLA-0 is the top model without large-scale pretraining on every suite, not merely on average. In the comparison that includes models with large-scale pretraining, VLA-0’s average rank is reported as 2.8, and its 94.7% average success trails the pretrained OpenVLA-OFT (97.1%) while staying ahead of π₀ (94.2%) and GR00T-N1 (93.9%). This suggests that text representation alone doesn’t fully substitute for data scale.
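
How an average rank differs from an average success rate can be seen in a small sketch. The per-suite numbers below are invented for three hypothetical models (A, B, C) and are not results from the paper. Ties share the better rank here, which is an assumption; the note does not say how the paper handles ties.

```python
def average_ranks(success):
    """success maps model name -> list of per-suite success rates.

    Returns model name -> mean rank across suites, where rank 1 is best
    and tied models share the better rank.
    """
    models = list(success)
    num_suites = len(next(iter(success.values())))
    totals = {model: 0.0 for model in models}
    for suite in range(num_suites):
        for model in models:
            # Rank = 1 + number of models that did strictly better on this suite.
            better = sum(success[other][suite] > success[model][suite] for other in models)
            totals[model] += 1 + better
    return {model: totals[model] / num_suites for model in models}


# Hypothetical success rates (%) on four suites; NOT taken from the paper.
scores = {
    "A": [90, 80, 95, 70],
    "B": [92, 85, 96, 75],
    "C": [99, 60, 90, 60],
}
ranks = average_ranks(scores)
```

In this made-up example, model C wins one suite outright but ranks last on the other three, so its average rank is the worst of the three. A model reaches an average rank of exactly 1.0 only by being first on every suite.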

Connections

Raw Excerpt

Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored. This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models.
