Summary

VLA-0 (arXiv:2510.13054, NVIDIA) investigates the simplest possible approach to building Vision-Language-Action (VLA) models: representing robot actions directly as plain text in a standard VLM with zero architectural modification. Despite (or because of) this simplicity, VLA-0 reaches 94.7% average success on the LIBERO benchmark and the best average rank (1.0) among models trained without large-scale pretraining, outperforming more complex approaches that use custom action heads or discretized action tokens.

Prerequisites

  • Vision-Language Models (VLMs): the kind of base model (e.g., PaliGemma, Llama variants) that VLA-0 uses without modification; needed to understand what “zero modification” means
  • Action tokenization: the main alternative to VLA-0’s approach, which discretizes continuous action vectors into learned tokens or extends the VLM vocabulary; knowing it shows what VLA-0 avoids
  • LIBERO benchmark: a standard robot manipulation evaluation suite with four task categories (Spatial, Object, Goal, Long-horizon); required to interpret the results table
  • Behavior cloning / imitation learning: VLA-0 is trained via supervised imitation on robot demonstrations; understanding the training paradigm explains why text-as-action can work
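
For readers new to action tokenization, the discretization alternative listed above can be sketched as uniform binning of each action dimension. This is a minimal illustration, not the scheme of any specific model: the function names, the [-1, 1] action range, and the 256-bin count are all assumptions made for the example.

```python
def action_to_bins(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action value to an integer bin index in [0, n_bins - 1].

    Assumed setup: every dimension shares the same [low, high] range.
    """
    bins = []
    for x in action:
        x = min(max(x, low), high)            # clip out-of-range values
        frac = (x - low) / (high - low)       # position in [0, 1]
        bins.append(min(int(frac * n_bins), n_bins - 1))
    return bins


def bins_to_action(bins, low=-1.0, high=1.0, n_bins=256):
    """Decode bin indices back to continuous values (the bin centers)."""
    width = (high - low) / n_bins
    return [low + (b + 0.5) * width for b in bins]


# Hypothetical 3-dimensional action, for illustration only.
example_bins = action_to_bins([-1.0, 0.0, 1.0])
example_decoded = bins_to_action(example_bins)
```

In a tokenized VLA, each bin index would be mapped to a dedicated vocabulary entry, which is the vocabulary change that a plain-text representation sidesteps.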

Core Idea

VLA-0’s hypothesis is that the complexity added by most VLA approaches (custom action vocabularies, discrete action tokens, separate action heads) may be unnecessary, and that treating actions as continuous-valued numbers formatted as plain text strings leverages the VLM’s existing numerical reasoning and instruction-following capabilities. The key design insight lies in choosing the right text format for actions (decimal numbers with appropriate precision) and the right training procedure, rather than in any architectural modification. This lets VLA-0 benefit from the full VLM pretraining without the adaptation mismatch that architectural surgery can introduce.
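
To make the text-as-action idea concrete, here is a minimal sketch of encoding an action vector as a plain-text string and parsing generated text back into numbers. The exact format VLA-0 uses is not given in this note, so the space-separated layout, the `decimals` parameter, and both function names are assumptions for illustration.

```python
import math


def action_to_text(action, decimals=3):
    """Format an action vector as space-separated fixed-precision decimals.

    Assumed format: the note only says actions are decimal numbers written as text.
    """
    return " ".join(f"{x:.{decimals}f}" for x in action)


def text_to_action(text, expected_dim):
    """Parse generated text back into a list of floats.

    Raises ValueError if the text has the wrong number of values or contains
    something that is not a finite number, since free-form generation can
    produce malformed output.
    """
    parts = text.split()
    if len(parts) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(parts)}")
    values = [float(p) for p in parts]  # float() raises ValueError on bad tokens
    if not all(math.isfinite(v) for v in values):
        raise ValueError("non-finite value in action text")
    return values


# Hypothetical 3-dimensional action, for illustration only.
target_text = action_to_text([0.12345, -0.5, 1.0])
recovered = text_to_action(target_text, expected_dim=3)
```

During imitation learning, a string like `target_text` would serve as the supervised output for an (image, instruction) input. The choice of `decimals` trades sequence length against rounding error: three decimals bound the per-dimension rounding error at about 0.0005.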

Results

| Model | Type | Avg Success | Avg Rank | Notes |
|---|---|---|---|---|
| VLA-0 (ours) | Simple (text) | 94.7% | 1.0 | No large-scale pretraining |
| π₀.₅-KI | Gen Head | 93.3% | 2.3 | No large-scale pretraining |
| OpenVLA-OFT | Custom | 91.9% | 2.8 | No large-scale pretraining |
| SmolVLA (2.25B) | Gen Head | 88.8% | 4.0 | No large-scale pretraining |
| Diffusion Policy | N/A | 72.4% | 6.5 | No large-scale pretraining |
| GR00T-N1 | Gen Head | 93.9% | 4.5 | With large-scale pretraining |
| π₀ | Gen Head | 94.2% | 3.3 | With large-scale pretraining |
| OpenVLA-OFT | Custom | 97.1% | 1.5 | With large-scale pretraining |

VLA-0 also outperforms SmolVLA on real robot tasks using the SO-100 platform.

Limitations

  • Author-stated: “the simplest strategy… has remained largely unexplored” — the paper is partly a documentation of a gap rather than a full analysis of why text representation works
  • Unstated: LIBERO is a tabletop manipulation benchmark; generalization to contact-rich, deformable, or mobile manipulation is not demonstrated
  • Unstated: “zero modification” still requires choosing text formatting — decimal precision, tokenization of numbers — which is itself a design choice not fully ablated
  • Unstated: real-robot results on SO-100 appear limited in scope; full quantitative real-robot evaluation not available from excerpt

Reproducibility

  • Code: project page at vla0.github.io; code availability not confirmed from available excerpt
  • Datasets: LIBERO benchmark (standard); SO-100 real robot demonstrations
  • Compute: not specified in available excerpt; likely standard VLM fine-tuning compute

Insights

The broader significance: VLA-0 joins a line of work showing that simplicity often beats complexity in ML when the base model is powerful enough. Custom action heads and discrete tokens add complexity and risk disrupting the VLM’s pretrained representations, whereas text-as-action asks the model to do what it already does: generate text. An average rank of 1.0 among models without large-scale pretraining means VLA-0 places first on every ranked task suite, which is a stronger statement than a high average success rate, since one strong suite can inflate an average. The comparison against models with large-scale pretraining (where VLA-0 still achieves rank 2.8) suggests that text representation alone doesn’t fully substitute for data scale.
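
As a reading aid for the Avg Rank column, the sketch below shows one common way to compute an average rank from per-suite success rates. The model names and scores are hypothetical placeholders, not results from the paper, and averaging tied ranks is an assumption about how ties would be handled.

```python
def average_ranks(scores):
    """Return each model's mean rank across suites (rank 1 = highest success).

    scores: dict mapping model name -> list of per-suite success rates,
    all lists having the same length. Tied models share the mean of the
    rank positions they occupy.
    """
    models = list(scores)
    n_suites = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for s in range(n_suites):
        column = {m: scores[m][s] for m in models}
        for m in models:
            better = sum(1 for o in models if column[o] > column[m])
            tied = sum(1 for o in models if column[o] == column[m])  # includes m
            totals[m] += better + (tied + 1) / 2
    return {m: totals[m] / n_suites for m in models}


# Hypothetical success rates on four suites, for illustration only.
hypothetical_scores = {
    "model_a": [96.0, 97.0, 95.0, 90.0],
    "model_b": [95.0, 96.0, 94.0, 60.0],
    "model_c": [85.0, 85.0, 85.0, 85.0],
}
ranks = average_ranks(hypothetical_scores)
```

In this toy example `model_a` is best on every suite, so its average rank is exactly 1.0. Any model that loses even one suite ends up with an average rank above 1.0.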

Raw Excerpt

Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored. This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models.