An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

This note was generated by AI analysis.

Created: 2026-03-28 Source: https://arxiv.org/abs/2512.11362

Summary

A comprehensive survey of Vision-Language-Action (VLA) models, prepared for IEEE TPAMI submission and structured around three axes: core Modules (perception, language, action), historical Milestones, and five key Challenges (Representation, Execution, Generalization, Safety, Dataset/Evaluation). It is designed as both a newcomer's guide and a research roadmap for embodied intelligence.

Prerequisites

Large Language Models (LLMs) — VLA models extend LLMs to ground language understanding in perception and physical action; understanding transformer architecture and instruction following is foundational
Foundation models / pretraining — VLA models leverage pretrained vision-language representations; understanding transfer learning explains why they generalize across tasks
Imitation learning / behavior cloning — most VLA action heads are trained via IL from demonstrations; understanding the relationship between observation, state, and action is required
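
As a rough illustration of the behavior cloning prerequisite, the sketch below fits a policy to demonstrated (observation, action) pairs by minimizing mean squared error. Everything in it is an assumption made for illustration: the 1-D linear policy, the synthetic "expert" rule, and the names `bc_train` and `demos` are hypothetical and do not come from the survey. Real VLA action heads are far larger, but the supervised objective has the same shape.

```python
def bc_train(demos, lr=0.1, epochs=200):
    """Fit a 1-D linear policy a = w * obs + b to demonstrations.

    Behavior cloning treats control as supervised learning: the loss is the
    mean squared error between predicted and demonstrated actions.
    """
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for obs, act in demos:
            err = (w * obs + b) - act      # prediction error on one demo
            grad_w += 2.0 * err * obs / n  # d(MSE)/dw
            grad_b += 2.0 * err / n        # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical expert (assumption): action = 2 * observation - 1.
demos = [(k / 10, 2 * (k / 10) - 1) for k in range(-10, 11)]
w, b = bc_train(demos)
```

The learned `w` and `b` approach the expert's slope and offset because the demonstrations are noise-free and the policy class contains the expert; neither holds in general for real robot data.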

Core Idea

VLA models unify perception (vision), language understanding, and motor control into a single architecture, typically by adapting a pretrained vision-language model with an action head that outputs robot commands. The survey organizes the field by decomposing VLAs into their constituent modules (visual encoder, language backbone, action decoder) and tracing how each module has evolved across key milestone models (RT-2, OpenVLA, pi0, etc.). The five-challenge framework then maps the open problems: how to learn rich physical representations, execute actions with sufficient precision and speed, generalize across embodiments and environments, ensure safe deployment, and build adequate evaluation infrastructure.
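
The modular decomposition described above can be sketched as a composition of three callables. This is a minimal structural sketch, not any reviewed model's implementation: the class `VLAPolicy`, the `toy_*` functions, and the "slowly" scaling rule are hypothetical stand-ins chosen only to make the data flow (image and instruction in, action vector out) concrete.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Image = List[List[float]]


@dataclass
class VLAPolicy:
    """Hypothetical container wiring the three modules together."""
    visual_encoder: Callable[[Image], Vector]
    language_backbone: Callable[[Vector, str], Vector]
    action_decoder: Callable[[Vector], Vector]

    def act(self, image: Image, instruction: str) -> Vector:
        visual_features = self.visual_encoder(image)
        fused = self.language_backbone(visual_features, instruction)
        return self.action_decoder(fused)


def toy_visual_encoder(image: Image) -> Vector:
    # Toy stand-in (assumption): one feature per image row, the row mean.
    return [sum(row) / len(row) for row in image]


def toy_language_backbone(features: Vector, instruction: str) -> Vector:
    # Toy stand-in (assumption): the word "slowly" halves the features.
    scale = 0.5 if "slowly" in instruction.lower().split() else 1.0
    return [scale * f for f in features]


def toy_action_decoder(fused: Vector) -> Vector:
    # Toy stand-in (assumption): clip each value to a [-1, 1] command range.
    return [max(-1.0, min(1.0, f)) for f in fused]


policy = VLAPolicy(toy_visual_encoder, toy_language_backbone, toy_action_decoder)
image = [[0.0, 1.0], [1.0, 1.0]]
fast = policy.act(image, "move")
slow = policy.act(image, "move slowly")
```

Keeping the modules behind plain callables mirrors how the survey traces each component separately: any one of them can be swapped without touching the other two.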

Results

Survey paper — no single benchmark result. Key findings from reviewed literature:

RT-2 demonstrated that language-conditioned web-scale pretraining transfers to robot manipulation tasks
Diffusion-style action heads (e.g., pi0, which uses flow matching) outperform categorical/regression heads on dexterous manipulation
Generalization across embodiments remains an open problem; most models show significant performance drops on novel hardware
Safety evaluation methodology is underdeveloped relative to capability evaluation
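
To make the contrast with categorical action heads concrete, the sketch below shows one common way a continuous action can be turned into a discrete token and back, using uniform binning. The note does not say how the reviewed models discretize actions, so the bin count of 256, the [-1, 1] range, and the function names are illustrative assumptions.

```python
def action_to_token(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    value = min(max(value, low), high)  # clip to the valid range
    idx = int((value - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)         # the upper edge falls in the last bin


def token_to_action(token, low=-1.0, high=1.0, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


token = action_to_token(0.3)
recovered = token_to_action(token)
```

Under uniform binning the round-trip error is at most half a bin width, which is one way to see why discretized heads can struggle when a task needs finer precision than the bins allow.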

Limitations

Author-stated: survey covers a fast-moving field; a live-updated project page is maintained to address staleness
Unstated: the five-challenge taxonomy (Representation, Execution, Generalization, Safety, Dataset/Evaluation) reflects the current community consensus but may not capture emerging paradigms such as world-model-based planning or neuro-symbolic approaches

Reproducibility

Code: survey paper; references and links to original model repos included
Datasets: covers standard benchmarks (RLBench, LIBERO, BridgeData, Open X-Embodiment)
Compute: N/A for survey; individual reviewed models range from single-GPU inference to large-scale multi-GPU training clusters

Insights

The five-challenge taxonomy is the survey's most practical output for researchers: it provides a common vocabulary for positioning new work. Any VLA paper should be able to answer the question "Where does your paper fit: Representation, Execution, Generalization, Safety, or Dataset/Evaluation?"

The survey’s emphasis on Safety as a top-level challenge (not just an afterthought) reflects the field’s maturation: as VLA systems approach real-world deployment in manipulation and mobile robotics, safety constraints must be architecturally designed in, not retrofitted.

Connections

Raw Excerpt

VLA models have catalyzed real-world human environments by connecting vision, language, and physical action. This survey offers a clear and structured guide: Modules, Milestones, and Challenges — the natural learning path of a researcher entering this space.
