Vision Language Action Models (VLA) & Policies for Robots

This note was generated by AI analysis.

Created: 2026-04-12 Source: https://learnopencv.com/vision-language-action-models-lerobot-policy/

Summary

A tutorial-style survey of Vision-Language-Action (VLA) models for robot control. It traces the line from Google's RT-2 (2023) through Octo, OpenVLA, π0, and Nvidia's Groot N1, and covers the LeRobot framework. The article explains how VLMs are extended with action tokens to produce motor commands, contrasts dual-system architectures (System 2: a VLM for high-level planning; System 1: a diffusion model for low-level execution) with end-to-end architectures, and provides code walkthroughs for Octo and OpenVLA inference using the LeRobot framework.

Prerequisites

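VLM (Vision-Language Model): the backbone of a VLA. Understanding how a VLM processes multimodal tokens is the basis for understanding how action tokens are added.
Diffusion Models: used as the System 1 action decoder. You need to understand why they suit low-level control better than discrete action tokens.
Imitation Learning / Behavioral Cloning: the training paradigm for VLAs, i.e. learning actions from demonstration trajectories.
Robot Control (end-effector / action space): the physical meaning of the 7D action (xyz + roll/pitch/yaw + gripper).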
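
To make the 7D action in the last prerequisite concrete, here is a minimal sketch of how such a vector might be packed and unpacked. The class name `EEAction`, the units, and the gripper convention are all assumptions for illustration; they are not taken from the article or from any particular robot stack.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class EEAction:
    """Hypothetical container for a 7D end-effector action."""
    dx: float       # translation deltas (units assumed: metres)
    dy: float
    dz: float
    droll: float    # rotation deltas (units assumed: radians)
    dpitch: float
    dyaw: float
    gripper: float  # assumed convention: 0.0 = closed, 1.0 = open

    def to_vector(self):
        """Flatten to the 7-element list a policy would emit."""
        return list(astuple(self))

    @classmethod
    def from_vector(cls, vec):
        """Build an action from a 7-element sequence, rejecting other sizes."""
        if len(vec) != 7:
            raise ValueError(f"expected 7 values, got {len(vec)}")
        return cls(*(float(v) for v in vec))


# Example: move 1 cm along x, 2 cm down, yaw slightly, keep the gripper open.
action = EEAction.from_vector([0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0])
print(action.to_vector())
```

Naming the fields keeps the ordering explicit, which matters because a policy only ever emits a flat vector and a swapped axis is easy to miss.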

Core Idea

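The core breakthrough of VLAs is to extend the VLM's token-prediction mechanism directly to robot action output: robot actions (end-effector translation, rotation, and gripper state) are encoded as discrete tokens, so the VLM can generate action sequences with the same next-token prediction training paradigm. The benefit is that the model reuses the semantic and physical-world understanding the VLM learned from web-scale data, rather than training a perception-action model from scratch. More advanced architectures (such as π0 and Groot N1) adopt a dual-system design: the VLM handles high-level semantic planning (System 2, slow thinking) and a diffusion model handles low-level dexterous action execution (System 1, fast thinking), analogous to Kahneman's dual-process theory.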
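
The encoding of continuous actions as discrete tokens can be sketched as uniform binning per action dimension. This is a minimal illustration under stated assumptions: the function names and the fixed [-1, 1] ranges are hypothetical, and real systems typically derive the ranges from training-data statistics and map bin indices onto entries of the language model's vocabulary.

```python
N_BINS = 256  # bins per action dimension (assumed for this sketch)


def action_to_tokens(action, low, high, n_bins=N_BINS):
    """Map each continuous action value to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)               # clip into the valid range
        frac = (a - lo) / (hi - lo)           # normalise to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def tokens_to_action(tokens, low, high, n_bins=N_BINS):
    """Decode bin indices back to continuous values at the bin centres."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical normalised ranges for a 7D action.
low = [-1.0] * 7
high = [1.0] * 7
action = [0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 0.3]

tokens = action_to_tokens(action, low, high)
decoded = tokens_to_action(tokens, low, high)
max_error = max(abs(a - d) for a, d in zip(action, decoded))

print(tokens)
print(max_error)  # bounded by half a bin width
```

The round trip is lossy: the decoded value can differ from the original by up to half a bin width, which is the precision cost of treating control as token prediction.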

Results

Model	Parameters	Training Data	Notes
RT-2 (Google)	up to 55B (PaLI-X 5B/55B or PaLM-E 12B backbone)	Web + robotics co-training	First VLA, emergent generalization
Octo (Berkeley)	93M	800K demos, Open X-Embodiment	Open source, on par with RT-2
OpenVLA (Stanford)	7B (LLaMA 2)	970K episodes, Open X-Embodiment	Outperforms RT-2-X with 7x fewer params
π0 (Physical Intelligence)	—	Proprietary	Generalist, language-conditioned diffusion
Groot N1 (Nvidia)	—	Proprietary	Dual VLM + diffusion, humanoid focus

Limitations

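Author-stated: OpenVLA does worse than RT-2 on out-of-distribution data, because RT-2 was trained with web-scale data.
Author-stated: Traditional task-specific individual policies still have an edge on specific hardware and tasks.
Unstated: The article takes the state of the art as of April 2025 as its baseline; details for π0 and Groot N1 rest on the author's description and lack third-party verification.
Unstated: Inference speed (control frequency) is an issue for all VLAs, but the article does not discuss it in depth; OpenVLA at 5 Hz may be too slow for fast, dexterous tasks.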
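
A rough calculation shows why control frequency matters. The sketch below assumes a hypothetical end-effector speed of 0.25 m/s and compares the 5 Hz figure above with an arbitrary 50 Hz reference; neither the speed nor the 50 Hz rate comes from the article.

```python
def control_period_s(rate_hz):
    """Time between consecutive commands at a given control rate."""
    return 1.0 / rate_hz


def travel_between_commands_m(speed_m_s, rate_hz):
    """Distance covered at constant speed before the next command arrives."""
    return speed_m_s * control_period_s(rate_hz)


SPEED_M_S = 0.25  # hypothetical end-effector speed

for rate_hz in (5, 50):
    period_ms = control_period_s(rate_hz) * 1000
    travel_cm = travel_between_commands_m(SPEED_M_S, rate_hz) * 100
    print(f"{rate_hz} Hz: {period_ms:.0f} ms per command, "
          f"{travel_cm:.1f} cm travelled between commands")
```

At 5 Hz the arm in this example moves about 5 cm between corrections, which gives a feel for why slow policies can struggle with fast contact-rich motion.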

Reproducibility

Code: Octo — https://github.com/octo-models/octo (open source); OpenVLA — https://github.com/openvla/openvla (open source); LeRobot — Hugging Face
Datasets: Open X-Embodiment dataset (Bridge dataset via GCS); RLDS format
Compute: Octo (93M) can run inference on a single GPU; OpenVLA (7B) needs more VRAM; fine-tuning supports LoRA
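
As a back-of-the-envelope check on the compute note, the memory needed just to hold model weights is parameter count times bytes per parameter. This sketch covers weights only and ignores activations, caches, and framework overhead; the dtype table and the function name are illustrative assumptions, not measurements.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int4": 0.5}


def weight_memory_gb(n_params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


MODELS = {"Octo": 93e6, "OpenVLA": 7e9}  # parameter counts from the table

for name, n_params in MODELS.items():
    for dtype in BYTES_PER_PARAM:
        print(f"{name:8s} {dtype}: {weight_memory_gb(n_params, dtype):.2f} GB")
```

Under these assumptions the 93M model stays well under 1 GB at any precision, while the 7B model needs about 14 GB in bf16 before any runtime overhead is counted.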

Insights

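The architectural insight most worth recording: why is a diffusion head better suited to low-level control than discrete action tokens? Experiments by the Octo authors show that a diffusion-based decoder beats direct discrete-token output on action accuracy. This parallels the advantage of diffusion models in image generation: modeling a continuous space preserves fine-grained information better than forced discretization.

OpenVLA shows a counterintuitive result: a 7B model trained with next-token prediction and cross-entropy loss needs only 256 action tokens (one per discretization bin) to represent the full robot action space, and it outperforms the 55B RT-2-X on in-distribution tasks.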
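
One way to see what forced discretization gives up is that cross-entropy over bins treats the bins as unordered classes. The toy example below is not the Octo authors' experiment; it is a hypothetical illustration with made-up logits, showing that a prediction one bin away from the target and one 118 bins away receive the same loss, even though their decoded actions differ greatly.

```python
import math

N_BINS = 256


def cross_entropy(logits, target):
    """Negative log-probability of the target bin under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(math.fsum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def peaked_logits(center, n_bins=N_BINS, peak=5.0):
    """Toy prediction: all mass concentrated on a single (wrong) bin."""
    logits = [0.0] * n_bins
    logits[center] = peak
    return logits


def bin_center(token, low=-1.0, high=1.0, n_bins=N_BINS):
    """Continuous value at the centre of a bin (range assumed to be [-1, 1])."""
    return low + (token + 0.5) * (high - low) / n_bins


target, near, far = 128, 129, 10

ce_near = cross_entropy(peaked_logits(near), target)
ce_far = cross_entropy(peaked_logits(far), target)
se_near = (bin_center(near) - bin_center(target)) ** 2
se_far = (bin_center(far) - bin_center(target)) ** 2

print(ce_near, ce_far)  # same loss for a near miss and a far miss
print(se_near, se_far)  # very different error in action space
```

A continuous head trained with a distance-aware objective does see the difference between a near miss and a far miss, which is one plausible reading of why continuous decoders preserve fine-grained information better.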

Connections

Raw Excerpt

VLA extends VLM with an additional action and observation state tokens. State: It is a single token and represents robots observations such as sensor values, gripper positions and angles etc. Action: This token represent the sequence of motor commands to be performed to follow along the trajectory with precise control.
