Towards a Unified Understanding of Robot Manipulation: Comprehensive Survey of Learning-Based Control

This article was generated by AI analysis.

Created: 2026-04-05 Source: https://arxiv.org/abs/2510.10903

Summary

A comprehensive survey (arXiv 2510.10903) covering the full landscape of robot manipulation, from classical non-learning control through modern VLA-based approaches. The key contribution is a new two-level taxonomy: high-level planners (language, code, motion, affordance, 3D representations) and low-level learning-based control (input modeling, latent learning, policy learning). The survey also provides what it presents as the first dedicated bottleneck taxonomy focused on data and generalization, and it covers benchmarks, task types, and real-world applications.

Prerequisites

Imitation Learning (IL): the dominant training paradigm for modern manipulation; required to understand the ACT, diffusion policy, and flow matching policy sections
Reinforcement Learning (RL): the survey covers SERL and other online RL approaches; needed for Section 6.1.1 and for methods that bridge IL and RL
Vision-Language Models (VLMs): VLA models build on VLMs by adding action prediction; prerequisite for Section 6.2.2
Transformer architecture: Action Chunking with Transformers (ACT) and autoregressive policy learning are transformer-based; needed throughout Section 6
Diffusion models: Diffusion Policy is diffusion-based, and Flow Matching Policy belongs to the closely related flow-based family; prerequisite for Sections 6.4.3-6.4.4
Kinematics and robot hardware: Section 2 covers hardware platforms; a basic understanding of robot kinematics helps when interpreting task descriptions

Core Idea

The survey’s central insight is that the classical split between high-level planning and low-level control is too coarse for modern learning-based systems. The new taxonomy separates high-level planning into five representation types (language, code, motion, affordance, 3D) and low-level control into three learning questions: (1) what to feed the model (input modeling), (2) how to learn a useful latent space (latent learning), and (3) how to decode the latent into actions (policy learning: MLP, Transformer, Diffusion, Flow Matching, SSM). This decomposition makes it easier to locate where a given paper contributes and to identify which bottleneck it addresses.
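The two-level taxonomy can be written down as a small data structure, which may help when tagging papers in a reading list. The following is a minimal sketch, not anything provided by the survey: the category names come from the paragraph above, while the `Contribution` record, its field names, and the `level()` helper are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PlannerRepr(Enum):
    """High-level planner representation types named in the survey."""
    LANGUAGE = "language"
    CODE = "code"
    MOTION = "motion"
    AFFORDANCE = "affordance"
    THREE_D = "3D"


class ControlStage(Enum):
    """The three low-level learning questions."""
    INPUT_MODELING = "what to feed the model"
    LATENT_LEARNING = "how to learn a useful latent space"
    POLICY_LEARNING = "how to decode the latent into actions"


class PolicyHead(Enum):
    """Policy-learning decoder families listed in the survey."""
    MLP = "MLP"
    TRANSFORMER = "Transformer"
    DIFFUSION = "Diffusion"
    FLOW_MATCHING = "Flow Matching"
    SSM = "SSM"


@dataclass(frozen=True)
class Contribution:
    """Hypothetical record locating one paper inside the taxonomy."""
    paper: str
    planner: Optional[PlannerRepr] = None
    stage: Optional[ControlStage] = None
    head: Optional[PolicyHead] = None

    def __post_init__(self):
        # A decoder family only makes sense for policy-learning work.
        if self.head is not None and self.stage is not ControlStage.POLICY_LEARNING:
            raise ValueError("head requires stage=POLICY_LEARNING")

    def level(self) -> str:
        """Report which level(s) of the taxonomy the paper touches."""
        if self.planner is not None and self.stage is not None:
            return "both"
        if self.planner is not None:
            return "high-level planning"
        if self.stage is not None:
            return "low-level control"
        return "unclassified"


# Illustrative placement: Diffusion Policy as a diffusion-based policy decoder.
dp = Contribution("Diffusion Policy",
                  stage=ControlStage.POLICY_LEARNING,
                  head=PolicyHead.DIFFUSION)
print(dp.level())
```

Keeping `planner` and `stage` as separate optional fields lets a single paper be tagged at both levels, which matches the note's point that the taxonomy is meant for locating contributions, not for forcing each paper into one bin.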

Results

The survey is taxonomic rather than benchmark-focused. Key empirical data points cited:

Aspect	Finding
Data scale	16K+ datasets in LeRobotDataset community (as of 2025)
Task diversity	130+ VLA tasks in LIBERO; 50+ in Meta-World
Hardware range	€225 SO-100 to multi-hundred-thousand-dollar humanoids
Failure modes	Data quality and generalization identified as primary bottlenecks

Limitations

Author-stated: survey scope focused on manipulation; locomotion and navigation covered only as background
Author-stated: non-learning-based methods (classical control, motion planning) covered briefly as background, not reviewed comprehensively
Unstated: the survey was written around late 2025; because the field moves fast, some of the VLA model results it reports will be superseded quickly
Unstated: the cross-embodiment generalization section is less mature than the single-arm manipulation sections, because fewer standardized benchmarks exist

Reproducibility

Code: companion GitHub repo linked (Awesome-Robotics-Manipulation)
Datasets: all benchmarks cited are publicly available (LIBERO, Meta-World, OXE, RoboSet, etc.)
Compute: N/A (survey paper)

Insights

§7.1.2 Data Utilization is the most actionable part of the survey for practitioners. Five data utilization strategies are surveyed:

Data Selection / Filtering: EIL (temporal cycle-consistency filtering), L2D (preference-based demo selection), Re-Mix (minimax domain reweighting), DC-IL (action divergence + transition diversity as quality metrics), EAD (compatibility signals), UVP (pretraining image distribution matters more than dataset size), ILID (state discriminator scoring), MimicLabs (camera-pose and spatial diversity for retrieval)
Data Retrieval: VINN (nearest-neighbor in learned latent), SAILOR (sub-trajectory skill retrieval), DINOBot (DINO-ViT pixel alignment), STRAP (visual foundation models + time-invariant alignment)
Data Augmentation: three families — label/trajectory relabeling (DAAG, S2I), physics-consistent augmentation (rigid-body transforms), generative augmentation (GenAug, RoVi-Aug for cross-embodiment zero-shot transfer)
Data Expansion: generative synthesis (Generative Predecessor Models, SAFARI, TASTE-Rob) and corrective expansion (JUICER skill decomposition, Diff-DAGger uncertainty-guided collection)
Data Reweighting: by expertise (Beliaev et al.), feasibility (FABCO), or VLM preference labels (PLARE contrastive learning)
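Two of these strategies are simple enough to sketch in a few lines: nearest-neighbor retrieval in a latent space (the idea the list attributes to VINN) and per-sample reweighting of a behavior-cloning loss. The code below is a minimal illustration under stated assumptions, not the method of any cited paper: embeddings are assumed to come from some pretrained encoder, and the softmax distance weighting, the squared-error loss, and all function names are choices made here for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_action(query: Vector,
                    demos: List[Tuple[Vector, Vector]],
                    k: int = 3,
                    temperature: float = 1.0) -> List[float]:
    """Blend the actions of the k demo embeddings nearest to the query.

    demos holds (embedding, action) pairs. Closer neighbors get larger
    weights through a softmax over negative squared distance
    (an assumed weighting scheme).
    """
    if not demos:
        raise ValueError("demos must not be empty")
    nearest = sorted(demos, key=lambda d: sq_dist(query, d[0]))[:k]
    dists = [sq_dist(query, emb) for emb, _ in nearest]
    d_min = min(dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    dim = len(nearest[0][1])
    return [sum(w * act[i] for w, (_, act) in zip(weights, nearest)) / total
            for i in range(dim)]


def weighted_bc_loss(pred: List[Vector],
                     target: List[Vector],
                     weights: Sequence[float]) -> float:
    """Weighted mean of squared action errors over a batch.

    weights could encode expertise, feasibility, or preference scores;
    how they are obtained is outside this sketch.
    """
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * sq_dist(p, t)
               for w, p, t in zip(weights, pred, target)) / total_w


# Two nearby demos agree on action [1.0]; the distant one is ignored with k=2.
demos = [([0.0, 0.0], [1.0]), ([0.1, 0.0], [1.0]), ([5.0, 5.0], [-1.0])]
action = retrieve_action([0.05, 0.0], demos, k=2)

# A zero weight removes the second sample from the loss entirely.
loss = weighted_bc_loss([[0.0], [0.0]], [[1.0], [2.0]], [1.0, 0.0])
```

Selection and filtering can be seen as the special case where the weights are 0 or 1, which is one reason the five strategies are often combined in practice.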

The bottleneck taxonomy (Section 7) is the most practically useful section for researchers. The paper explicitly treats data collection, data utilization, and generalization as distinct problems, which is rare among surveys, since most simply catalog methods.

The inclusion of flow matching policies (Section 6.4.4) alongside diffusion is notable: flow matching typically needs fewer sampling steps at inference and is increasingly competitive, which suggests that the dominance of diffusion policies may be temporary.
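The inference-speed argument comes from how a flow matching policy is sampled: an action is produced by integrating an ODE dx/dt = v(x, t) from noise at t = 0 to an action at t = 1, and when the underlying paths are close to straight lines, a handful of Euler steps can suffice. The toy sketch below illustrates this under strong assumptions. There is a single target action, so the velocity field for a linear interpolation path has the closed form (a - x) / (1 - t); that closed form stands in for a learned network, and every name here is hypothetical.

```python
import random
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def make_toy_velocity(target: List[float]) -> Velocity:
    """Closed-form velocity for a linear path that ends at one target action.

    Assumption: a real policy would replace this with a trained network
    conditioned on observations.
    """
    def velocity(x: List[float], t: float) -> List[float]:
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return velocity


def sample_action(velocity: Velocity, dim: int, num_steps: int,
                  rng: random.Random) -> List[float]:
    """Euler-integrate dx/dt = v(x, t) from Gaussian noise (t=0) to t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # always below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.3, -0.2]
v = make_toy_velocity(target)
one_step = sample_action(v, dim=2, num_steps=1, rng=random.Random(0))
ten_step = sample_action(v, dim=2, num_steps=10, rng=random.Random(0))
```

In this toy both the one-step and ten-step samplers land on the target, because the path is exactly straight. A learned velocity field is only approximately straight, so real policies trade the number of integration steps against accuracy.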

The real-world applications section (Section 8) distinguishes household, agricultural, and industrial use cases. Agricultural manipulation (harvesting, sorting) is underexplored compared with industrial bin-picking or household manipulation.

Connections

Raw Excerpt

We extend the classical division between high-level planning and low-level control by broadening high-level planning to include language, code, motion, affordance, and 3D representations, while introducing a new taxonomy of low-level learning-based control grounded in training paradigms such as input modeling, latent learning, and policy learning.
