This note was generated by AI analysis
Created: 2026-03-26 Source: https://arxiv.org/abs/2507.11840
Summary
This July 2025 survey traces the full arc of robotic manipulation — from mechanical programming through learned controllers to embodied intelligence — with particular focus on contemporary dexterous systems. Two research directions dominate: data collection (simulation, human demonstrations, teleoperation) and skill learning (imitation and reinforcement learning). The paper identifies three fundamental obstacles currently blocking progress toward truly capable dexterous robots.
Prerequisites
- History of robot manipulation: the survey traces mechanical → programmed → learning-based → embodied stages, so knowing what drove each transition helps
- Gripper design: parallel jaw → multi-finger → dexterous hands; the mechanical constraints of each explain why learning approaches differ by gripper
- Sim-to-real transfer: simulation is used for pretraining and scaling data, so understanding the domain gap is necessary
- VLA (Vision-Language-Action) models: the survey culminates in VLAs as the frontier policy architecture; background in transformer-based multimodal models helps
Core Idea
Robotic manipulation is undergoing a paradigm shift: the robot’s body, sensors, and world model are increasingly treated as a unified “embodied” system rather than separable components. This shift mirrors what happened in NLP (from pipelines to end-to-end transformers) and vision (from hand-crafted features to learned representations). The key implication: data collection and policy learning are no longer separate phases — they must be co-designed, and the data distribution determines what the policy can do.
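The claim that the data distribution bounds what the policy can do can be illustrated with a toy behavior-cloning sketch. This is not from the survey: the 1-D state, the saturating `expert_action`, and the linear policy are all hypothetical choices made only to keep the example small. Demonstrations are collected in a narrow state range, and the cloned policy is then scored both inside and outside that range.

```python
import random


def expert_action(state: float) -> float:
    """Hypothetical expert: push the state toward zero, saturating at +/-1."""
    return max(-1.0, min(1.0, -2.0 * state))


def collect_demos(low: float, high: float, n: int, rng: random.Random):
    """Sample states uniformly from [low, high] and label them with the expert."""
    states = [rng.uniform(low, high) for _ in range(n)]
    return [(s, expert_action(s)) for s in states]


def fit_linear_policy(demos):
    """Behavior cloning with a linear policy: least-squares fit of action = w * state + b."""
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    var_s = sum((s - mean_s) ** 2 for s, _ in demos)
    cov_sa = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    w = cov_sa / var_s
    b = mean_a - w * mean_s
    return w, b


def mean_abs_error(policy, states):
    """Average absolute gap between the cloned policy and the expert on the given states."""
    w, b = policy
    return sum(abs((w * s + b) - expert_action(s)) for s in states) / len(states)


rng = random.Random(0)
# Narrow coverage: the expert never saturates inside [-0.4, 0.4].
demos = collect_demos(-0.4, 0.4, 200, rng)
policy = fit_linear_policy(demos)

in_dist = [i / 100 for i in range(-40, 41)]    # states the demos covered
out_dist = [i / 100 for i in range(100, 201)]  # states the demos never visited

err_in = mean_abs_error(policy, in_dist)
err_out = mean_abs_error(policy, out_dist)
print(f"in-distribution error:     {err_in:.4f}")
print(f"out-of-distribution error: {err_out:.4f}")
```

Within the demonstrated range the clone matches the expert almost exactly; outside it, the policy keeps extrapolating linearly while the expert saturates, so the error grows. Nothing in the training data could have revealed the saturation, which is the sense in which collection and learning need to be designed together.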
Results
Survey findings:
- Historical arc: mechanical (1960s) → programmed (1980s) → learning-based (2010s) → embodied AI (2020s)
- Gripper evolution, in parallel: parallel jaw → multi-finger → dexterous hands (22+ DOF)
- Policy progression: behavior cloning (BC) → generative adversarial imitation learning (GAIL) → diffusion policies → VLA foundation models
- Three fundamental obstacles: not detailed in the abstract; naming them requires access to the full paper
Limitations
- Author-stated: identifies three obstacles but does not fully resolve them (July 2025 snapshot)
- Unstated: the “embodied intelligence” framing may overstate the extent to which current VLAs have genuine world models vs. statistical pattern matching
Reproducibility
- Code: survey paper; references individual systems
- Datasets: references standard manipulation benchmarks across multiple categories
- Compute: not applicable (survey)
Insights
Framing the field as “embodied” rather than just “manipulation” reflects a field-wide shift with a significant implication: if the robot’s body and sensors are part of the model, then hardware choices become research decisions. A policy trained on one robot’s embodiment generally does not transfer directly to another. VLAs partially address this via language conditioning, since the same model can be prompted differently per robot, but the embodiment gap remains a research-level problem.
Connections
- GR-Dexter: Bimanual Dexterous VLA
- How to Train Your Robots: Demonstration Modality
- Deep Generative Models Learning from Multimodal Demonstrations
- VLA (Vision-Language-Action) models
- sim-to-real transfer
Raw Excerpt
Surveys the evolution of robotic manipulation systems progressing from mechanical programming to embodied intelligence, alongside advances in gripper technology. Focuses on two primary research directions: data collection approaches and skill-learning methodologies.