Context
Notes from a Discord discussion synthesizing four sources: How to Train Your Robots (ICRA 2025, arXiv 2503.07017), Dexterous IL Survey (2504.03515), Developments & Challenges (2507.11840), and the il-data-collection-map canvas. The notes classify and compare four human demonstration paradigms, then analyze how five technology support dimensions apply to each.
Four Human Demonstration Paradigms
① Kinesthetic Teaching
The operator physically grasps the robot arm and guides it through the desired trajectory while the robot passively records its joint-angle sequence. The robot runs in Cartesian impedance or gravity-compensation mode so it can be pushed easily. Recorded trajectories are converted to delta-pose commands through a replay step. Of the three modalities compared in the ICRA 2025 study, this one yields the highest action consistency (lowest K-NN action variance). ICRA 2025 results: 95% on Open Drawer, but last place on the contact-force task Push Sanitizer because of replay jerkiness. It cannot be used for dexterous hands (21-DoF), which is a hard physical limit rather than an engineering gap.
Pros: highest data quality; no teleoperation hardware required; most intuitive for the operator; natural haptics (the operator feels the robot's resistance directly).
Cons: physically tiring; does not scale to dexterous hands; jerkiness on contact tasks; low state diversity.
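The action-consistency metric mentioned above (K-NN action variance) can be sketched as follows. This is a minimal illustration, assuming states and actions are stored as equal-length lists of numeric tuples; the function name `knn_action_variance` and the default `k=5` are choices made here, not taken from the ICRA 2025 code.

```python
import math

def knn_action_variance(states, actions, k=5):
    """Mean per-state variance of the actions of the k nearest states."""
    n = len(states)
    total = 0.0
    for i in range(n):
        # Rank all samples by Euclidean distance to state i (i itself included).
        order = sorted(range(n), key=lambda j: math.dist(states[i], states[j]))
        neighbours = [actions[j] for j in order[:k]]
        # Population variance per action dimension, summed over dimensions.
        for d in range(len(neighbours[0])):
            vals = [a[d] for a in neighbours]
            mean = sum(vals) / len(vals)
            total += sum((v - mean) ** 2 for v in vals) / len(vals)
    return total / n
```

A lower value means that similar states were paired with similar actions, which is the sense in which kinesthetic demos are described as more consistent.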
② Teleoperation (including real-time MoCap control)
The operator remotely controls the robot in real time through an interface device, and the robot mirrors the operator's movements. The operator receives real-time feedback and can correct errors, which is the core distinction from Natural Demonstration. Motion-capture (MoCap) teleoperation is a subtype: the operator wears MoCap gloves or sensors, and the system retargets the tracked hand pose to the robot hand's DoF space in real time. Teleoperation yields high state diversity but suffers from severe covariate shift (a slightly shifted camera can drive the success rate toward zero). Representative hardware: DexCap (EM glove, 3× faster than a leader arm), DOGlove (21-DoF), Glovity (wrench feedback), Leader Arm (~€225).
Pros: scales to dexterous hands; high state diversity; real-time feedback allows correction; DexCap collects 3× faster than a leader arm.
Cons: noisy action quality; needs skilled operators; severe covariate shift; morphology errors from retargeting.
③ Natural Demonstration
The human performs the task naturally, as if with their own hands, without wearing any robot-control interface and without a robot present. A sensing system captures hand pose and trajectory, which is retargeted to the robot offline afterwards. The core distinction from teleoperation: the human gets no real-time feedback from the robot. Representative systems: UMI (cup-shaped handheld device + GoPro, captures 6-DoF wrist pose in the wild), DexUMI (extends UMI to 5-finger dexterous hands with finger-joint angle sensing), DexWild (in-the-wild dexterous human interaction). This gives the most natural motions and the fastest collection, but it lacks robot proprioception and depends heavily on retargeting quality.
Pros: most natural motion; fast collection; suited to dexterous tasks; can be collected in the wild at scale.
Cons: no real-time feedback or correction; retargeting error; requires post-processing; no proprioception.
④ Passive Observation
The robot or model learns from purely observational human video, with no special sensing hardware and no human-robot interaction. The human simply performs the task while being filmed, and the system learns as a bystander. The embodiment gap (human hand ≠ robot hand) is the fundamental obstacle: there are no action labels, no proprioception, and no tactile data. The primary use case is large-scale VLA pretraining. Representative systems: Human Policy (egocentric video), Open X-Embodiment (multi-source video integration).
Pros: nearly unlimited data; zero collection cost; best fit for large-scale VLA pretraining.
Cons: no action labels; large embodiment gap; no proprioception or tactile signal.
Key Insights
- Core trade-off axis: action consistency (highest with ① Kinesthetic Teaching) vs. state diversity (highest with ② Teleoperation / ④ Passive Observation). Best strategy in ICRA 2025: a mix of 30% kinesthetic + 70% VR demos gives ~20% higher success than either alone, capturing both properties at once.
- Distinguishing Teleoperation from Natural Demo: does the human control the robot and receive feedback from it in real time? Real-time robot feedback → Teleoperation; no feedback, human acts freely → Natural Demo.
- Hard physical limit of Kinesthetic Teaching: an operator cannot hand-guide a 21-DoF dexterous hand; this is a fundamental boundary of the method, not an engineering problem.
- Root cause of replay jerkiness: pure position replay meets contact resistance but still drives toward the recorded joint target, which produces oscillation. The fix is to integrate an F/T sensor and switch to hybrid force-position control at contact.
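The 30% / 70% mix from the first insight can be assembled with a simple stratified sample. The sketch below is only an illustration: `mix_demos`, its parameters, and the idea of storing demos as Python lists are assumptions made here, and the ICRA 2025 paper may construct its mixed dataset differently.

```python
import random

def mix_demos(kinesthetic, teleop, total, kin_fraction=0.3, seed=0):
    """Sample a training set with a fixed kinesthetic/teleop ratio."""
    rng = random.Random(seed)
    n_kin = round(total * kin_fraction)
    n_tel = total - n_kin
    if n_kin > len(kinesthetic) or n_tel > len(teleop):
        raise ValueError("not enough demos for the requested mix")
    # Draw without replacement from each pool, then shuffle the union.
    mixed = rng.sample(kinesthetic, n_kin) + rng.sample(teleop, n_tel)
    rng.shuffle(mixed)
    return mixed
```

Fixing the seed keeps the sampled subset reproducible across training runs, which matters when comparing mixing ratios.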
Technology Support Mechanisms
Column abbreviations: ① Kinesthetic Teaching, ② Teleoperation (incl. MoCap), ③ Natural Demo (UMI/DexUMI), ④ Passive Observation
1. Simulation
1a. Physical Simulation
Supporting mechanism: simulation primarily serves to multiply demonstrations; a small number of human demos seed large synthetic datasets.
① Kinesthetic Teaching: rarely used directly, because the demonstrations are already performed on the real robot and serve as ground truth. Simulation can pre-validate trajectory safety before live collection, but it does not augment the data.
② Teleoperation: the most mainstream application. Demos are collected by VR teleoperation in MuJoCo / IsaacSim; MimicGen then identifies object-centric action segments (grasp, insert) in those demos and automatically synthesizes new trajectories for novel object positions, turning ~100 human demos into 50K+ synthetic ones. IsaacLab Mimic provides an end-to-end toolchain.
③ Natural Demo: the Real2Render2Real pipeline (Berkeley 2025) scans the real object to build a 3DGS model, estimates the wrist trajectory from a human demo video, replays that trajectory in simulation with a virtual robot hand, and synthesizes paired image-action data. This fills the gap left by natural demos, which lack action labels and proprioception.
④ Passive Observation: the most direct beneficiary. GPU-accelerated parallel simulation generates large-scale video as a basis for VLA visual pretraining. RoboCasa365 synthesized 1,615 hours of data with MimicGen.
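The object-centric idea behind MimicGen-style augmentation can be illustrated with a planar toy example: express a demo segment in the object's frame, then re-express it relative to a new object pose. This is a simplified sketch using (x, y, theta) poses; the helper names are hypothetical and this is not the MimicGen API, which works with full 6-DoF poses.

```python
import math

def to_object_frame(point, obj_pose):
    """Express a world-frame (x, y) point in the object's frame."""
    ox, oy, oth = obj_pose
    dx, dy = point[0] - ox, point[1] - oy
    c, s = math.cos(-oth), math.sin(-oth)
    return (c * dx - s * dy, s * dx + c * dy)

def to_world_frame(point, obj_pose):
    """Express an object-frame (x, y) point in the world frame."""
    ox, oy, oth = obj_pose
    c, s = math.cos(oth), math.sin(oth)
    return (ox + c * point[0] - s * point[1],
            oy + s * point[0] + c * point[1])

def adapt_segment(waypoints, demo_obj_pose, new_obj_pose):
    """Replay an object-centric segment relative to a new object pose."""
    local = [to_object_frame(p, demo_obj_pose) for p in waypoints]
    return [to_world_frame(p, new_obj_pose) for p in local]
```

Because the segment is stored relative to the object, one recorded grasp approach can be reused for any sampled object placement, which is what lets a small set of demos seed a large synthetic dataset.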
1b. Tactile Simulation
Supporting mechanism: TacEx (2025) chains two simulators — a soft-body physics engine computes contact deformation, then a visual-tactile renderer converts deformation to GelSight images. Policies trained purely in sim on tactile-action pairs transfer to real robots because the domain gap between sim and real GelSight output is small.
① Kinesthetic Teaching: rarely needed; both the human and the robot sense contact directly during kinesthetic demos, so simulation adds little.
② Teleoperation: TacEx supports training GelSight tactile policies in simulation and then transferring them sim-to-real. FOTS (Fast Optical Tactile Simulator) speeds up simulation of optical tactile sensors.
③ Natural Demo: hard to integrate. Human hand contact sensations do not map directly to robot sensor output; a cross-modal mapping is required, and no mature pipeline exists yet.
④ Passive Observation: no tactile data is available. Tactile simulation could serve as a supplementary data source, but the embodiment gap is large.
1c. Gaussian Splatting (3DGS)
Supporting mechanism: 3DGS records scene geometry and appearance in an editable form (object positions, lighting, viewpoint) without re-simulating contact physics, which makes it well suited to visual diversification of existing demos.
① Kinesthetic Teaching: RoboSplat (RSS 2025) scans the scene once, then edits Gaussian positions and lighting so that each kinesthetic demo can be rendered into many visual variants, addressing the low demo volume.
② Teleoperation: the strongest beneficiary. Real2Render2Real takes a multi-view object scan plus a monocular human demo video and synthesizes diverse, domain-randomized robot execution data, which directly addresses covariate shift. GSWorld provides a closed-loop photorealistic manipulation simulation environment.
③ Natural Demo: directly served by Real2Render2Real's monocular human-video input path: natural demo video + scene scan → robot training data with action labels.
④ Passive Observation: GSWorld (2510.20813) builds a closed-loop photorealistic simulation from 3DGS, so passive-observation video can be replayed and augmented under controlled conditions.
2. Multi-modal Sensing
2a. Audio
Audio is currently the least deployed modality. Sparsh-X (Meta, 2025) treats contact sound as one of four tactile modalities (image / audio / motion / pressure); its unified representation, trained on ~1M contact interactions, improves policy success rate by 63%. Audio is most applicable to ② Teleoperation (contact sound conveys contact quality that vision cannot perceive) and ④ Passive Observation (the video soundtrack provides a weak supervision signal for the quality of human hand contact).
2b. Force / Torque
Supporting mechanism per approach:
① Kinesthetic Teaching: the most critical gap. Replay jerkiness arises because pure position control keeps driving toward the recorded target even when the robot meets contact resistance. With a wrist F/T sensor integrated, the controller can switch to hybrid force-position control when an external force is detected, so the replay yields naturally at contact instead of oscillating.
② Teleoperation (Glovity): the key insight is to use the wrist wrench as a policy observation rather than a control signal. The policy learns what wrench distribution the wrist should experience at contact, which is richer contact semantics than a position trajectory alone, and this makes it more robust on contact-rich tasks.
③ Natural Demo: TacCap (Stanford) takes a same-sensor approach: identical FBG fiber-optic fingertip sensors are mounted on both the human hand and the robot fingers and measure the same physical quantity (fiber strain), so the domain gap approaches zero and no cross-modal conversion is needed.
④ Passive Observation: force/torque cannot be obtained from video, which is a hard gap for contact-rich tasks.
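The force-triggered switch described for Kinesthetic Teaching replay can be sketched in one dimension. This is a deliberately simplified admittance-style rule, not a full hybrid force-position controller (which would use task-frame selection matrices); the function name, the thresholds, and the units (metres, newtons) are illustrative assumptions.

```python
def replay_step(x_now, x_recorded, f_measured, f_desired=5.0,
                f_threshold=2.0, max_step=0.005, compliance=0.0005):
    """One 1-D replay step that yields on contact instead of pushing through."""
    if abs(f_measured) < f_threshold:
        # Free space: track the recorded position with a bounded step.
        step = x_recorded - x_now
    else:
        # Contact: regulate force; back off when force exceeds the set-point.
        step = compliance * (f_desired - abs(f_measured))
    # Bound the step size in both modes.
    step = max(-max_step, min(max_step, step))
    return x_now + step
```

In free space the step follows the recording; once the measured force crosses the threshold, the recorded target is ignored and the motion is driven by the force error, which is the "yielding" behaviour pure position replay lacks.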
2c. Tactile
Supporting mechanism: Tactile sensing enables contact-state awareness that vision cannot provide. The key design choice is how to bridge the human-robot domain gap.
① Kinesthetic Teaching: rarely used. A tactile glove could record the human hand's contact distribution and map it to the robot's tactile sensors, but the cross-modal mapping is difficult.
② Teleoperation: the richest ecosystem: OSMO (Meta FAIR, magnetic tactile array), TacCap (Stanford, FBG fingertip), DOGlove (5-DoF haptic force), TAG Glove (EOP tactile array, bidirectional teleoperation), DexCap (EM glove). The ICRA 2025 ViTac Workshop showed that tactile integration has a decisive effect on success rates for contact-rich tasks.
③ Natural Demo (OSMO mode): a magnetic tactile array records the contact distribution while the human manipulates objects naturally. Meta FAIR demonstrated that dexterous policies can be trained this way with zero real-robot data.
④ Passive Observation: completely absent, which is a core bottleneck on the performance ceiling for contact-rich tasks.
2d. Vision
Vision is the foundation modality for all four approaches. The key differences lie in viewpoint design: ① fixed workspace cameras; ② optionally head-mounted stereo vision (Open-TeleVision); ③ UMI uses a wrist-mounted GoPro for an in-the-wild first-person view; ④ any human video viewpoint (third-person or egocentric).
3. Multi-modal Feedback (Feedback to the Demonstrator)
3a. Vision Feedback
① Kinesthetic Teaching: the operator is physically present and watches the robot directly, so no additional visual device is needed.
② Teleoperation: the most diverse options: Open-TeleVision (stereo head-mounted display that provides depth perception); ARCap (AR overlay of the virtual robot hand's expected trajectory; Stanford 2024 reports a 40%+ improvement in demo quality for non-expert users); and ordinary 2D screens. ARCap's mechanism: the AR view shows a guide path meaning "do it this way so the robot can reproduce it", which offsets the cognitive load of skill transfer.
③ Natural Demo: none by design. The human acts freely without seeing the robot's state. This is both the method's design principle (natural motion) and one of its ceiling limitations (no correction is possible).
④ Passive Observation: N/A.
3b. Audio Feedback
Audio feedback is barely deployed in any of the methods and is not yet a mainstream feedback channel. Some work uses contact sound (scraping, collision) as a weak supervision signal for task completion.
3c. Haptic Feedback
Supporting mechanism: Haptic feedback closes the sensing loop for the demonstrator — without it, the operator only has visual cues to judge contact, which has latency and lacks force magnitude information.
① Kinesthetic Teaching (natural haptic advantage): the operator's hand directly feels the robot's resistance, compliance, and contact reaction forces, and the human proprioceptive system adjusts the applied force automatically. This is one hidden reason for Kinesthetic Teaching's high action consistency. It also explains why jerkiness appears on contact-force tasks at replay time: the natural haptic loop no longer exists during replay.
② Teleoperation: a core research direction. DOGlove (5-DoF vibration motors): the robot finger detects contact force → the corresponding vibration motor fires → the operator "feels" the contact and regulates force naturally. Glovity (6D wrench feedback): the robot's wrist wrench vector is fed back to the operator's wrist → the operator feels the same force and torque distribution as the robot and is guided toward the correct contact posture. TAG Glove (EOP tactile array, bidirectional).
③ Natural Demo: UMI / DexUMI / DexWild all lack haptic feedback, which is a ceiling limitation for fine contact tasks. The human's proprioception reflects only their own hand and is completely decoupled from the robot's contact state.
④ Passive Observation: N/A.
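The force-to-vibration loop described for teleoperation gloves can be illustrated with a simple mapping from contact force to vibration intensity. This is a hypothetical sketch: the deadband, the saturation force, and the linear scaling are assumptions made here and do not describe DOGlove's actual mapping.

```python
def vibration_duty(contact_force_n, f_min=0.2, f_max=10.0):
    """Map fingertip contact force (N) to a vibration duty cycle in [0, 1]."""
    if contact_force_n <= f_min:
        return 0.0  # below the deadband: treat as no contact
    # Linear ramp between the deadband and the saturation force.
    scaled = (contact_force_n - f_min) / (f_max - f_min)
    return min(1.0, scaled)
```

The deadband keeps sensor noise from buzzing the motor in free space, and the saturation cap keeps the command within the actuator's range.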
4. End-effector Morphology
4a. Gripper Type → Demo Method Compatibility
End-effector morphology and demo method constrain each other in both directions: morphology determines which methods are feasible, and the demo method in turn influences end-effector design (Natural Demo pushes toward dexterous hands shaped as closely as possible to the human hand).
Parallel gripper (1-2 DoF): ① Kinesthetic Teaching is the most natural (pinching the gripper closed is intuitive); ② Teleoperation (Leader Arm); ③ Natural Demo (UMI cup-shaped device); ④ Passive Observation (embodiment gap is relatively small).
Dexterous hand (16-21 DoF): ① completely infeasible; ② Teleoperation (DexCap / DOGlove); ③ Natural Demo (DexUMI / DexWild); ④ usable, but the embodiment gap is very large.
Each order-of-magnitude increase in DoF is estimated to multiply IL demo requirements by 10-50× (parallel gripper: 200-500 demos; 21-DoF dexterous hand: 1,000-5,000 demos), because a higher-dimensional action space demands broader state coverage. The quoted demo counts themselves correspond to a 2-25× increase (1,000/500 to 5,000/200), so the 10-50× figure should be read as a rough estimate.
4b. Material (Soft vs. Rigid)
Soft end-effectors reduce contact sensitivity and improve fault tolerance, but they make ① Kinesthetic Teaching replay harder to predict: deformation dynamics cannot be fully captured by joint angles alone, so the replayed trajectory and the actual motion diverge. Rigid end-effectors give more stable replay, which is why most kinesthetic teaching experiments use rigid grippers. The tactile-reactive gripper (npj Robotics 2026) integrates an active palm that can sense and apply force at the same time, and represents the frontier of soft-rigid integration.
4c. Wrist (Wrist DoF)
High-DoF wrists are a strength of ② Teleoperation: Glovity's wrench feedback is designed specifically for wrist torque, enabling precise capture of complex wrist rotations and force application. The core data of ③ Natural Demo (UMI) is the 6-DoF wrist pose, so wrist morphology directly constrains the range of motion a UMI device can capture. ① Kinesthetic Teaching struggles to capture fine wrist rotation, because controlling wrist torque by hand guidance is unintuitive.
4d. Number of Fingers
- 1-2 DoF (gripper) → ① Kinesthetic Teaching is the first choice
- 5-finger dexterous hand (21+ DoF) → requires ② Teleoperation (DexCap / DOGlove) or ③ Natural Demo (DexUMI)
- Full humanoid hands → teleoperation with Open-TeleVision full-body tracking
- ④ Passive Observation faces an exponentially growing embodiment gap
5. Motion Retargeting
5a. Root Cause of Retargeting Error
The same-named joints on a human hand and a robot hand are not at the same locations. Even when the fingertips align, intermediate joints (PIP, DIP) may have completely different configurations. Fingertip-only retargeting (FK→IK) gives correct fingertips but unreliable intermediate-joint data for IL training. Full-configuration retargeting (keyvector optimization) preserves visual similarity but is computationally expensive, and the optimization may fail to converge.
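The fingertip-versus-intermediate-joint mismatch can be seen in a toy planar 2-link finger. The sketch below runs FK on a "human" finger and analytic IK on a "robot" finger with different link lengths; all link lengths and joint angles are made-up illustrative values, and real hands need 3-D kinematics with more joints.

```python
import math

def fk(l1, l2, q1, q2):
    """Fingertip (x, y) of a planar 2-link finger."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))

def ik(l1, l2, x, y):
    """Analytic IK for the branch with q2 >= 0 (finger flexed)."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("fingertip target out of reach")
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2),
                                       l1 + l2 * math.cos(q2))
    return q1, q2

human_q = (0.4, 0.9)                  # human joint angles (rad)
tip = fk(0.05, 0.03, *human_q)        # human finger: 5 cm + 3 cm links
robot_q = ik(0.06, 0.03, *tip)        # robot finger: 6 cm + 3 cm links
robot_tip = fk(0.06, 0.03, *robot_q)  # matches `tip` up to rounding
```

The robot fingertip lands on the human fingertip, yet the robot's second joint angle differs from the human's by several tenths of a radian, which is exactly the intermediate-joint discrepancy described above.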
5b. Role of Retargeting per Demo Method
① Kinesthetic Teaching: not needed at all. The robot's own joint angles are recorded directly, making this the only method with zero retargeting error.
② Teleoperation (MoCap type): real-time retargeting. DexCap tracks human hand joints with an EM glove → FK computes fingertip positions → IK solves for robot hand joint angles. ByteDexter uses weighted keyvector optimization while also suppressing involuntary motions (such as unintentional finger tremor) to prevent robot collisions. Retargeting is computed live during the demonstration, so errors affect operation quality immediately.
③ Natural Demo: the most dependent on retargeting, and processed offline in batch. UMI retargets the 6-DoF wrist pose; DexUMI retargets 5-finger joint configurations; DexWild retargets the full hand configuration. If frames where retargeting failed are not filtered out, they remain in the dataset as corrupted demonstrations and directly harm training quality.
④ Passive Observation: the hardest case. Hand pose estimation from video already carries error, and retargeting adds a second error source on top. This compounded error is a major reason passive observation underperforms on dexterous tasks.
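Filtering failed retargeting frames before training, as recommended for offline pipelines, might look like the sketch below. The frame fields (`human_tip`, `robot_tip`, `ik_converged`) and the 1 cm threshold are hypothetical names and values chosen for illustration, not the data format of UMI, DexUMI, or DexWild.

```python
import math

def filter_frames(frames, max_tip_error_m=0.01):
    """Keep frames whose retargeted fingertip stays close to the human one."""
    kept = []
    for frame in frames:
        # Residual between the human fingertip and the retargeted robot fingertip.
        err = math.dist(frame["human_tip"], frame["robot_tip"])
        if frame["ik_converged"] and err <= max_tip_error_m:
            kept.append(frame)
    return kept
```

Dropping individual frames leaves gaps in a trajectory, so in practice one might instead discard or re-segment the whole episode when too many frames fail; that policy choice is outside this sketch.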
5c. Latest Retargeting Methods (2025)
- DexFlow (arXiv 2505.01083): traditional retargeting looks only at geometric similarity and is unaware of hand-object contact, so fingertips tend to penetrate the object surface. DexFlow first estimates the human hand-object contact map, then adds contact constraints to the retargeting optimization, so the robot hand's contact points correspond to the human's without interpenetration.
- ManipTrans (CVPR 2025): after geometric retargeting, an RL agent applies residual corrections in physics simulation. There is no need to hand-design fine-grained optimization objectives; RL finds the small adjustments that make the motion physically feasible. Supports cross-embodiment transfer across Shadow / Inspire / MANO.
- DexMachina (arXiv 2505.24853): functional retargeting for bimanual tasks; optimizes for task completion (functional fidelity) rather than visual similarity.
- Fingertip Pinch Objective (arXiv 2506.09384): the analysis shows that the fingertip pinch objective matters most for fine manipulation, and that fingertip orientation (not only position) is also essential for tasks that require a specific contact angle.
5d. Error Mitigation Strategies
- Same-sensor approach (TacCap): identical sensors on the human hand and the robot hand; the domain gap approaches zero.
- Null-space secondary objective: add intermediate-joint configuration constraints in the IK null space, driving PIP/DIP toward the human hand's configuration.
- Residual RL correction layer (ManipTrans): RL correction applied after geometric retargeting to enforce physical feasibility.
- Contact map constraints (DexFlow): explicitly model the correspondence between human and robot finger contact points, preventing interpenetration and slipping.
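The null-space secondary objective in the list above can be written with the standard gradient-projection form of redundancy resolution. This is the generic textbook formula, not the formulation of any specific retargeting system; the symbols $J$, $q_{\mathrm{human}}$, and the gain $k$ are notation introduced here for illustration.

```latex
\dot{q} \;=\; J^{+}\,\dot{x}_{\mathrm{tip}} \;+\; \left(I - J^{+}J\right)\,k\,\left(q_{\mathrm{human}} - q\right)
```

Here $J$ is the fingertip Jacobian of the robot finger and $J^{+}$ its pseudoinverse. The first term tracks the desired fingertip velocity; the second term pulls the joints toward the human-like configuration but is projected by $(I - J^{+}J)$ so that it does not disturb the fingertip. The second term only has an effect when the finger has more joints than the fingertip task has dimensions.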
Task Complexity & Dexterity Levels
Sources: 2510.10903 §4, 2504.03515, 2507.11840, 2503.07017, il-data-collection-map canvas
Grading principle: as the level increases, the required sensing modalities, DoF, and temporal horizon all grow. Each level marks a threshold where the previous control strategy (pure position control, single arm, rigid body) starts to break down.
Lv 1: Non-contact Transport (Basic Pick & Place)
Single-object manipulation with no contact constraints during transit (free-space trajectory). Coarse tolerance (> 5 mm). A 2-DoF gripper is sufficient. No contact-force modulation is needed.
Representative tasks:
- Grasping an object from a table and transporting it to a target location
- Coarse block stacking
- Sorting by color or shape
- Open X-Embodiment (OXE) foundational task set
Why grouped: the task only requires accurate grasp pose estimation and free-space trajectory planning. The robot does not need to maintain or regulate any contact force during transit. The only failure modes are "failed grasp" and "wrong placement", so the policy only needs to learn position correspondences.
Demo methods: all four methods are applicable. Passive Observation via VLA pretraining is sufficient.
Lv 2: Precision Placement / Contact-Guided
The target location has geometric constraints, with a tolerance of 1-5 mm. Contact at the end of the task is unavoidable but brief. Approach path planning is required, but the contact itself needs no ongoing force regulation.
Representative tasks:
- Open Drawer (ICRA 2025: Kinesthetic 95%, Hybrid 100%): pulling a handle in a constrained space
- Peg-in-hole insertion, USB insertion
- Precise cup stacking (exact alignment)
- Lid placement and removal
- SD card insertion, connector plugging and unplugging
Why grouped: these tasks are harder than Lv1 only in final position precision, not in force control. Contact is a one-time "arrival" event that does not need to be maintained. ICRA 2025 results show Kinesthetic Teaching is strongest at this level because of its high action consistency.
Demo methods: ① Kinesthetic Teaching is strongest; ② Teleoperation is feasible; ③ Natural Demo (UMI) is feasible for gripper tasks.
Lv 3: Contact-Force Regulation
Contact must be continuously maintained and dynamically regulated during execution. Pure position control is insufficient; F/T sensing or impedance control is required. This is the level at which Kinesthetic Teaching replay jerkiness becomes a critical failure.
Representative tasks:
- Push Sanitizer (ICRA 2025: Kinesthetic 35%, VR Teleoperation 55%, a reversed ranking): pushing an object while maintaining contact force
- Flip Glass (ICRA 2025: Kinesthetic 70%, Hybrid 75%): flipping a glass; the contact point evolves throughout
- Surface wiping / cleaning (must maintain contact pressure while moving)
- Pouring water or other liquids (tilt + contact at the vessel rim)
- Turning a door handle (rotational force + geometric constraint)
- Screw tightening (torque + rotation)
- Contact-rich manipulation tasks described in 2507.11840
Why grouped: the challenge shifts from positional accuracy to regulating force over time. Replay jerkiness is fully exposed here: when the robot meets contact resistance, pure position control keeps forcing it toward the target position, which produces oscillation. An F/T sensor plus switching to hybrid force-position control is required.
Demo methods: ② Teleoperation with haptic feedback (Glovity) fits best; ① Kinesthetic Teaching breaks down on contact-force tasks; ③ Natural Demo is feasible when paired with tactile sensing (TacCap / OSMO); ④ Passive Observation can recover almost no contact-force information.
Lv 4: Multi-finger In-Hand Manipulation
Object motion happens inside the hand rather than being driven by arm motion. Requires 5 fingers and 16-21 DoF. A parallel gripper is physically incapable of these tasks regardless of policy quality. Finger coordination and opposition are central.
Representative tasks:
- Pen spinning, in-hand object rotation
- Coin rolling across the fingers
- Turning a key inside a lock after insertion
- Fingertip nut tightening
- Banana peeling (cited in 2504.03515): two-finger coordinated stripping
- Buttoning (fingertip precision + force)
- Bottle cap removal (fingertip rotation)
- Dexterous fingers (21-DoF+, DexHand, Shadow Hand) in the il-data-collection-map canvas
Why grouped: this is the physical boundary of parallel grippers. A 2-DoF gripper simply lacks the degrees of freedom to reorient an object in hand, regardless of policy quality. The control space dimensionality jumps from 6-DoF to 21+ DoF. 2504.03515 identifies this as the frontier challenge for imitation learning.
Demo methods: only ② Teleoperation (DexCap / DOGlove) or ③ Natural Demo (DexUMI / DexWild) is feasible; ① Kinesthetic Teaching is completely infeasible; ④ Passive Observation performs poorly because of the very large embodiment gap.
Lv 5: Long-horizon / Bimanual / Deformable
Tasks that exceed single-arm capability, require state tracking across steps, or involve deformable objects for which the rigid-body assumption breaks down completely. Several of these properties usually appear together. This is the current research frontier in robot learning.
Representative tasks:
Long-horizon multi-step:
- Shrimp cooking (mentioned in the canvas): wash, shell, cut, cook; each step's state depends on the previous one
- Dishwashing (Mobile ALOHA): pick up, rinse, place; mobile platform + manipulation
- Tea brewing, meal preparation
Bimanual coordination:
- Cloth folding (mentioned in both 2504.03515 and the canvas): two-arm coordination + deformable fabric
- Knot tying / bow tying: deformable rope + precise bimanual coordination
- Assembly tasks (one hand holds, the other operates)
- The Bimanual group in the il-data-collection-map canvas
Deformable objects:
- Kneading dough (explicitly listed in 2510.10903 §4)
- Fabric folding
- Cutting soft food
Humanoid tasks (highest complexity):
- Table tidying (GR00T N1.5 / π₀)
- Dressing assistance (contact + deformable + human safety)
- 2507.11840 identifies this as the ultimate goal of embodied intelligence
Why grouped: these tasks exceed the assumptions of the previous four levels along several dimensions at once: the temporal horizon exceeds a single action chunk; the object physics is no longer rigid-body; or two manipulators must coordinate. Any one of these properties raises difficulty sharply, and Lv5 tasks usually combine several.
Demo methods: long-horizon → Teleoperation (the VR operator needs a full understanding of the task) or Passive Observation (human cooking videos → VLA pretraining); bimanual → Teleoperation (ALOHA bimanual teleoperation) fits best; deformable objects → Teleoperation + tactile sensing is currently the most viable route.
POMDAR Dexterity Definition & Task Grouping
Definition of Dexterity
Starting from the rehabilitation medicine literature, POMDAR defines dexterity as: "Coordinated voluntary movement for functional object manipulation, emphasizing speed and task completion."
The paper then operationalizes this: dexterity is the outcome of the hand × task interaction (motor control meeting external task constraints), not a static kinematic property of the hand itself (joint count, workspace size). It must be evaluated through performance on actual contact-rich manipulation tasks.
Four-Category Rationale
Key point: POMDAR's four categories are not difficulty tiers but different motion dimensions of dexterity. Each category isolates one specific hand motion capability, and mechanical scaffolding prevents compensatory strategies (e.g., using arm motion to substitute for finger motion).
Vertical (V1-V3) + Horizontal (H1-H5) → In-hand Dexterity. From Elliott & Connolly's manipulation taxonomy. The V-series tests coordinated in-hand angular adjustment (±15° to ±45°). The H-series tests multi-finger coordination along rails of increasing curvature. Together they represent in-hand object reorientation.
Continuous Rotation (C1-C4) → Sustained Rotational Control. A gravity-based clutch mechanism means the task can only advance through continuous finger rotation, not a single large motion. The paper explicitly calls this category "generally the most challenging".
Grasping (G1-G6) → Grasp Quality. The only category without external scaffolding. Tests stable relocation after grasping, without dropping the object. Derived from the 33 Feix GRASP types; a prerequisite capability for the other three categories.
Implied difficulty order (inferred from experimental results): G (grasping) → V/H (in-hand adjustment) → C (continuous rotation)
Key experimental findings (ORCA hand, 1,140 teleoperated trajectories):
- 2-finger (5 DoF): completes only the G-series; V/H/C all fail
- 5-finger without abduction: completes some V/H tasks, but the C-series remains limited
- Full 5-finger (16 DoF): decisively better on tasks that depend on abduction (thumb opposition, finger spreading)
Core conclusion: abduction DoF is the physical root cause of crossing the G → V/H/C boundary. The paper explicitly states that the improvement is task-dependent: abduction capability determines which dexterity dimensions can be unlocked.
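POMDAR Benchmark Mapping (arXiv 2604.09294)
POMDAR (A Benchmark of Dexterity for Anthropomorphic Robotic Hands) provides a standardized 18-task dexterity benchmark grounded in human motor-control taxonomies, and it directly supports the grading framework above.
POMDAR tasks × levels in these notes:
Lv 1-2 (basic grasping / precision placement) → G1-G6 pure grasping tasks. Six grasping tasks (cylinder, sphere, disc, in different sizes), derived from the 33 Feix GRASP types. They test only grasp robustness and need no in-hand reorientation. A 2-DoF gripper can complete some of them; in the evaluation of the 2-finger (5 DoF) ORCA hand, only G1-G6 were achievable.
Lv 3 (contact-force regulation) → H1-H5 horizontal tasks. Manipulation along curved rails (Scissors / Chopsticks / Palmar / Pinch / Squeeze) requires multi-finger coordination to maintain contact and regulate force; increasing rail curvature means increasing difficulty. These tasks need continuous force regulation rather than pure position control.
Lv 4 (multi-finger in-hand manipulation) → V1-V3 vertical tasks + C1-C4 continuous rotation tasks
- V1-V3 (Wheel / Stick / Sphere): in-hand angular adjustment, with the constraint angle increasing from ±15° to ±45°. This is the core definition of reorienting an object in hand.
- C1-C4 (Thread / Stick / Wheel / Fidget): sustained rotational control; the gravity-based clutch mechanism tests persistent fine rotation, the most demanding form of in-hand manipulation.
Key validation of the grading framework from POMDAR:
- The 2-finger (5 DoF) hand completes only the G-series and fails V/H/C entirely → directly supports the hard boundary that the tasks mapped to Lv3/Lv4 need a dexterous hand (≥ 16 DoF)
- The full 5-finger (16 DoF) hand has a decisive advantage on abduction tasks (thumb opposition, finger spreading) → abduction DoF is the physical root cause of the Lv3→Lv4 boundary
- The human baseline (6 subjects, recorded with MoCap gloves, 3 trials per task) provides an absolute score reference, making cross-design comparison possible
Taxonomy sources: Elliott & Connolly's 13 manipulation coordination patterns, Ma & Dollar's 14 (adding Finger Pivoting/Tracking), and the 33 Feix GRASP grasp types. Of the 33 grasp types, 16 already appear among the manipulation patterns, which indicates that manipulation and grasping are tightly coupled at the biomechanical level.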
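Level × Method Coverage Matrix
| Method | Lv1 | Lv2 | Lv3 | Lv4 | Lv5 |
| --- | --- | --- | --- | --- | --- |
| ① Kinesthetic Teaching | ✅ | ✅ | ❌ | ❌ | ❌ |
| ② Teleoperation | ✅ | ✅ | ✅ | ✅ | ✅ |
| ③ Natural Demo | ✅ | ✅ | ✅* | ✅ | ⚠️ |
| ④ Passive Observation | ✅ | ⚠️ | ❌ | ❌ | ⚠️ |
*③ at Lv3 requires tactile sensing integration (TacCap / OSMO).
End-effector hard constraints:
- Gripper (≤ 6-DoF) → upper limit is Lv1-Lv3
- Dexterous hand (16-21 DoF) → necessary for Lv4
- Bimanual system → necessary for Lv5 bimanual tasks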
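For use in a data-collection planning script, the coverage matrix above can be encoded as a lookup table. This is one possible encoding, with ✅ stored as "yes", ✅* as "conditional" (tactile sensing required), ⚠️ as "partial", and ❌ as "no"; the names `COVERAGE` and `feasible_methods` are illustrative.

```python
COVERAGE = {
    # Entries are ordered Lv1 .. Lv5, matching the matrix columns.
    "kinesthetic": ["yes", "yes", "no", "no", "no"],
    "teleoperation": ["yes", "yes", "yes", "yes", "yes"],
    "natural_demo": ["yes", "yes", "conditional", "yes", "partial"],
    "passive_observation": ["yes", "partial", "no", "no", "partial"],
}

def feasible_methods(level, include_partial=False):
    """Return the demo methods marked feasible for a task level (1-5)."""
    if not 1 <= level <= 5:
        raise ValueError("level must be between 1 and 5")
    accepted = {"yes", "conditional"}
    if include_partial:
        accepted.add("partial")
    return [name for name, row in COVERAGE.items()
            if row[level - 1] in accepted]
```

For example, `feasible_methods(4)` returns only teleoperation and natural demonstration, matching the Lv4 column of the matrix.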
Connections
- How to Train Your Robots Demonstration Modality
- dexterous-manipulation-imitation-learning-survey
- dexterous-embodied-robotic-manipulation-survey
- il-data-collection-map
- mocap-glove-fingertip-admittance-vs-retargeting
- human-demo-collection-simulation-across-domains
- dexterous-hand-tactile-data-collection-devices-lfd
- robot-manipulation-unified-survey-2510-10903
- pomdar-dexterity-benchmark-anthropomorphic-robotic-hands