Tracking Dexterous Hands: A Practitioner's Guide to Motion Capture for Robot Learning

本文由 AI 分析生成

建立時間： 2026-03-27 來源： https://x.com/shumochu/status/2036564960080437702

Summary

GI Labs’ practitioner’s survey of motion capture systems for dexterous robotic manipulation data collection, covering 12 recent systems and their sensing trade-offs. Key finding: no single sensing modality covers portability + global pose + finger articulation under occlusion simultaneously, so all practical systems use hierarchical or redundant sensor fusion. Tactile sensing is not optional for contact-rich tasks (+42pp success rate on a wiping task).

GI Labs 對靈巧機器人操作資料收集的運動捕捉系統進行實踐者調研，涵蓋 12 個近期系統的感測器權衡。核心發現：無單一感測模態能同時覆蓋可攜性、全局姿態和手指關節遮擋下的精度，因此所有實際系統都使用分層或冗餘感測器融合。觸覺感測對於接觸豐富任務不可選（一個擦拭任務 +42pp 成功率）。

Key Points

No sensing silver bullet: lab-grade optical mocap (sub-mm accuracy) can’t leave its camera cage; IMUs drift on yaw; SLAM degrades under heavy occlusion; EM gloves (Manus) distort near metal
Two fusion architectures: hierarchical — sensors assigned to complementary roles (SLAM for global pose, EM for fingers, composed via forward kinematics); redundant — overlapping sensors via EKF or factor graph optimization for asynchronous sensor fusion
Meta Quest 3 architecture insight: VI-SLAM for headset (~0.77 cm RPE) + active IR controller tracking (low-mm relative precision) + markerless hand tracking (1.73 cm) — three different accuracy tiers layered in one device
Apple Vision Pro vs Quest: AVP relies entirely on markerless hand tracking (lowest accuracy tier), no controllers — fundamentally different precision point from Quest’s controller-based approach
Tactile is primary for contact-rich tasks: OSMO’s 12 three-axis magnetic tactile sensors on fingertips → 72% vs 30% vision-only on wiping task; contact forces are not supplementary
DexCap reference design: SLAM + EM glove hierarchy — the practical benchmark for portable high-fidelity capture

Insights

The most actionable insight for dataset collection pipelines: your mocap choice defines what physical quantities your policy sees — if you omit tactile from collection, your policy literally never has access to contact force information during training. This is a data collection infrastructure decision that shapes what behaviors are learnable from your dataset, not just data quality.

The Quest 3 finding is practically important: its active IR controller tracking achieves low-mm precision relative to the headset, making it suitable for fine manipulation inputs (ARCap exploits this). This makes Quest 3 a surprisingly capable capture device for demos in the wild.

Connections

Raw Excerpt

Contact forces are not supplementary information that makes policies slightly better. For tasks involving sustained physical contact, they are a primary signal. Omitting them from your capture system means your policy never sees them.

bot_vault

Explorer

Tracking Dexterous Hands: A Practitioner's Guide to Motion Capture for Robot Learning

Summary

Key Points

Insights

Connections

Raw Excerpt

Graph View

Table of Contents

Backlinks