This note was generated by AI analysis.
Created: 2026-03-26 Source: https://dex-cap.github.io/
Summary
DexCap is a portable, wearable motion capture system for collecting dexterous hand manipulation data without teleoperation. The operator wears mocap gloves (EMF finger tracking) and a chest-mounted camera rack (SLAM + RGB-D) and performs the task naturally; the DexIL algorithm then retargets the demonstrations to a robot via fingertip inverse kinematics. The system is evaluated on 6 dexterous tasks and reaches about 3x the data-collection throughput of teleoperation. It supports bimanual manipulation and human-in-the-loop correction.
Prerequisites
- Inverse kinematics (IK) — fingertip positions from mocap gloves are converted to robot joint angles via fingertip IK; understanding IK is essential for the retargeting step
- Diffusion Policy — DexIL uses diffusion policy for imitation learning; the policy takes point clouds as input and outputs action sequences
- SLAM (Simultaneous Localization and Mapping) — DexCap uses SLAM cameras for occlusion-resistant wrist tracking; basic understanding of visual odometry helps
- Point cloud processing — 3D observations are represented as point clouds; familiarity with spatial representations and registration helps
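To make the fingertip IK prerequisite concrete, here is a minimal sketch of position-only IK for a single planar two-joint finger, solved with damped least squares. The finger model, link lengths, damping value, and function names are illustrative assumptions; this is not DexCap's or DexIL's actual solver, and a real robot hand has more joints per finger plus joint limits.

```python
import math

def forward_kinematics(q, lengths):
    """Fingertip (x, y) of a planar two-joint finger with joint angles q (radians)."""
    l1, l2 = lengths
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y

def jacobian(q, lengths):
    """2x2 Jacobian of the fingertip position with respect to the joint angles."""
    l1, l2 = lengths
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def fingertip_ik(target, lengths, q0=(0.3, 0.3), damping=1e-3,
                 max_step=0.5, iters=200, tol=1e-6):
    """Iterate dq = J^T (J J^T + damping^2 I)^-1 e until the fingertip reaches target."""
    q = list(q0)
    for _ in range(iters):
        x, y = forward_kinematics(q, lengths)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            break
        J = jacobian(q, lengths)
        # Entries of the symmetric 2x2 matrix J J^T + damping^2 I.
        a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
        b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
        d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
        det = a * d - b * b
        # u = (J J^T + damping^2 I)^-1 e, using the closed-form 2x2 inverse.
        ux = (d * ex - b * ey) / det
        uy = (-b * ex + a * ey) / det
        dq = [J[0][0] * ux + J[1][0] * uy, J[0][1] * ux + J[1][1] * uy]
        # Clamp the step so a large initial error cannot fling the joints around.
        scale = min(1.0, max_step / max(abs(dq[0]), abs(dq[1]), 1e-12))
        q[0] += scale * dq[0]
        q[1] += scale * dq[1]
    return q

# Hypothetical link lengths in metres and a reachable target fingertip position.
LENGTHS = (0.05, 0.03)
TARGET = forward_kinematics((0.6, 0.9), LENGTHS)
solution = fingertip_ik(TARGET, LENGTHS)
reached = forward_kinematics(solution, LENGTHS)
```

The check that matters is that `reached` matches `TARGET`, not that `solution` equals the angles used to generate the target: a two-joint finger generally has two joint configurations for the same fingertip position.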
Core Idea
Teleoperation forces the operator to work through an interface (a VR controller or joystick) that makes motion less natural and lowers throughput. DexCap removes the interface: the operator performs the task naturally while wearing sensors. SLAM cameras track the wrist poses, and EMF mocap gloves track individual finger joints; both resist occlusion, unlike vision-based hand tracking, which fails when fingers overlap. A quick-release buckle moves the same camera system from operator to robot in under 20 seconds, which removes the visual domain gap between demonstration and deployment. DexIL then retargets the data to the robot via fingertip IK and learns from it with a diffusion policy.
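One way to picture how wrist tracking and finger tracking combine: if the gloves report fingertip positions relative to the wrist and SLAM reports the wrist pose in a fixed frame, a rigid transform places the fingertips in that fixed frame. The sketch below illustrates only that composition. The frame conventions, function names, and numbers are assumptions for illustration, not details taken from DexCap.

```python
import math

def rotation_z(yaw):
    """3x3 rotation matrix for a rotation of yaw radians about the z-axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform_point(rotation, translation, point):
    """Apply the rigid transform p_world = R @ p_local + t to one 3D point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def fingertips_in_world(wrist_rotation, wrist_translation, fingertips_wrist):
    """Map fingertip positions from the wrist frame into the world frame."""
    return [
        transform_point(wrist_rotation, wrist_translation, p)
        for p in fingertips_wrist
    ]

# Hypothetical wrist pose: rotated 90 degrees about z, located at (1.0, 0.0, 0.5) m.
wrist_R = rotation_z(math.pi / 2)
wrist_t = (1.0, 0.0, 0.5)
# Hypothetical fingertip positions in the wrist frame, in metres.
tips_wrist = [(0.1, 0.0, 0.0), (0.0, 0.05, 0.02)]
tips_world = fingertips_in_world(wrist_R, wrist_t, tips_wrist)
```

The same `transform_point` call applies to every point of a point cloud, which is how observations taken from a moving camera can be expressed in one consistent frame.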
Results
| Task | Collection Method | Notes |
|---|---|---|
| 6 dexterous tasks | Human mocap, 30 min | No teleoperation required |
| Bimanual task | Human mocap, 30 min | Fully autonomous rollout |
| Tea preparation (HIL) | 1 hr mocap + 30 HIL corrections | Post-finetune improvement |
| Scissor cutting (HIL) | 1 hr mocap + 30 HIL corrections | Post-finetune improvement |
Throughput: ~3x faster than teleoperation, close to natural human motion speed.
Vision vs. EMF: EMF-based finger tracking significantly outperforms vision-based tracking (VR headset) on heavy-occlusion grasps.
Limitations
- Author-stated: ~40-minute battery life limits continuous collection sessions
- Author-stated: requires specialized LEAP Hand for policy deployment; not hardware-agnostic
- Unstated: EMF mocap gloves are expensive (~$3000-5000) compared to vision-only systems like AnyTeleop; the “portable” framing holds relative to lab mocap setups, not relative to consumer VR
Reproducibility
- Code: available at dex-cap.github.io; LEAP Hand is open-source
- Datasets: collected datasets available; uses standard RGB-D sensors
- Compute: diffusion policy training; single or multi-GPU
Insights
The key design decision is collecting data in “human space” rather than “robot space.” By letting the operator perform the task naturally, demonstration quality approaches what a human would naturally do — not what a human can do while mentally translating to robot controls. The 3x throughput advantage over teleoperation follows directly from eliminating the translation overhead. The quick-release camera swap is an underappreciated detail: it eliminates the visual domain gap between demonstration and deployment without any domain adaptation.
Connections
- AnyTeleop: Vision-Based Dexterous Teleoperation
- OPEN TEACH: Versatile Teleoperation System
- diffusion policy
- fingertip inverse kinematics
- SLAM-based tracking
Raw Excerpt
DexCap offers precise, occlusion-resistant tracking of wrist and finger motions based on SLAM and electromagnetic field together with 3D observations of the environment… DexCap is about three times faster than teleoperation in data collection throughput and is close to the level of natural human motion.