Summary

GR-Dexter is ByteDance Research’s integrated framework for training a 4B-parameter Mixture-of-Transformer vision-language-action (VLA) model to control bimanual robots equipped with 21-DOF dexterous hands. The system co-trains on web-scale vision-language data, cross-embodiment robot datasets, and human VR teleoperation trajectories, achieving 97% success on in-distribution long-horizon tasks and 89% out-of-distribution — a 25-point improvement over the baseline.
Key Points

  • ByteDexter V2 Hand: 21 DOF (4 per finger, 5 for thumb), 219mm × 108mm, piezoresistive tactile arrays on fingertips
  • Teleoperation stack: Meta Quest VR headset (wrist pose tracking) + Manus Metagloves (finger motion capture) + foot pedals (arm control inputs) for two Franka arms
  • Architecture: 4B MoT (Mixture-of-Transformer), inputs are language + visual observations + robot state, outputs are action chunks covering arm joints, end-effector poses, hand joints, and fingertip positions
  • Training data: three co-training sources — web vision-language, cross-embodiment (Fourier ActionNet, OpenLoong, RoboMIND), human VR trajectories with temporal consistency filtering
  • Out-of-distribution long-horizon: 89% vs 64% baseline — cross-embodiment data is the key driver of generalization
  • Pick-and-place generalization: 85% success on unseen objects, 83% on unseen instructions
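
The action-chunk output described above spans four heterogeneous modalities. A minimal sketch of one plausible layout, assuming two 7-DOF arms, two 21-DOF hands, xyz + quaternion wrist poses, five fingertips per hand, and a chunk horizon of 16 steps (all dimensions are illustrative assumptions, not published values):

```python
from dataclasses import dataclass
import numpy as np

H = 16  # assumed chunk horizon (steps per predicted action chunk)

@dataclass
class ActionChunk:
    """Hypothetical per-chunk action layout; field widths are assumptions."""
    arm_joints: np.ndarray     # (H, 14)  two 7-DOF arms
    ee_poses: np.ndarray       # (H, 14)  two wrists: xyz + quaternion each
    hand_joints: np.ndarray    # (H, 42)  two 21-DOF ByteDexter V2 hands
    fingertip_pos: np.ndarray  # (H, 30)  2 hands x 5 fingertips x xyz

    def flat(self) -> np.ndarray:
        """Concatenate all modalities into a single (H, 100) action array."""
        return np.concatenate(
            [self.arm_joints, self.ee_poses, self.hand_joints, self.fingertip_pos],
            axis=-1,
        )

chunk = ActionChunk(
    arm_joints=np.zeros((H, 14)),
    ee_poses=np.zeros((H, 14)),
    hand_joints=np.zeros((H, 42)),
    fingertip_pos=np.zeros((H, 30)),
)
print(chunk.flat().shape)  # (16, 100)
```

The point of the sketch is the heterogeneity: arm joints, wrist poses, hand joints, and fingertip positions live in different spaces but are emitted together per chunk.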

Insights

  • The 21-DOF hand (vs. the typical 2-finger gripper) brings the hardware close to human hand morphology — the prerequisite for manipulation tasks that demand finger-level dexterity, such as tool use or assembly
  • Temporal consistency filtering of VR trajectories is a quiet but important detail: human demonstrations are noisy and include pauses and hesitations; filtering enforces action smoothness, which is critical for imitation learning
  • The 25-point OOD improvement from cross-embodiment co-training (0.89 vs 0.64) echoes the finding from the ICLR 2026 VLA survey that data diversity matters more than architecture for generalization — the Mixture-of-Transformer is less novel than the data curation strategy
  • Using foot pedals for arm control so the hands stay free for Manus glove tracking is an elegant biomechanics workaround: an operator has only two hands but must command two arms and two dexterous hands simultaneously
  • The MoT architecture is well-suited for multi-modal, multi-output action spaces: routing different tokens to specialized experts aligns naturally with the heterogeneous outputs (arm joints vs. finger joints have very different dynamics)
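
The paper does not publish the exact temporal-consistency filter, but one plausible version of the idea is a two-part check: reject demonstrations with long near-zero-velocity pauses, and reject ones whose motion is too jerky. The function name and all thresholds below are illustrative assumptions:

```python
import numpy as np

def passes_temporal_consistency(traj: np.ndarray,
                                dt: float = 1 / 30,
                                pause_vel: float = 1e-3,
                                max_pause_steps: int = 15,
                                max_mean_sq_jerk: float = 1e4) -> bool:
    """Hedged sketch of a VR-demo filter (not the paper's actual method).

    traj: (T, D) array of joint or end-effector positions sampled at 1/dt Hz.
    Keeps a trajectory only if it has no long hesitation pause and its
    mean squared jerk (third finite difference) is below a threshold.
    """
    vel = np.diff(traj, axis=0) / dt
    speed = np.linalg.norm(vel, axis=1)

    # (a) reject long near-zero-velocity runs (operator hesitations)
    longest_pause, run = 0, 0
    for s in speed:
        run = run + 1 if s < pause_vel else 0
        longest_pause = max(longest_pause, run)
    if longest_pause > max_pause_steps:
        return False

    # (b) reject jerky, non-smooth motion
    jerk = np.diff(traj, n=3, axis=0) / dt**3
    return float(np.mean(jerk ** 2)) <= max_mean_sq_jerk
```

A steady linear reach passes both checks; the same motion preceded by a one-second freeze fails the pause check, which is the hesitation case the bullet above describes.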

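The expert-routing intuition can be sketched with hard per-modality routing, where each token is dispatched to the expert matching its modality tag. Everything here is illustrative (the experts are stand-in linear maps, not full transformer blocks, and shared attention is omitted); GR-Dexter's internals are not public at this level of detail:

```python
import numpy as np

D = 8  # assumed token width

def make_expert(seed: int):
    """Stand-in 'expert': a distinct linear map in place of a full FFN block."""
    w = np.random.default_rng(seed).standard_normal((D, D)) / np.sqrt(D)
    return lambda x: x @ w

# One expert per input/output modality (tags are illustrative)
experts = {"vision": make_expert(0), "language": make_expert(1),
           "state": make_expert(2), "action": make_expert(3)}

def mot_layer(tokens: np.ndarray, modalities: list) -> np.ndarray:
    """Route each token through the expert matching its modality tag."""
    out = np.empty_like(tokens)
    for i, m in enumerate(modalities):
        out[i] = experts[m](tokens[i])
    return out

toks = np.ones((4, D))
mods = ["vision", "language", "state", "action"]
print(mot_layer(toks, mods).shape)  # (4, 8)
```

Even on identical input tokens, different experts produce different outputs, which is the property that lets arm-joint and finger-joint tokens specialize despite sharing one backbone.
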
Connections

Raw Excerpt

Cross-embodiment data significantly improves generalization to novel scenarios while maintaining strong in-domain performance.