Summary
GR-Dexter is ByteDance Research's integrated framework for training a 4-billion-parameter Mixture-of-Transformers (MoT) vision-language-action (VLA) model to control bimanual robots equipped with 21-DOF dexterous hands. The system co-trains on web-scale vision-language data, cross-embodiment robot datasets, and human VR teleoperation trajectories, achieving 97% success on in-distribution long-horizon tasks and 89% on out-of-distribution ones, a 25-point improvement over the baseline.
Key Points
- ByteDexter V2 Hand: 21 DOF (4 per finger, 5 for the thumb), 219 mm × 108 mm, piezoresistive tactile arrays on the fingertips
- Teleoperation stack: Meta Quest VR (wrist) + Manus Metagloves (finger capture) + foot pedals (arm control) for two Franka arms
- Architecture: 4B Mixture-of-Transformers (MoT); inputs are language instructions, visual observations, and robot state; outputs are action chunks covering arm joints, end-effector poses, hand joints, and fingertip positions
- Training data: three co-training sources — web vision-language, cross-embodiment (Fourier ActionNet, OpenLoong, RoboMIND), human VR trajectories with temporal consistency filtering
- Out-of-distribution long-horizon: 89% vs 64% baseline — cross-embodiment data is the key driver of generalization
- Pick-and-place: 85% success on unseen objects, 83% on unseen instructions
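The heterogeneous action space in the architecture bullet above can be sketched as a structured chunk. This is a minimal illustration, not the paper's actual interface: the chunk horizon (16), 7-DOF arms, and the pose/fingertip layouts are assumptions; only the 21-DOF hands and the four output groups come from the source.

```python
import numpy as np

CHUNK_LEN = 16  # assumed action-chunk horizon, for illustration only

def make_action_chunk(horizon: int = CHUNK_LEN) -> dict:
    """Build a zero-initialized action chunk covering both arms and hands."""
    return {
        "arm_joints": np.zeros((horizon, 2, 7)),        # two 7-DOF arms (assumed)
        "ee_poses": np.zeros((horizon, 2, 7)),          # position (3) + quaternion (4)
        "hand_joints": np.zeros((horizon, 2, 21)),      # 21-DOF ByteDexter V2 hands
        "fingertip_pos": np.zeros((horizon, 2, 5, 3)),  # xyz per fingertip
    }

chunk = make_action_chunk()
# Action dimensions per timestep across all output groups
total_dims = sum(v[0].size for v in chunk.values())
```

Grouping the outputs this way makes explicit why the action space is heterogeneous: arm joints, poses, and fingertip positions live in different coordinate spaces with very different dynamics.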
Insights
- The 21-DOF hand design (vs. typical 2-finger grippers) brings the hardware close to human hand morphology; this is the prerequisite for manipulation tasks that require finger-level dexterity, such as tool use or assembly
- Temporal consistency filtering on VR trajectories is a quiet but important detail: human demonstrations are noisy and include pauses and hesitations; filtering enforces the action smoothness that imitation learning depends on
- The 25-point OOD improvement from cross-embodiment co-training (0.89 vs 0.64) echoes the finding from the ICLR 2026 VLA survey that data diversity matters more than architecture for generalization; the Mixture-of-Transformers design is less novel than the data curation strategy
- Using foot pedals for arm control frees the operator's hands for Manus glove finger capture, an elegant biomechanics workaround: in bimanual teleoperation, a single operator's two hands would otherwise have to command two robot arms and two robot hands simultaneously
- The MoT architecture is well-suited for multi-modal, multi-output action spaces: routing different tokens to specialized experts aligns naturally with the heterogeneous outputs (arm joints vs. finger joints have very different dynamics)
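The temporal-consistency filtering mentioned above can be sketched as a simple trajectory check. The paper's exact criterion is not given here; this hypothetical version, assuming trajectories are (T, D) joint-position arrays, rejects demonstrations with teleport-like jumps or long operator hesitations. All thresholds are illustrative.

```python
import numpy as np

def passes_temporal_consistency(traj: np.ndarray,
                                max_step: float = 0.5,
                                pause_eps: float = 1e-4,
                                max_pause_steps: int = 30) -> bool:
    """Return True if a trajectory is smooth and free of long hesitations."""
    # Per-step displacement magnitude across all joints
    vel = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    if vel.max(initial=0.0) > max_step:  # reject discontinuous jumps
        return False
    # Find the longest run of near-zero motion (operator hesitation/pause)
    still = vel < pause_eps
    longest, run = 0, 0
    for s in still:
        run = run + 1 if s else 0
        longest = max(longest, run)
    return longest <= max_pause_steps
```

A filter like this would be applied per demonstration before training, so the policy never imitates the operator's pauses as if they were deliberate actions.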
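The expert-routing idea in the last insight can be shown with a toy sketch: tokens carry a modality tag and are processed by that modality's own feed-forward weights (in a full Mixture-of-Transformers layer, attention would be shared across modalities and is omitted here). The modality names, embedding size, and linear-only "experts" are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (assumed, for illustration)

# One weight matrix per modality stands in for per-modality expert FFNs
experts = {m: rng.normal(size=(D, D)) for m in ("vision", "language", "action")}

def routed_ffn(tokens: np.ndarray, modalities: list[str]) -> np.ndarray:
    """Apply each token's modality-specific expert weight matrix."""
    out = np.empty_like(tokens)
    for i, m in enumerate(modalities):
        out[i] = tokens[i] @ experts[m]  # deterministic routing by modality tag
    return out
```

Unlike a learned MoE router, routing here is deterministic by token type, which matches the insight that arm-joint and finger-joint outputs have different dynamics and benefit from specialized parameters.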
Connections
- State of VLA Research at ICLR 2026
- Vision-Language-Action Models
- Robotics
- Embodied AI
- Mixture of Experts
- Teleoperation
- ByteDance Research
Raw Excerpt
Cross-embodiment data significantly improves generalization to novel scenarios while maintaining strong in-domain performance.