Summary

Open-TeleVision is an immersive teleoperation system for humanoid robots that streams stereoscopic video from robot-mounted head cameras to a VR headset, giving the operator a first-person view from the robot’s own vantage point. The operator’s arm and hand movements are captured and retargeted to the robot in real time. The system is validated on four complex manipulation tasks across two humanoid platforms, and demonstration data collected with it is used to train high-performing imitation learning policies. The system is fully open-source.

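Open-TeleVision is an immersive teleoperation system for humanoid robots. It streams video from the stereo cameras on the robot’s head to a VR headset, so the operator observes and controls the robot from a first-person view. It was validated on four complex manipulation tasks on two different humanoid platforms and is fully open-source.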

Prerequisites

  • VR/XR headset technology — stereoscopic rendering, head tracking, and latency requirements for immersive telepresence
  • Humanoid robot kinematics — arm + hand retargeting from human to robot requires understanding of joint correspondence and workspace mapping
  • Imitation learning pipeline — the paper collects demonstration data for downstream BC/diffusion policy training; understanding this pipeline contextualizes the motivation
  • Camera calibration and stereo vision — streaming calibrated stereo video from a moving robot head involves nontrivial camera setup
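The workspace mapping mentioned under humanoid robot kinematics can be illustrated with a minimal sketch. It is not the system's actual retargeting code: the function name `retarget_wrist`, the shoulder-relative scaling, and the box-shaped reachable region are all assumptions chosen for illustration.

```python
def retarget_wrist(human_wrist, human_shoulder, robot_shoulder,
                   scale, box_lo, box_hi):
    """Map a human wrist position (x, y, z in metres) to a robot wrist target.

    Assumed scheme (hypothetical, for illustration only):
      1. express the wrist relative to the human shoulder,
      2. scale by the ratio of robot to human arm length,
      3. re-anchor the result at the robot shoulder,
      4. clamp each axis to an axis-aligned box the robot can reach.
    """
    target = []
    for axis in range(3):
        offset = human_wrist[axis] - human_shoulder[axis]
        value = robot_shoulder[axis] + scale * offset
        value = max(box_lo[axis], min(box_hi[axis], value))
        target.append(value)
    return tuple(target)


# Example with made-up numbers: the scaled reach on x exceeds the box,
# so that axis is clamped to the box limit.
target = retarget_wrist(
    human_wrist=(0.5, 0.2, 1.4),
    human_shoulder=(0.0, 0.2, 1.4),
    robot_shoulder=(0.0, 0.15, 1.0),
    scale=0.8,
    box_lo=(-0.1, -0.5, 0.5),
    box_hi=(0.35, 0.5, 1.5),
)
print(target)
```

A real system would also retarget wrist orientation and finger joints and would solve inverse kinematics for the arm. The sketch covers only the position mapping, which is where a mismatch between human and robot reach shows up first.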

Core Idea

Active visual feedback is the distinguishing design choice: the operator can control the robot’s head orientation to redirect their gaze while teleoperating, mirroring how humans naturally coordinate gaze and manipulation. Standard teleoperation systems use fixed cameras, forcing operators to work with suboptimal viewpoints. Open-TeleVision lets the operator look where they’re working — toward the object being manipulated — which produces more naturalistic demonstrations and reduces errors caused by occlusion. The result is higher-quality IL training data because the operator’s sensory experience matches what the robot will experience during autonomous deployment.
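To make the head-control loop concrete, the sketch below converts a headset orientation into neck joint targets. It is an illustration under stated assumptions rather than the system's implementation: the headset pose is taken to be a unit quaternion in a z-up, x-forward frame, the neck is assumed to have only yaw and pitch joints, and the joint limits are invented values.

```python
import math

# Hypothetical neck joint limits in radians; real limits depend on the robot.
YAW_LIMIT = math.radians(60.0)
PITCH_LIMIT = math.radians(35.0)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def head_quat_to_neck_command(w, x, y, z):
    """Convert a unit quaternion (w, x, y, z) into (yaw, pitch) neck targets.

    Uses the ZYX (yaw, pitch, roll) Euler convention. Roll is discarded
    because the assumed neck has two degrees of freedom.
    """
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    # Guard asin against arguments slightly outside [-1, 1] from rounding.
    sin_pitch = clamp(2.0 * (w * y - z * x), -1.0, 1.0)
    pitch = math.asin(sin_pitch)
    return (clamp(yaw, -YAW_LIMIT, YAW_LIMIT),
            clamp(pitch, -PITCH_LIMIT, PITCH_LIMIT))


# Operator turns the head 90 degrees about the vertical axis; the yaw
# command saturates at the assumed joint limit.
half = math.radians(90.0) / 2.0
yaw, pitch = head_quat_to_neck_command(math.cos(half), 0.0, 0.0, math.sin(half))
print(math.degrees(yaw), math.degrees(pitch))
```

Clamping at the joint limit means the operator's view stops following the head beyond that angle. A practical system would need to handle this mismatch in some way, which the source does not describe.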

Results

Task            Platform     Notes
Can Sorting     Humanoid A   Contact-rich sorting
Can Insertion   Humanoid A   Precision insertion
Folding         Humanoid B   Deformable object
Unloading       Humanoid B   Multi-step manipulation

  • Evaluated on 2 different humanoid robot platforms — same system without hardware-specific re-engineering
  • Downstream IL policies trained on collected data showed strong task success rates
  • Open-source release includes full system code
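As a reminder of what training a policy on demonstration data means in its simplest form, here is a toy behavior cloning sketch. It is not the paper's training setup: the one-dimensional linear policy, the synthetic demonstrations, and the name `train_bc` are assumptions for illustration. Real policies map images and proprioception to action sequences with neural networks.

```python
def train_bc(demos, lr=0.5, epochs=500):
    """Fit action = w * obs + b to (obs, action) pairs.

    Behavior cloning treats demonstrations as a supervised dataset; here
    the loss is mean squared error, minimized by plain gradient descent.
    """
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for obs, act in demos:
            err = (w * obs + b) - act
            grad_w += 2.0 * err * obs / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic "operator" whose action is a fixed linear function of the
# observation; the fitted policy should recover roughly the same mapping.
demos = [(o, 0.5 * o + 0.1) for o in (0.0, 0.25, 0.5, 0.75, 1.0)]
w, b = train_bc(demos)
print(w, b)
```

The sketch also shows why demonstration quality matters: the policy can only reproduce the mapping present in the data, so observations the operator never saw clearly are observations the policy learns from poorly.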

Limitations

  • Author-stated: evaluated on 4 specific tasks; generalization to diverse manipulation tasks not fully characterized
  • Unstated: immersive VR setups may cause motion sickness during extended teleoperation sessions; operator fatigue is not studied; VR headset cost (roughly $500 to $3,000) is not discussed

Reproducibility

  • Code: fully open-source (GitHub)
  • Datasets: demonstration datasets available for the 4 evaluated tasks
  • Compute: policy training uses standard IL setups (GPU required)

Insights

The “active” in active visual feedback is underappreciated. Most teleoperation research treats the camera as infrastructure and focuses on motor control. Open-TeleVision treats the camera as part of the operator’s sensory apparatus, something that should be actively controlled to enable skilled manipulation. This insight plausibly applies beyond VR: a teleoperation system that gives operators more control over their viewpoint is likely to produce better demonstrations.

Raw Excerpt

An immersive teleoperation system that allows operators to actively perceive a robot’s surroundings in a stereoscopic manner. The system replicates the operator’s arm and hand movements onto the robot — “the operator’s mind is transmitted to a robot embodiment.”