Summary

Hoi! (ETH Zurich, TU Munich, University of Freiburg) is a multimodal dataset of 3048 sequences across 381 articulated objects (drawers, doors, fridges, dishwashers) in 38 indoor environments. Each sequence couples vision, force/tactile sensing, and depth across four embodiments (human hand, human hand with wrist camera, UMI gripper, Hoi! gripper). It is designed to bridge the gap between human-centric activity datasets and robotics manipulation datasets.

Prerequisites

  • Articulated object manipulation — dataset targets furniture-scale articulation (joint parameters, opening angles); familiarity with URDF/kinematic representations helps
  • Force sensing — the Hoi! gripper provides force-torque and tactile sensing; understanding these modalities is needed to use the force annotations
  • Cross-view / embodiment transfer — key research question is whether human demonstrations transfer to robot embodiments
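As background for the articulation prerequisite, the two joint types that matter for furniture can be sketched in a few lines: a revolute joint (door, fridge) is described by an opening angle about a hinge axis, and a prismatic joint (drawer) by a displacement along a slide axis. The function names, argument layout, and the idea of tracking a single handle point are illustrative assumptions, not the dataset's annotation format.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _normalize(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)

def revolute_opening_angle(axis_point, axis_dir, p_closed, p_open):
    """Signed opening angle in degrees of a handle point rotating about a hinge.

    Hypothetical helper: the hinge is given by a point on the axis and a
    direction; the handle point is observed in the closed and open states.
    """
    k = _normalize(axis_dir)

    def radial(p):
        # Component of (p - axis_point) perpendicular to the hinge axis.
        v = _sub(p, axis_point)
        along = _dot(v, k)
        return _sub(v, tuple(along * c for c in k))

    r0, r1 = radial(p_closed), radial(p_open)
    # atan2 of the sine-like and cosine-like terms gives a signed angle.
    return math.degrees(math.atan2(_dot(k, _cross(r0, r1)), _dot(r0, r1)))

def prismatic_displacement(axis_dir, p_closed, p_open):
    """Signed travel of a handle point along a slide axis (same units as input)."""
    return _dot(_sub(p_open, p_closed), _normalize(axis_dir))
```

For example, a vertical hinge through the origin with the handle moving from (1, 0, 0) to (0, 1, 0) corresponds to a 90 degree opening, and any sideways jitter of a drawer handle is ignored by the projection onto the slide axis.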

Core Idea

Existing datasets for articulation understanding use either static scans (no interaction data) or simulated environments (no real-world transfer). Robotics manipulation datasets target short-horizon primitives. Hoi! fills this gap with real-world multi-embodiment interaction data for articulated furniture, enabling research on: force-from-vision prediction, articulation state tracking, cross-view transfer (egocentric human → exocentric robot), and interaction re-targeting (human hand → robotic gripper).

Results

  • 3048 sequences, 381 objects, 38 environments
  • Four embodiments: human hand, human with wrist camera, UMI gripper, Hoi! custom gripper
  • Synchronized RGB, depth, force-torque, tactile, multi-view video
  • Annotations: articulation parameters (opening angles, displacements, peak forces), scene-level ground truth
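To make "synchronized" streams and the "peak forces" annotation concrete, here is a minimal sketch of how a force-torque stream could be aligned to video frames by nearest timestamp, and how a peak force magnitude could be read off. The `WrenchSample` layout, the units, and all function names are hypothetical; this summary does not specify the dataset's actual file format or synchronization method.

```python
import bisect
import math
from dataclasses import dataclass

@dataclass
class WrenchSample:
    """One force-torque reading (hypothetical layout, not the dataset schema)."""
    t: float        # timestamp in seconds
    force: tuple    # (fx, fy, fz), assumed to be in newtons
    torque: tuple   # (tx, ty, tz), assumed to be in newton-metres

def nearest_index(sorted_times, t):
    """Index of the entry in an ascending, non-empty list closest to t."""
    if not sorted_times:
        raise ValueError("sorted_times must not be empty")
    i = bisect.bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    # Ties go to the earlier sample.
    return i if (after - t) < (t - before) else i - 1

def align_to_frames(frame_times, samples):
    """For each video frame time, pick the wrench sample closest in time."""
    times = [s.t for s in samples]
    return [samples[nearest_index(times, ft)] for ft in frame_times]

def peak_force(samples):
    """Largest force magnitude over a sequence."""
    return max(math.sqrt(sum(c * c for c in s.force)) for s in samples)
```

Nearest-timestamp matching is only one option; interpolating the force signal between samples would be a reasonable alternative when the sensor rate is close to the frame rate.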

Limitations

  • Author-stated: focuses on articulated furniture; doesn’t cover small object manipulation or tool use
  • Unstated: 381 objects in 38 environments may not capture full distribution of real-world furniture; indoor-only
  • Unstated: the 4-embodiment setup adds collection complexity; dataset scale (3048 sequences) modest compared to internet-scale video datasets

Reproducibility

  • Code/Data: website linked in paper; data and benchmarks available
  • Compute: standard computer vision / robotics training infrastructure
  • Benchmarks: force-from-vision, articulation estimation, cross-view transfer, state-change prediction
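This summary does not record which metrics the benchmarks use. As a rough illustration of how two of the tasks could be scored, the sketch below computes a mean absolute error for force-from-vision and a sign-invariant angular error for an estimated joint axis. Both metric choices and both function names are assumptions made here, not the paper's evaluation protocol.

```python
import math

def force_mae(predicted, measured):
    """Mean absolute error between predicted and measured force magnitudes."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)

def axis_angle_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two joint axes, ignoring axis direction sign.

    An axis and its negation describe the same hinge line, so the absolute
    value of the cosine is used.
    """
    def normalize(v):
        n = math.sqrt(sum(c * c for c in v))
        return tuple(c / n for c in v)

    p, g = normalize(pred_axis), normalize(gt_axis)
    cos_angle = abs(sum(a * b for a, b in zip(p, g)))
    # Clamp to guard against floating-point overshoot before acos.
    return math.degrees(math.acos(min(1.0, cos_angle)))
```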

Insights

The cross-embodiment data collection (same object, four embodiments) is the dataset’s most distinctive contribution. Most robotics datasets capture only robot demonstrations; having matched human-hand data enables training on human demonstrations (abundant, natural) and transferring to robot execution (the ultimate goal of imitation learning from human video). The inclusion of synchronized force sensing alongside video is rare and enables research into grounded manipulation understanding that pure vision-based approaches cannot address.
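The matched-embodiment idea can be sketched as a simple grouping step: collect sequences by object, then pair every hand sequence with every gripper sequence of the same object as candidate source/target examples for transfer. The record layout `(sequence_id, object_id, embodiment)`, the embodiment labels, and the function name are hypothetical and not taken from the dataset's metadata.

```python
from collections import defaultdict
from itertools import product

# Hypothetical embodiment labels, grouped as in the summary above.
HAND = {"human_hand", "human_wrist_camera"}
GRIPPER = {"umi_gripper", "hoi_gripper"}

def cross_embodiment_pairs(records):
    """Pair hand and gripper sequences that interact with the same object.

    records: iterable of (sequence_id, object_id, embodiment) tuples.
    Returns a list of (object_id, hand_sequence_id, gripper_sequence_id).
    """
    by_object = defaultdict(lambda: {"hand": [], "gripper": []})
    for seq_id, obj_id, embodiment in records:
        if embodiment in HAND:
            by_object[obj_id]["hand"].append(seq_id)
        elif embodiment in GRIPPER:
            by_object[obj_id]["gripper"].append(seq_id)

    pairs = []
    for obj_id, groups in by_object.items():
        # Objects lacking either side contribute no pairs.
        for hand_seq, gripper_seq in product(groups["hand"], groups["gripper"]):
            pairs.append((obj_id, hand_seq, gripper_seq))
    return pairs
```

A usage sketch with made-up identifiers:

```python
records = [
    ("s1", "drawer_01", "human_hand"),
    ("s2", "drawer_01", "hoi_gripper"),
    ("s3", "door_07", "human_hand"),
]
print(cross_embodiment_pairs(records))
```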

Connections

Raw Excerpt

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction.