Hoi! A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

This note was generated by AI analysis.

Created: 2026-03-28. Source: https://arxiv.org/abs/2512.04884

Summary

Hoi! (ETH Zurich, TU Munich, University of Freiburg) is a multimodal dataset of 3048 sequences covering 381 articulated objects (drawers, doors, fridges, dishwashers) in 38 indoor environments. Each sequence couples vision, force/tactile sensing, and depth, and the data spans four embodiments (human hand, human hand with wrist camera, UMI gripper, Hoi! gripper). The dataset is designed to bridge the gap between human-centric activity datasets and robotics manipulation datasets.

Prerequisites

Articulated object manipulation: the dataset targets furniture-scale articulation (joint parameters, opening angles), so familiarity with URDF/kinematic representations helps.
Force sensing: the Hoi! gripper provides force-torque and tactile sensing, and understanding these modalities is needed to use the force annotations.
Cross-view / embodiment transfer: a key research question is whether human demonstrations transfer to robot embodiments.
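
For readers new to kinematic representations, the two joint types most relevant to furniture can be sketched as follows: a revolute joint (door, fridge) is described by a hinge axis and an opening angle, and a prismatic joint (drawer) by a sliding axis and a displacement. This is a minimal illustration, not the dataset's annotation format; the function names, hinge placement, and handle positions are all hypothetical.

```python
import numpy as np

def rotate_about_axis(point, origin, axis, angle_rad):
    """Revolute joint: rotate `point` about the line through `origin`
    with direction `axis`, using Rodrigues' rotation formula."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    o = np.asarray(origin, dtype=float)
    v = np.asarray(point, dtype=float) - o
    v_rot = (v * np.cos(angle_rad)
             + np.cross(k, v) * np.sin(angle_rad)
             + k * np.dot(k, v) * (1.0 - np.cos(angle_rad)))
    return v_rot + o

def translate_along_axis(point, axis, displacement_m):
    """Prismatic joint: slide `point` by `displacement_m` along `axis`."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    return np.asarray(point, dtype=float) + displacement_m * k

# Hypothetical door: vertical hinge through the origin, handle 0.6 m away.
door_handle_open = rotate_about_axis(
    point=[0.6, 0.0, 1.0], origin=[0.0, 0.0, 0.0],
    axis=[0.0, 0.0, 1.0], angle_rad=np.deg2rad(90.0))

# Hypothetical drawer: pulled 0.3 m along the negative y direction.
drawer_handle_open = translate_along_axis(
    point=[0.0, 0.0, 0.5], axis=[0.0, -1.0, 0.0], displacement_m=0.3)
```

Given a joint axis and a state (angle or displacement), the same two functions give the position of any point on the moving part, which is the basic operation behind articulation state tracking.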

Core Idea

Existing datasets for articulation understanding use either static scans (no interaction data) or simulated environments (no real-world transfer). Robotics manipulation datasets target short-horizon primitives. Hoi! fills this gap with real-world multi-embodiment interaction data for articulated furniture, enabling research on: force-from-vision prediction, articulation state tracking, cross-view transfer (egocentric human → exocentric robot), and interaction re-targeting (human hand → robotic gripper).

Results

3048 sequences, 381 objects, 38 environments
Four embodiments: human hand, human with wrist camera, UMI gripper, Hoi! custom gripper
Synchronized RGB, depth, force-torque, tactile, multi-view video
Annotations: articulation parameters (opening angles, displacements, peak forces), scene-level ground truth
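
Using synchronized streams typically means pairing each video frame with the sensor reading closest in time, since force-torque sensors usually sample faster than cameras. The sketch below shows one common way to do that with nearest-timestamp matching. It assumes sorted timestamps in seconds and at least two sensor samples; the function name, sampling rates, and array layout are hypothetical and do not describe the dataset's actual file format or loader.

```python
import numpy as np

def align_to_frames(frame_times, sensor_times, sensor_values):
    """Return, for each frame time, the sensor sample nearest in time.

    Assumes `sensor_times` is sorted ascending and has at least two entries.
    `sensor_values` may be 1-D (e.g. force magnitude) or 2-D (e.g. N x 6 wrench).
    """
    frame_times = np.asarray(frame_times, dtype=float)
    sensor_times = np.asarray(sensor_times, dtype=float)
    sensor_values = np.asarray(sensor_values, dtype=float)

    # Index of the first sensor sample at or after each frame time.
    right = np.searchsorted(sensor_times, frame_times, side="left")
    right = np.clip(right, 1, len(sensor_times) - 1)
    left = right - 1

    # Pick whichever neighbour is closer in time (ties go to the earlier one).
    choose_right = (sensor_times[right] - frame_times) < (frame_times - sensor_times[left])
    idx = np.where(choose_right, right, left)
    return sensor_values[idx]

# Hypothetical example: ~30 Hz frames, 100 Hz force magnitude in newtons.
frame_times = [0.0, 0.033, 0.066]
sensor_times = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
force_magnitude = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

per_frame_force = align_to_frames(frame_times, sensor_times, force_magnitude)
peak_force = float(np.max(force_magnitude))
```

Nearest-neighbour matching keeps the raw readings intact. Linear interpolation is an alternative when smooth per-frame values matter more than preserving measured samples.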

Limitations

Author-stated: focuses on articulated furniture; doesn’t cover small object manipulation or tool use
Unstated: 381 objects in 38 environments may not capture the full distribution of real-world furniture, and the data is indoor-only
Unstated: the four-embodiment setup adds collection complexity, and the dataset scale (3048 sequences) is modest compared to internet-scale video datasets
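
To put the scale in perspective, the headline counts can be turned into rough averages. The numbers below are simple divisions of the figures reported above; the source does not say that sequences are evenly distributed across objects or embodiments, so these are averages only.

```python
num_sequences = 3048
num_objects = 381
num_environments = 38
num_embodiments = 4

# Average number of recorded sequences per object.
sequences_per_object = num_sequences / num_objects

# Average number of objects per environment (about 10).
objects_per_environment = num_objects / num_environments

# Average sequences per object per embodiment, assuming an even split
# (an assumption, not something the source states).
sequences_per_object_per_embodiment = sequences_per_object / num_embodiments

print(sequences_per_object)                 # 8.0
print(round(objects_per_environment, 2))    # 10.03
print(sequences_per_object_per_embodiment)  # 2.0
```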

Reproducibility

Code/Data: website linked in paper; data and benchmarks available
Compute: standard computer vision / robotics training infrastructure
Benchmarks: force-from-vision, articulation estimation, cross-view transfer, state-change prediction
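
As an illustration of what evaluating a force-from-vision model could look like, the sketch below compares a predicted force trace against a measured one. The metrics chosen here (mean absolute error and peak-force error) are assumptions for illustration; this note does not state which metrics the paper's benchmarks use, and the function name and sample values are hypothetical.

```python
import numpy as np

def force_prediction_errors(pred_force_n, true_force_n):
    """Compare two force traces (newtons) sampled at the same time steps.

    Returns (mean absolute error, absolute difference between peak forces).
    """
    pred = np.asarray(pred_force_n, dtype=float)
    true = np.asarray(true_force_n, dtype=float)
    if pred.shape != true.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    mae = float(np.mean(np.abs(pred - true)))
    peak_error = float(abs(pred.max() - true.max()))
    return mae, peak_error

# Hypothetical traces for one door-opening sequence.
measured = [0.0, 5.0, 10.0, 5.0]
predicted = [1.0, 4.0, 12.0, 5.0]

mae, peak_error = force_prediction_errors(predicted, measured)
```

Reporting peak-force error alongside the average is one way to connect predictions to the peak-force annotations listed under Results.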

Insights

The cross-embodiment data collection (same object, four embodiments) is the dataset’s most distinctive contribution. Most robotics datasets capture only robot demonstrations; having matched human-hand data enables training on human demonstrations (abundant, natural) and transferring to robot execution (the ultimate goal of imitation learning from human video). The inclusion of synchronized force sensing alongside video is rare and enables research into grounded manipulation understanding that pure vision-based approaches cannot address.

Connections

Raw Excerpt

We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction.
