Summary

Current robot learning systems suffer from representation misalignment: the features and abstractions a robot learns to represent its world do not match what humans actually care about. This paper by Bobu et al. formalizes this problem mathematically, argues that representation alignment must be treated as an explicit objective alongside task learning, and reframes existing methods (reward learning, inverse reinforcement learning (IRL), preference learning) through this lens.

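Humans and robots represent the world in fundamentally different ways, so the features a robot learns fail to capture the concepts humans actually care about. The paper proposes a mathematical framework that formally defines the “representation alignment” problem, analyzes existing reward learning and preference learning methods within that single framework, and argues that representation alignment should be an explicit objective on par with task learning.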

Prerequisites

  • Reward learning / IRL — the paper reinterprets these as implicit representation alignment methods, so understanding the basics of learning reward functions from demonstrations is essential.
  • Feature representations in RL — the paper’s core argument hinges on the gap between state features a robot learns and the latent features humans use; familiarity with representation learning helps.
  • Human-robot interaction (HRI) feedback models — the paper frames human corrections and preference queries as mechanisms for closing the representation gap, not just for reward shaping.

Core Idea

The key insight is that misalignment is not a task-learning failure but a representation-level failure: even a perfectly trained reward function over misaligned features will produce wrong behavior. The authors propose that robot learning pipelines should explicitly optimize for alignment between human and robot feature spaces, potentially via dedicated alignment queries or interactive probing, before or alongside learning the task. This reframes human feedback not as “teaching the robot what to do” but as “teaching the robot what to notice.”
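
The claim that a perfectly fitted reward over misaligned features still yields wrong behavior can be illustrated with a toy sketch. Everything in it is a hypothetical assumption made for illustration and does not come from the paper: the three trajectories, the feature names efficiency and laptop_clearance, the human weights, and the helper functions. The human reward depends on both features; one robot fits a least-squares reward using only efficiency, the other uses both.

```python
# Toy illustration only: trajectories, feature names, and weights are
# hypothetical assumptions, not taken from the paper.

def solve_linear_system(a, b):
    """Solve a x = b for a small square system by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def fit_linear_reward(feature_rows, rewards):
    """Least-squares weights for reward = w . phi, via the normal equations."""
    d = len(feature_rows[0])
    gram = [[sum(row[i] * row[j] for row in feature_rows) for j in range(d)]
            for i in range(d)]
    moment = [sum(row[i] * r for row, r in zip(feature_rows, rewards))
              for i in range(d)]
    return solve_linear_system(gram, moment)


# Each trajectory is described in the human's feature space.
TRAJECTORIES = {
    "direct_over_laptop": {"efficiency": 1.0, "laptop_clearance": 0.0},
    "detour_around_laptop": {"efficiency": 0.6, "laptop_clearance": 1.0},
    "long_detour": {"efficiency": 0.2, "laptop_clearance": 1.0},
}
HUMAN_WEIGHTS = {"efficiency": 1.0, "laptop_clearance": 2.0}


def human_reward(features):
    """The human's true reward, linear in the human's own features."""
    return sum(HUMAN_WEIGHTS[name] * value for name, value in features.items())


def robot_choice(robot_feature_names):
    """Fit a reward on noiseless human labels, then pick the best trajectory."""
    names = list(TRAJECTORIES)
    rows = [[TRAJECTORIES[n][f] for f in robot_feature_names] for n in names]
    labels = [human_reward(TRAJECTORIES[n]) for n in names]
    weights = fit_linear_reward(rows, labels)
    predicted = {n: sum(w * x for w, x in zip(weights, row))
                 for n, row in zip(names, rows)}
    return max(predicted, key=predicted.get)


human_choice = max(TRAJECTORIES, key=lambda n: human_reward(TRAJECTORIES[n]))
misaligned_choice = robot_choice(["efficiency"])
aligned_choice = robot_choice(["efficiency", "laptop_clearance"])

print("human prefers:   ", human_choice)
print("misaligned robot:", misaligned_choice)
print("aligned robot:   ", aligned_choice)
```

In this sketch both robots receive the same noiseless reward labels and both fit them as well as their features allow, yet only the robot whose feature set contains laptop_clearance selects the trajectory the human prefers. A dropped feature is only the simplest case of misalignment; the paper's notion is broader than this example.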

Results

The paper is primarily a framework/position paper (14 pages, 3 figures, 1 table). No novel benchmark results are reported. The contribution is conceptual: a unified mathematical formulation and a taxonomy of existing methods under the alignment lens.

Contribution | Description
Formal definition | Mathematical framework for representation misalignment
Method taxonomy | Existing IRL/preference methods classified by alignment mechanism
Research agenda | Open problems and future directions for explicit alignment

Limitations

  • Author-stated: The framework is theoretical; no new algorithms or empirical results are presented.
  • Unstated: The paper assumes human representations are stable and well-defined — in practice human mental models are noisy and context-dependent, which complicates the alignment target.

Reproducibility

  • Code: Not applicable (theoretical paper)
  • Datasets: Not applicable
  • Compute: Not applicable

Insights

  • Treating representation alignment as a first-class objective is a meaningful reframing: most HRI/reward-learning work implicitly assumes shared representations, and this paper makes the assumption explicit and attackable.
  • The connection to vision-language-action (VLA) models is direct: VLAs inherit visual features from pretraining that may not align with task-relevant human concepts, so the framework applies to VLA alignment as well.
  • This positions human feedback not as a training signal for values, but as a probe for representation gaps — a subtle but important distinction for active learning system design.

Connections

Raw Excerpt

“current learning approaches suffer from representation misalignment, where the robot’s learned representation does not capture the human’s representation. We argue that representation alignment should be explicitly prioritized alongside task learning.”