Summary

A survey of roughly 200 papers on Vision-and-Language Navigation (VLN), treated as a framework for human-robot collaboration. It identifies three critical gaps in current systems: bidirectional communication (robots cannot ask clarifying questions), ambiguity resolution (unclear instructions cause unrecoverable failures), and multi-agent coordination. It recommends proactive clarification, real-time feedback, and decentralized decision-making.

Key Points

  • VLN: agent interprets natural language instructions + visual input to navigate 3D environments
  • Current systems are unidirectional: humans instruct and robots execute, with no channel for the robot to ask for clarification
  • Ambiguity in instructions is a critical failure mode with no current solution
  • Multi-robot systems lack decentralized coordination frameworks
  • Target domains: healthcare, logistics, disaster response (high-stakes bidirectional scenarios)
  • Recommendations: proactive clarification, real-time feedback, contextual NLU, dynamic role assignment
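The proactive clarification recommendation can be illustrated with a minimal sketch. It assumes the navigation policy exposes a probability distribution over next actions and that the agent asks a question whenever the entropy of that distribution exceeds a threshold. The names `choose_action` and `ask_threshold_bits`, the action labels, and the threshold value are all hypothetical and do not come from the survey.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def choose_action(action_probs, ask_threshold_bits=1.0):
    """Return the most likely action, or "ASK" when the policy is too uncertain.

    action_probs: dict mapping action name -> probability
                  (assumed output of a hypothetical VLN policy).
    ask_threshold_bits: illustrative cutoff, not a value from the survey.
    """
    # Normalize defensively in case the scores do not sum to 1.
    total = sum(action_probs.values())
    probs = {a: p / total for a, p in action_probs.items()}

    if entropy(probs.values()) > ask_threshold_bits:
        return "ASK"  # hand control back to the human for disambiguation
    return max(probs, key=probs.get)


confident = {"forward": 0.90, "left": 0.05, "right": 0.05}
ambiguous = {"forward": 0.34, "left": 0.33, "right": 0.33}

print(choose_action(confident))  # low entropy: the agent acts
print(choose_action(ambiguous))  # high entropy: the agent asks
```

Entropy of the action distribution is only one possible uncertainty signal, and it is only as trustworthy as the policy's calibration.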

Insights

  • The bidirectionality gap is fundamental: HRI implies interaction, but most current systems are essentially command-execution pipelines dressed up as “interaction”. This is more a design problem than a technical one
  • Proactive clarification (the robot asks for disambiguation) requires the robot to have a model of its own uncertainty. This connects to uncertainty quantification in neural networks, which is still an open problem in deep learning
  • The multi-agent gap is interesting in the context of VLAs: GR-Dexter focuses on a single bimanual robot, but real deployment scenarios often involve multiple robots + humans in shared workspaces
  • “Decentralized decision-making with dynamic role assignment” is the multi-agent robotics equivalent of microservices: each agent has local autonomy but coordinates via shared protocols
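One way to picture the "local autonomy, shared protocol" idea from the last bullet is the sketch below. It assumes every agent broadcasts a cost for each role and then runs the same deterministic assignment rule on the shared bids, so all agents reach the same answer without a central coordinator. The `Agent` class, the role names, the costs, and the greedy rule are hypothetical illustrations, not a method described in the survey.

```python
def assign_roles(bids):
    """Greedy assignment: repeatedly take the cheapest remaining (agent, role) pair.

    bids: dict agent_id -> dict role -> cost (lower is better).
    The rule is deterministic, so every agent that runs it on the
    same bids computes the same assignment.
    """
    pairs = sorted(
        (cost, agent, role)
        for agent, costs in bids.items()
        for role, cost in costs.items()
    )
    assignment, taken = {}, set()
    for _cost, agent, role in pairs:
        if agent not in assignment and role not in taken:
            assignment[agent] = role
            taken.add(role)
    return assignment


class Agent:
    """Hypothetical robot that bids on roles and decides locally."""

    def __init__(self, agent_id, costs):
        self.agent_id = agent_id
        self.costs = costs

    def bid(self):
        # What this agent would broadcast to its peers.
        return self.agent_id, self.costs

    def my_role(self, all_bids):
        # Each agent computes the full assignment itself and reads off its role.
        return assign_roles(all_bids).get(self.agent_id)


agents = [
    Agent("r1", {"scout": 1.0, "carrier": 4.0}),
    Agent("r2", {"scout": 2.0, "carrier": 2.5}),
]
shared_bids = dict(a.bid() for a in agents)
roles = {a.agent_id: a.my_role(shared_bids) for a in agents}
print(roles)

# Dynamic reassignment: if r1 drops out, r2 recomputes from the remaining bids.
remaining = {k: v for k, v in shared_bids.items() if k != "r1"}
print(agents[1].my_role(remaining))
```

The greedy rule is chosen for brevity and is not guaranteed to minimize total cost. The sketch also assumes every agent sees the same set of bids, which a real deployment would have to ensure.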

Connections

Raw Excerpt

Current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems.