Vision-and-Language Navigation for Human-Robot Collaboration: Survey

This note was generated by AI analysis.

Created: 2026-03-22 Source: https://arxiv.org/abs/2512.00027

Summary

A survey of ~200 papers on Vision-and-Language Navigation (VLN) as a framework for human-robot collaboration. Identifies three critical gaps in current systems: bidirectional communication (robots can’t ask clarifying questions), ambiguity resolution (unclear instructions cause unrecoverable failures), and multi-agent coordination. Recommends proactive clarification, real-time feedback, and decentralized decision-making.

Key Points

VLN: agent interprets natural language instructions + visual input to navigate 3D environments
Current systems are unidirectional: humans instruct, robots execute — no clarification possible
Ambiguity in instructions is a critical failure mode with no current solution
Multi-robot systems lack decentralized coordination frameworks
Target domains: healthcare, logistics, disaster response (high-stakes bidirectional scenarios)
Recommendations: proactive clarification, real-time feedback, contextual NLU, dynamic role assignment
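
The proactive clarification recommended above can be illustrated with a small ask-or-act rule. This is only a sketch under stated assumptions, not a method taken from the survey: it assumes the agent exposes a probability for each candidate action, and the names `ask_or_act`, `threshold`, and the door labels are hypothetical. When the normalized entropy of the distribution is high, the agent asks the human to disambiguate instead of committing to a guess.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def ask_or_act(action_probs, threshold=0.5):
    """Return ("act", action) when confident, else ("ask", top two candidates).

    action_probs: dict mapping candidate action name -> non-negative score.
    threshold: hypothetical cutoff on normalized entropy, in [0, 1].
    """
    if not action_probs:
        raise ValueError("need at least one candidate action")
    total = sum(action_probs.values())
    probs = {a: p / total for a, p in action_probs.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    if len(ranked) < 2:
        return ("act", ranked[0])
    # Divide by log(n) so the threshold does not depend on the candidate count.
    h = entropy(probs.values()) / math.log(len(probs))
    if h > threshold:
        return ("ask", ranked[:2])
    return ("act", ranked[0])


confident = ask_or_act({"left_door": 0.95, "right_door": 0.05})
ambiguous = ask_or_act({"left_door": 0.55, "right_door": 0.45})
```

Here `confident` resolves to acting on `left_door`, while `ambiguous` returns both doors as the basis for a clarifying question. The rule is only as good as the probabilities fed into it, so poorly calibrated model outputs would make the threshold unreliable.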

Insights

The bidirectionality gap is fundamental: HRI implies interaction, but most current systems are essentially command-execution pipelines dressed up as “interaction”. This is more a design problem than a purely technical one
Proactive clarification (the robot asks for disambiguation) requires the robot to have a model of its own uncertainty. This connects to uncertainty quantification in neural networks, which is still an open problem in deep learning
The multi-agent gap is interesting in the context of VLAs: GR-Dexter focuses on a single bimanual robot, but real deployment scenarios often involve multiple robots + humans in shared workspaces
“Decentralized decision-making with dynamic role assignment” is the multi-agent robotics equivalent of microservices: each agent has local autonomy but coordinates via shared protocols
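
One way to picture decentralized decision-making with dynamic role assignment is a deterministic rule that every agent evaluates locally over the same shared bids. The sketch below is an illustrative assumption, not the survey's proposal: the function `assign_roles`, the robot ids, the role names, and the cost values are all hypothetical, and the greedy rule is chosen for simplicity and does not guarantee a minimum total cost.

```python
def assign_roles(bids):
    """Greedy role assignment that each agent can compute on its own.

    bids: dict mapping agent id -> dict mapping role -> cost (lower is better).
    Returns a dict mapping role -> agent id.
    """
    # Sort (cost, agent, role) triples. Ties break on agent id, then role name,
    # so every agent that sees the same bids derives the same ordering.
    triples = sorted(
        (cost, agent, role)
        for agent, roles in bids.items()
        for role, cost in roles.items()
    )
    assignment, taken = {}, set()
    for cost, agent, role in triples:
        if role not in assignment and agent not in taken:
            assignment[role] = agent
            taken.add(agent)
    return assignment


# Initial bids, e.g. derived from distance or battery level (hypothetical values).
before = assign_roles({
    "robot_a": {"scout": 2.0, "carrier": 5.0},
    "robot_b": {"scout": 4.0, "carrier": 3.0},
})

# Conditions change, the agents rebroadcast their bids, and the roles swap.
after = assign_roles({
    "robot_a": {"scout": 9.0, "carrier": 5.0},
    "robot_b": {"scout": 2.0, "carrier": 3.0},
})
```

Because the rule is a pure function of the shared bids, no central coordinator is needed: agents only have to agree on the bid set, which is the "shared protocol" in the microservices analogy. Rerunning the function whenever bids change is what makes the assignment dynamic.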

Connections

Raw Excerpt

Current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems.
