Summary
A survey of ~200 papers on Vision-and-Language Navigation (VLN) as a framework for human-robot collaboration. Identifies three critical gaps in current systems: bidirectional communication (robots can’t ask clarifying questions), ambiguity resolution (unclear instructions cause unrecoverable failures), and multi-agent coordination. Recommends proactive clarification, real-time feedback, and decentralized decision-making.
Key Points
- VLN: agent interprets natural language instructions + visual input to navigate 3D environments
- Current systems are unidirectional: humans instruct, robots execute — no clarification possible
- Ambiguous instructions are a critical failure mode: current systems cannot recover once they misinterpret one
- Multi-robot systems lack decentralized coordination frameworks
- Target domains: healthcare, logistics, disaster response (high-stakes bidirectional scenarios)
- Recommendations: proactive clarification, real-time feedback, contextual NLU, dynamic role assignment
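To make the "proactive clarification" recommendation concrete, here is a minimal sketch of an ask-or-act rule. It is not taken from the survey. It assumes the agent already has a probability for each candidate interpretation of an instruction, and it asks a question when the Shannon entropy of that distribution exceeds a threshold. The names `entropy`, `decide`, and `threshold_bits`, and the 0.8-bit default, are hypothetical choices for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def decide(candidates, threshold_bits=0.8):
    """Choose between acting and asking for clarification.

    candidates: dict mapping a candidate goal (str) to its unnormalized score.
    Returns ("act", goal) when the agent is confident enough,
    otherwise ("ask", question) naming the two most likely goals.
    The threshold is an assumed tuning knob, not a published value.
    """
    total = sum(candidates.values())
    probs = {goal: score / total for goal, score in candidates.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    if entropy(probs.values()) <= threshold_bits:
        return ("act", ranked[0])
    return ("ask", "Did you mean " + " or ".join(ranked[:2]) + "?")


if __name__ == "__main__":
    print(decide({"kitchen door": 0.95, "hallway door": 0.05}))
    print(decide({"kitchen door": 0.5, "hallway door": 0.5}))
```

The hard part in practice is producing calibrated probabilities in the first place; the rule above only shows where such an uncertainty estimate would plug in.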
Insights
- The bidirectionality gap is fundamental: HRI implies interaction, but most current systems are essentially command-execution pipelines dressed up as “interaction”. This is more a design problem than a technical one
- Proactive clarification (robot asks for disambiguation) requires the robot to have a model of its own uncertainty — this connects to uncertainty quantification in neural networks, a still-unsolved problem in deep learning
- The multi-agent gap is interesting in the context of VLAs: GR-Dexter focuses on a single bimanual robot, but real deployment scenarios often involve multiple robots + humans in shared workspaces
- “Decentralized decision-making with dynamic role assignment” is the multi-agent robotics equivalent of microservices: each agent has local autonomy but coordinates via shared protocols
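The "local autonomy plus shared protocol" idea in the last bullet can be sketched as follows. This is an illustrative assumption, not a method from the survey: each agent estimates its own cost for every role and broadcasts it, then every agent runs the same deterministic greedy rule on the shared bids, so all agents reach the same assignment without a central coordinator. The function name `assign_roles`, the role names, and the cost values are hypothetical.

```python
def assign_roles(bids):
    """Greedy role assignment that every agent can compute locally.

    bids: dict mapping agent_id -> {role: cost}, where each agent has
    broadcast its own cost estimates. Sorting the (cost, agent, role)
    triples makes the result independent of message arrival order.
    Greedy matching is a simple stand-in here; it is not guaranteed
    to minimize total cost.
    """
    triples = sorted(
        (cost, agent, role)
        for agent, costs in bids.items()
        for role, cost in costs.items()
    )
    assignment, taken = {}, set()
    for cost, agent, role in triples:
        if agent not in assignment and role not in taken:
            assignment[agent] = role
            taken.add(role)
    return assignment


if __name__ == "__main__":
    bids = {
        "r1": {"scout": 1.0, "carrier": 4.0},
        "r2": {"scout": 2.0, "carrier": 2.5},
    }
    print(assign_roles(bids))
    # Dynamic reassignment: if r1 drops out, the remaining agents rerun the rule.
    print(assign_roles({"r2": bids["r2"]}))
```

Rerunning the same rule whenever the set of bids changes is what makes the role assignment "dynamic" in this sketch: an agent that fails or joins simply changes the input every peer sees.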
Connections
- Human-Robot Interaction
- Vision-Language-Action Models
- GR-Dexter: VLA for Bimanual Dexterous Robot Control
- Robotics
- Natural Language Processing
- Multi-Agent Systems
Raw Excerpt
Current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems.