ReMEmbR: Building and Evaluating Long-Horizon Memory for Robots

This note was generated by AI analysis.

Created: 2025-01-01

Summary

ReMEmbR is an NVIDIA system that gives robots long-horizon episodic memory by combining video captioning (the VILA model) with a vector database and an LLM query loop. As a robot explores an environment, VILA generates semantic captions for video frames, which are stored with temporal and spatial embeddings. At query time, an LLM retrieves relevant memories and reasons over them. The system is evaluated on the new NaVQA dataset (210 question-answer examples drawn from 20-minute navigation videos) and deployed on the Nova Carter robot.

Key Points

Architecture: VILA (video captioning) → temporal/spatial embeddings → vector DB → LLM query loop
NaVQA dataset: 210 Q&A examples drawn from 20-minute navigation videos; tests memory retrieval over long horizons
Nova Carter robot: differential-drive reference robot built on NVIDIA's Nova Orin platform, used for physical deployment
Temporal embeddings encode when something was seen; spatial embeddings encode where
The LLM query loop allows multi-hop reasoning: “What was near the blue chair I saw earlier?”
Distinguishes episodic memory (specific experiences) from semantic memory (general knowledge)
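
The storage and retrieval side of the points above can be sketched as a small in-memory store. This is a minimal sketch under stated assumptions, not ReMEmbR's implementation: the names `Memory`, `MemoryStore`, `embed`, and `cosine` are hypothetical, the bag-of-words `embed` stands in for a real text embedding model, a plain Python list stands in for the vector database, and captions are typed by hand where VILA would generate them. Time and place are kept as a raw timestamp and (x, y) coordinates so that each can be searched on its own.

```python
import math
from collections import Counter
from dataclasses import dataclass


def embed(text):
    """Toy sparse bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Memory:
    caption: str      # what was seen (would come from the captioning model)
    timestamp: float  # when it was seen, in seconds since the run started
    position: tuple   # where it was seen, (x, y) in metres in the map frame
    vector: Counter   # embedding of the caption


class MemoryStore:
    """Hypothetical in-memory stand-in for the vector database."""

    def __init__(self):
        self.items = []

    def add(self, caption, timestamp, position):
        self.items.append(Memory(caption, timestamp, position, embed(caption)))

    def search_text(self, query, k=3):
        """Memories whose captions are most similar to the query text."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda m: cosine(q, m.vector), reverse=True)
        return ranked[:k]

    def search_time(self, t, k=3):
        """Memories recorded closest to time t."""
        return sorted(self.items, key=lambda m: abs(m.timestamp - t))[:k]

    def search_position(self, xy, k=3):
        """Memories recorded closest to location xy."""
        return sorted(self.items, key=lambda m: math.dist(m.position, xy))[:k]


# Usage with hand-written captions (illustrative data only).
store = MemoryStore()
store.add("a blue chair next to a window", 60.0, (2.0, 1.0))
store.add("a red fire extinguisher on the wall", 300.0, (10.0, 4.0))
store.add("an empty hallway with a closed door", 900.0, (25.0, 4.5))

hit = store.search_text("red extinguisher", k=1)[0]
print(hit.caption, hit.timestamp, hit.position)
```

Keeping three separate search functions mirrors the split between what, when, and where: a question about "the red object" starts from text similarity, while "what did you see 15 minutes ago" or "what is near this spot" start from time or position.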

Insights

The RAG-for-robots analogy is apt: ReMEmbR is essentially RAG applied to a robot’s perceptual history rather than a document corpus
Temporal and spatial embeddings are crucial: for navigation questions, retrieval without location context would be nearly useless
The NaVQA dataset is a contribution in itself: long-horizon episodic QA is underserved in robotics benchmarks
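
The RAG analogy, together with the multi-hop question from Key Points, can be illustrated with a short retrieval loop. This is a sketch under stated assumptions, not the system's actual code: `MEMORIES` is made-up data, `retrieve_by_text` and `retrieve_by_position` are hypothetical retrieval tools, and `scripted_planner` is a hard-coded stand-in for the LLM that would normally decide which retrieval to run next.

```python
import math

# Hypothetical memory records: (caption, timestamp in seconds, (x, y) in metres).
MEMORIES = [
    ("a blue chair next to a window", 60.0, (2.0, 1.0)),
    ("a potted plant beside a desk", 75.0, (2.5, 1.5)),
    ("a red fire extinguisher on the wall", 300.0, (10.0, 4.0)),
    ("an empty hallway with a closed door", 900.0, (25.0, 4.5)),
]


def retrieve_by_text(query, k=1):
    """Rank memories by how many query words appear in the caption."""
    words = set(query.lower().split())
    return sorted(MEMORIES, key=lambda m: len(words & set(m[0].split())),
                  reverse=True)[:k]


def retrieve_by_position(xy, k=2):
    """Rank memories by distance from the given location."""
    return sorted(MEMORIES, key=lambda m: math.dist(m[2], xy))[:k]


def scripted_planner(question, context):
    """Stand-in for the LLM: choose the next retrieval from what is known so far."""
    if not context:
        # Hop 1: find the object the question refers to (hard-coded here).
        return ("text", "blue chair")
    if len(context) == 1:
        # Hop 2: look around the location where that object was seen.
        return ("position", context[0][2])
    return ("answer", None)


def answer(question, planner=scripted_planner, max_steps=5):
    """Run retrieval steps until the planner decides it has enough context."""
    context = []
    for _ in range(max_steps):
        action, arg = planner(question, context)
        if action == "text":
            found = retrieve_by_text(arg)
        elif action == "position":
            found = retrieve_by_position(arg)
        else:
            break
        # Keep only memories that are not already in the context.
        context.extend([m for m in found if m not in context])
    return context


context = answer("What was near the blue chair I saw earlier?")
for caption, timestamp, position in context:
    print(caption, timestamp, position)
```

The second hop is the part a document-style RAG pipeline lacks: the location returned by the first retrieval becomes the query for the next one, which is why each memory needs a position attached.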

Connections

Connects to RH20T: both address robot learning from rich multi-modal experience; ReMEmbR adds memory over the experience
The VLM prompting techniques from this vault apply here — VILA is itself a VLM being prompted at scale
Related to the open source robotics stack article: memory systems like ReMEmbR are a missing layer in most open-source stacks

Raw Excerpt

“As the robot navigates, VILA generates dense semantic captions for each frame, stored alongside temporal timestamps and spatial coordinates in a vector database. When asked ‘where did you see the red object?’, the LLM retrieves the most relevant memories and reasons over them — enabling the robot to answer questions about events that happened 15 minutes ago in a different room.”
