Summary

VLMPC (RSS 2024) is a key architectural precursor to Semantic-Metric Bayesian Risk Fields. It integrates vision-language models (VLMs) into model predictive control (MPC) by using the VLM to score candidate action sequences: predict future video frames for each candidate sequence → query the VLM on the predicted video → select the candidate with the lowest VLM-scored cost. The cost has two layers: a pixel-level term for visual alignment with the goal + a knowledge-level term from the VLM's semantic evaluation.

Key Points

  • Architecture: action sampling → video prediction → VLM evaluation → action selection
  • Hierarchical cost: pixel-level (did the predicted video reach the goal state?) + knowledge-level (does the VLM consider this a good/safe trajectory?)
  • No explicit safety: VLMPC uses the VLM to judge task-completion quality, not risk or danger, but the architecture extends directly to safety by changing the VLM query
  • Successor Traj-VLMPC: adds trajectory conditioning for more consistent predictions
  • RSS 2024: established the VLM-evaluates-video-prediction pattern before risk-field papers adopted it
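
The four-stage loop and the two-layer cost listed above can be sketched in a few lines. This is a minimal illustration, not VLMPC's implementation: the function names, the 1-D toy world, and the weighted-sum combination of the two cost layers (`knowledge_weight`) are all assumptions made for the example, and the video predictor and VLM are replaced by trivial stand-ins.

```python
import random
from typing import Callable, List

ActionSeq = List[float]    # toy: one scalar displacement per step
Video = List[List[float]]  # toy: each "frame" is a 1-element list holding a position


def sample_candidates(num: int, horizon: int, rng: random.Random) -> List[ActionSeq]:
    """Action sampling: draw random candidate action sequences."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(horizon)] for _ in range(num)]


def select_candidate(
    candidates: List[ActionSeq],
    predict_video: Callable[[ActionSeq], Video],
    pixel_cost: Callable[[Video], float],
    knowledge_cost: Callable[[Video], float],
    knowledge_weight: float = 1.0,  # assumed way of combining the two layers
) -> int:
    """Video prediction -> cost evaluation -> selection of the lowest-cost candidate."""
    costs = []
    for actions in candidates:
        video = predict_video(actions)
        costs.append(pixel_cost(video) + knowledge_weight * knowledge_cost(video))
    return min(range(len(costs)), key=costs.__getitem__)


# --- Hypothetical stand-ins for the learned components ---
GOAL = 1.0


def toy_predict_video(actions: ActionSeq) -> Video:
    """Stand-in for a video prediction model: integrate displacements from 0."""
    position, video = 0.0, []
    for a in actions:
        position += a
        video.append([position])
    return video


def toy_pixel_cost(video: Video) -> float:
    """Stand-in for pixel-level alignment: distance of the last frame to the goal."""
    return abs(video[-1][0] - GOAL)


def toy_knowledge_cost(video: Video) -> float:
    """Stand-in for a VLM judgment: count frames that overshoot the goal."""
    return float(sum(1 for frame in video if frame[0] > GOAL))


if __name__ == "__main__":
    candidates = [[0.5, 0.5], [0.2, 0.2], [1.5, -0.2]]
    best = select_candidate(
        candidates, toy_predict_video, toy_pixel_cost, toy_knowledge_cost
    )
    print(best, candidates[best])
```

In a receding-horizon controller, only the first action of the selected sequence would typically be executed before re-sampling with `sample_candidates` and repeating the loop.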

Insights

VLMPC’s implicit lesson: once you have “VLM evaluates a video prediction of the future action,” safety is one prompt away. Swap “is this trajectory making progress toward the goal?” for “is this trajectory dangerous?” and you have a safety filter.
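
The prompt swap can be made concrete with a small sketch. Everything here is hypothetical: `make_knowledge_cost`, the yes/no answer format, the "Answer yes or no." suffix, and the `fake_vlm` stub are illustrative assumptions, not VLMPC's actual prompts or interface. The point is only that the same cost wrapper serves both purposes once the question and the "bad" answer are changed.

```python
from typing import Callable, List

Video = List[List[float]]  # toy: each "frame" is a 1-element list holding a position

# The two questions from the text; the yes/no suffix is an assumption.
TASK_PROMPT = "Is this trajectory making progress toward the goal? Answer yes or no."
SAFETY_PROMPT = "Is this trajectory dangerous? Answer yes or no."


def make_knowledge_cost(
    query_vlm: Callable[[Video, str], str],
    prompt: str,
    bad_answer: str,
    penalty: float = 1.0,
) -> Callable[[Video], float]:
    """Wrap a VLM query as a cost: `penalty` if the answer starts with `bad_answer`."""
    def cost(video: Video) -> float:
        answer = query_vlm(video, prompt).strip().lower()
        return penalty if answer.startswith(bad_answer) else 0.0
    return cost


def fake_vlm(video: Video, prompt: str) -> str:
    """Hypothetical stub standing in for a real VLM call."""
    if "dangerous" in prompt:
        # Pretend any position beyond 1.0 enters a hazardous region.
        return "Yes" if any(frame[0] > 1.0 for frame in video) else "No"
    # Pretend "progress" means the last frame is further along than the first.
    return "Yes" if video[-1][0] > video[0][0] else "No"


# Same wrapper, different question and different "bad" answer.
task_cost = make_knowledge_cost(fake_vlm, TASK_PROMPT, bad_answer="no")
safety_cost = make_knowledge_cost(fake_vlm, SAFETY_PROMPT, bad_answer="yes")

if __name__ == "__main__":
    hazardous_video = [[0.5], [1.5]]
    print(task_cost(hazardous_video), safety_cost(hazardous_video))
```

Note that the hazardous trajectory above still counts as progress under the task prompt, which is why a task-quality cost alone gives no safety signal.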

Semantic-Metric Bayesian Risk Fields can be understood as: “take VLMPC’s knowledge-level cost, specialize it for risk/danger, add Bayesian spatial grounding, and train it from human demonstration videos rather than from a goal specification.”

Connections