VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation

This note was generated by AI analysis.

Created: 2024-07-13 Source: https://arxiv.org/abs/2407.09829

Summary

VLMPC (RSS 2024) is a key architectural precursor to Semantic-Metric Bayesian Risk Fields. It integrates vision-language models (VLMs) into model predictive control (MPC) by using the VLM to evaluate candidate action sequences: for each candidate, predict the future video frames, query the VLM on the predicted video, and select the candidate with the lowest VLM-scored cost. The cost has two layers: pixel-level visual alignment with the goal, and knowledge-level semantic evaluation.


Key Points

Architecture: action sampling → video prediction → VLM evaluation → action selection
Hierarchical cost: pixel-level (did the predicted video reach the goal state?) + knowledge-level (does the VLM consider this a good/safe trajectory?)
No explicit safety: VLMPC uses the VLM to score task-completion quality, not risk or danger. The architecture extends directly to safety by changing the VLM query.
Successor Traj-VLMPC: adds trajectory conditioning for more consistent predictions
RSS 2024: established the VLM-evaluates-video-prediction pattern before the risk-field papers adopted it
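
The four-stage loop and the two-layer cost listed above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: frames are flat lists of numbers, the "video predictor" just shifts pixel values, `stub_vlm_score` stands in for a real VLM call, and combining the two cost layers as a weighted sum with `weight` is an assumption. All function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # stand-in for an image: a flat list of pixel values
Action = float       # stand-in for a 1-D end-effector displacement


def sample_actions(num_candidates: int, horizon: int,
                   rng: random.Random) -> List[List[Action]]:
    """Action sampling: draw candidate sequences uniformly from [-1, 1] (assumed sampler)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
            for _ in range(num_candidates)]


def predict_video(frame: Frame, actions: Sequence[Action]) -> List[Frame]:
    """Toy video prediction: each action shifts every pixel value by that amount."""
    video, current = [], list(frame)
    for a in actions:
        current = [p + a for p in current]
        video.append(current)
    return video


def pixel_cost(video: List[Frame], goal: Frame) -> float:
    """Pixel-level cost: mean squared distance from the last predicted frame to the goal."""
    last = video[-1]
    return sum((p - g) ** 2 for p, g in zip(last, goal)) / len(goal)


def stub_vlm_score(video: List[Frame]) -> float:
    """Hypothetical VLM stand-in: penalise each frame whose mean value leaves [0, 2]."""
    return sum(1.0 for f in video if not 0.0 <= sum(f) / len(f) <= 2.0)


def select_action(frame: Frame, goal: Frame,
                  vlm_score: Callable[[List[Frame]], float],
                  num_candidates: int = 64, horizon: int = 3,
                  weight: float = 1.0, seed: int = 0) -> Tuple[List[Action], float]:
    """Action selection: return the candidate sequence with the lowest combined cost."""
    rng = random.Random(seed)
    candidates = sample_actions(num_candidates, horizon, rng)

    def total(actions: Sequence[Action]) -> float:
        video = predict_video(frame, actions)
        # Assumed combination: pixel-level cost plus weighted knowledge-level cost.
        return pixel_cost(video, goal) + weight * vlm_score(video)

    best = min(candidates, key=total)
    return best, total(best)


if __name__ == "__main__":
    best, cost = select_action([0.0, 0.0], [1.0, 1.0], stub_vlm_score)
    print("best action sequence:", best, "cost:", cost)
```

In a receding-horizon setting, only the first action of `best` would be executed before the loop is run again from the newly observed frame.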

Insights

VLMPC’s implicit lesson: once you have “VLM evaluates a video prediction of the future action,” safety is one prompt away. Swap “is this trajectory making progress toward the goal?” for “is this trajectory dangerous?” and you have a safety filter.
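
The prompt swap can be illustrated with a minimal sketch. The two prompt strings come from the paragraph above; everything else is an assumption for illustration: `query_vlm` is a hypothetical callable returning a "yes" probability in [0, 1], predicted videos are represented as lists of text captions, and `stub_query_vlm` is a keyword matcher standing in for a real model.

```python
from typing import Callable, List

Video = List[str]  # stand-in: each "frame" is a short caption of the predicted scene
QueryFn = Callable[[str, Video], float]  # hypothetical VLM interface: (prompt, video) -> P(yes)

PROGRESS_PROMPT = "Is this trajectory making progress toward the goal?"
DANGER_PROMPT = "Is this trajectory dangerous?"


def progress_cost(video: Video, query_vlm: QueryFn) -> float:
    """Task cost: low when the VLM says the trajectory is making progress."""
    return 1.0 - query_vlm(PROGRESS_PROMPT, video)


def danger_cost(video: Video, query_vlm: QueryFn) -> float:
    """Safety cost: same pipeline, different prompt; high when the VLM says 'dangerous'."""
    return query_vlm(DANGER_PROMPT, video)


def safety_filter(videos: List[Video], query_vlm: QueryFn,
                  max_danger: float = 0.5) -> List[int]:
    """Return indices of candidates whose danger cost does not exceed max_danger."""
    return [i for i, v in enumerate(videos)
            if danger_cost(v, query_vlm) <= max_danger]


def stub_query_vlm(prompt: str, video: Video) -> float:
    """Hypothetical VLM stand-in: fraction of captions containing a prompt-specific keyword."""
    keyword = "knife" if "dangerous" in prompt else "closer"
    return sum(keyword in caption for caption in video) / len(video)


if __name__ == "__main__":
    videos = [
        ["gripper moves closer to cup", "gripper closer to cup"],
        ["gripper moves closer to cup", "gripper sweeps past knife"],
    ]
    print(safety_filter(videos, stub_query_vlm, max_danger=0.25))
```

The only difference between the task cost and the safety cost here is the prompt and the sign convention, which is the point of the observation above. Whether a real VLM answers such a question reliably is a separate matter that this sketch does not address.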

Semantic-Metric Bayesian Risk Fields can be understood as: “take VLMPC’s knowledge-level cost, specialize it for risk/danger, add Bayesian spatial grounding, and train it from human demonstration videos rather than goal specification.”

Connections

Clippings-semantic-metric-bayesian-risk-fields-vlm-robot-safety — direct architectural successor for safety applications
Clippings-geometry-aware-4d-video-generation-robot-manipulation — similar “generate video → extract trajectory” pattern
vlm
mpc
robotics
