Semantic-Metric Bayesian Risk Fields: Learning Robot Safety from Human Videos with a VLM Prior

This article was generated by AI analysis.

Created: 2025-12-09 Source: https://arxiv.org/abs/2512.08233

Summary

Stanford, 2025. The most direct example of using a VLM as a safety oracle in robotics. The system trains a Bayesian risk field: a VLM supplies semantic common sense as the prior (for example, that a knife is dangerous), and a learned ViT modulates this prior into spatially precise, pixel-aligned risk maps, supervised by safe human video demonstrations. The risk maps are compatible with both visuomotor planners and classical 3D trajectory optimization.


Key Points

Two-component Bayesian structure: the VLM prior captures semantic context; the learned ViT likelihood captures spatial precision from real demonstrations
Supervision signal: human demonstration videos, where regions that humans avoid are treated as high risk; no manual cost-function specification is needed
Output format: a pixel-dense risk image that is directly usable in existing planners
Context-sensitive safety: understands that a knife is riskier than a spatula in the same geometric configuration, a distinction that purely metric safety methods cannot make
Human alignment: outperforms standalone VLM risk assessment by grounding semantic priors in real spatial behavior
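
To make the "directly usable in existing planners" point concrete, here is a minimal sketch of how a pixel-dense risk image could act as a cost term when choosing between candidate image-space paths. The function names (`path_risk`, `pick_safest`), the 3x3 map, and the pure risk-sum cost are all assumptions for illustration; the note does not describe how the paper's planners actually consume the risk map.

```python
import numpy as np

def path_risk(risk_map, path):
    """Sum the per-pixel risk along a path given as (row, col) pairs."""
    rows, cols = zip(*path)
    return float(risk_map[list(rows), list(cols)].sum())

def pick_safest(risk_map, candidates):
    """Return the index of the lowest-risk candidate path and all costs."""
    costs = [path_risk(risk_map, p) for p in candidates]
    return int(np.argmin(costs)), costs

# Toy risk image (hypothetical values): one high-risk pixel in the center,
# standing in for something like a knife blade.
risk_map = np.zeros((3, 3))
risk_map[1, 1] = 0.9

straight = [(1, 0), (1, 1), (1, 2)]                 # cuts through the center
detour = [(1, 0), (0, 0), (0, 1), (0, 2), (1, 2)]   # goes around it

best, costs = pick_safest(risk_map, [straight, detour])
```

In this toy setup the detour wins because the cost counts risk only. A practical planner would presumably trade risk off against path length or task progress, which this sketch leaves out.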

Insights

The key architectural insight: VLMs alone provide context but not precision (they can say “the knife is dangerous” but not “exactly 5 cm to the left is dangerous”). The learned ViT provides the spatial grounding that VLMs lack. This two-stage decomposition cleanly separates semantic understanding from spatial precision.
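
One way to read this decomposition is as a per-pixel Bayes update, in which a spatially coarse prior is sharpened by a spatially precise likelihood term. The sketch below uses the odds form of Bayes' rule. The function name `fuse_risk`, the likelihood-ratio parameterization, and the toy numbers are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def fuse_risk(prior, likelihood_ratio):
    """Per-pixel Bayes update in odds form.

    prior: (H, W) array in (0, 1); hypothetical VLM-derived P(risky).
    likelihood_ratio: (H, W) positive array; hypothetical learned
        P(observation | risky) / P(observation | safe).
    Returns the (H, W) posterior P(risky | observation).
    """
    prior = np.clip(prior, 1e-6, 1 - 1e-6)   # avoid division by zero
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# Semantic prior: "something dangerous is in this scene", no spatial detail.
prior = np.full((4, 4), 0.5)

# Spatial evidence: strongly risky at one pixel, strongly safe at another,
# uninformative everywhere else.
lr = np.ones((4, 4))
lr[1, 2] = 9.0
lr[3, 0] = 1.0 / 9.0

risk = fuse_risk(prior, lr)
```

With a flat prior of 0.5, the posterior moves to 0.9 where the likelihood ratio is 9, to 0.1 where it is 1/9, and stays at 0.5 where the evidence is uninformative. This mirrors the split described above: the prior sets the overall level and the likelihood localizes it.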

The supervision signal (human demonstration videos) is remarkable: you get safety supervision for free from existing video datasets, without needing manually labeled danger zones.

The implicit assumption is that human behavior in the videos is safe: if the training data contains unsafe behavior, the risk field will be miscalibrated. This is the same distributional assumption as RWM-U.
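
The effect of this assumption can be illustrated with a toy pseudo-label, where the risk of a grid cell is the smoothed fraction of demonstrations that avoided it. The function `avoidance_risk`, the grid, and the Laplace smoothing are hypothetical; the note does not say how the paper converts videos into supervision, so this sketch shows only the calibration concern.

```python
import numpy as np

def avoidance_risk(trajectories, shape, smoothing=1.0):
    """Toy pseudo-label: cells that fewer demonstrations visit get higher risk.

    trajectories: list of demos, each a list of (row, col) cells visited.
    Returns the Laplace-smoothed fraction of demos that avoided each cell.
    """
    visits = np.zeros(shape)
    for traj in trajectories:
        for r, c in set(traj):          # count each demo once per cell
            visits[r, c] += 1
    n = len(trajectories)
    return (n - visits + smoothing) / (n + 2 * smoothing)

# A 1x3 strip with a hazard in the middle cell (0, 1).
safe_demos = [[(0, 0), (0, 2)]] * 8            # always skip the hazard
unsafe_demos = [[(0, 0), (0, 1), (0, 2)]] * 8  # pass straight through it

clean = avoidance_risk(safe_demos, (1, 3))
mixed = avoidance_risk(safe_demos + unsafe_demos, (1, 3))
```

With only safe demonstrations the hazard cell is labeled 0.9. Once an equal number of unsafe demonstrations is mixed in, the same cell drops to 0.5, even though the hazard itself has not changed.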

Connections

Clippings-llm-vlm-controlled-robotics-vulnerability: complementary; this paper uses VLMs for safety, while that paper shows VLMs are themselves vulnerable
Clippings-safety-security-cognitive-risks-world-models: the VLM prior could itself be a threat surface (a VLM could be prompted to misclassify risk)
world-models
safety
vlm
robotics
