This document was generated by AI analysis.
Created: 2025-12-09 Source: https://arxiv.org/abs/2512.08233
Summary
Stanford 2025 — The most direct example of using a VLM as a safety oracle in robotics. The system trains a Bayesian risk field: a VLM provides semantic common sense as the prior (understanding that a knife is dangerous), and a learned ViT modulates this prior into spatially precise, pixel-aligned risk maps using safe human video demonstrations as supervision. Risk maps are compatible with both visuomotor planners and classical 3D trajectory optimization.
Key Points
- Two-component Bayesian structure: VLM prior captures semantic context; learned ViT likelihood captures spatial precision from real demonstrations
- Supervision signal: human demonstration videos, where regions that humans avoid are treated as high risk; no manually specified cost function is needed
- Output format: pixel-dense risk image → directly usable in existing planners
- Context-sensitive safety: understands that a knife is riskier than a spatula in the same geometric configuration, a distinction that purely metric (geometry-only) safety methods cannot make
- Human alignment: outperforms standalone VLM risk assessment by grounding semantic priors in real spatial behavior
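
To make the "directly usable in existing planners" point concrete, here is a minimal sketch of how a pixel-dense risk image could score candidate paths. It is an illustration only: the names `path_risk` and `safest_path`, the `[row][col]` layout, and the sum-of-risk selection rule are assumptions of this note, not the planner interface described in the paper.

```python
from typing import List, Sequence, Tuple

RiskMap = List[List[float]]  # risk_map[row][col], values assumed to lie in [0, 1]
Pixel = Tuple[int, int]      # (row, col)


def path_risk(risk_map: RiskMap, path: Sequence[Pixel]) -> float:
    """Sum the per-pixel risk along a path projected into the image plane."""
    return sum(risk_map[r][c] for r, c in path)


def safest_path(risk_map: RiskMap,
                candidates: Sequence[Sequence[Pixel]]) -> Sequence[Pixel]:
    """Pick the candidate with the lowest accumulated risk (hypothetical rule)."""
    return min(candidates, key=lambda p: path_risk(risk_map, p))


# Toy 3x3 risk image: the centre pixel is high risk (for instance, a knife blade).
toy_risk = [
    [0.0, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.0],
]
through_centre = [(1, 0), (1, 1), (1, 2)]
around_top = [(1, 0), (0, 0), (0, 1), (0, 2), (1, 2)]
best = safest_path(toy_risk, [through_centre, around_top])
```

In this toy case the longer detour wins because it avoids the high-risk centre pixel. A real planner would typically combine such a risk term with path length, smoothness, and goal costs.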
Insights
The key architectural insight: VLMs alone give you context but not precision (they can say “the knife is dangerous” but not “exactly 5 cm to the left is dangerous”). The learned ViT provides the spatial grounding that VLMs lack. This two-stage decomposition separates semantic understanding from spatial precision.
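
The prior/likelihood split can be illustrated with a generic per-pixel Bayes update. This note does not record the paper's actual equations, so the sketch below is an assumption: it treats "risky" as a binary variable per pixel, takes the prior from the VLM and the two likelihood terms from the learned model, and all function names and numbers are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def posterior_risk(prior: float, lik_risky: float, lik_safe: float) -> float:
    """Bayes' rule for a binary 'risky' variable at one pixel."""
    numerator = prior * lik_risky
    evidence = numerator + (1.0 - prior) * lik_safe
    # Fall back to the prior when the evidence term is zero.
    return numerator / evidence if evidence > 0.0 else prior


def fuse_maps(prior_map: Grid, lik_risky_map: Grid, lik_safe_map: Grid) -> Grid:
    """Apply the per-pixel update across three equally sized grids."""
    return [
        [posterior_risk(p, a, b) for p, a, b in zip(p_row, a_row, b_row)]
        for p_row, a_row, b_row in zip(prior_map, lik_risky_map, lik_safe_map)
    ]


# Same geometry (identical likelihood terms), different semantic priors.
knife = posterior_risk(prior=0.6, lik_risky=0.8, lik_safe=0.2)
spatula = posterior_risk(prior=0.1, lik_risky=0.8, lik_safe=0.2)
```

With identical spatial evidence, the higher semantic prior for the knife yields a higher posterior risk than for the spatula, which is the context sensitivity the note describes.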
The supervision signal (human demonstration videos) is remarkable: you get safety supervision for free from existing video datasets, without needing manually labeled danger zones.
The implicit assumption is that human behavior in videos is safe — if the training data contains unsafe behavior, the risk field will be miscalibrated. This is the same distributional assumption as RWM-U.
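
The "where humans avoid = high risk" idea, and the way unsafe demonstrations would distort it, can be sketched with a simple visit-count heuristic. This is not the paper's training objective, which this note does not describe; `avoidance_targets` and the grid representation are hypothetical and serve only to show the calibration issue.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col) of a grid cell visited by the hand or tool


def avoidance_targets(height: int, width: int,
                      demos: Sequence[Sequence[Cell]]) -> List[List[float]]:
    """Pseudo-label each cell with 1 - (fraction of demos that visit it)."""
    visits = [[0] * width for _ in range(height)]
    for demo in demos:
        for r, c in set(demo):  # count each cell at most once per demo
            visits[r][c] += 1
    n = max(len(demos), 1)  # avoid division by zero when there are no demos
    return [[1.0 - v / n for v in row] for row in visits]


# Two safe demos stay on the top row and never enter the bottom row.
safe_demos = [[(0, 0), (0, 1), (0, 2)], [(0, 0), (0, 1), (0, 2)]]
clean = avoidance_targets(2, 3, safe_demos)

# One unsafe demo cuts through cell (1, 1), lowering its risk target.
mixed = avoidance_targets(2, 3, safe_demos + [[(0, 0), (1, 1), (0, 2)]])
```

With only safe demonstrations, the avoided bottom row receives a target of 1.0. Adding a single unsafe demonstration drops the target at cell (1, 1) to about 0.67, a toy version of the miscalibration described above.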
Connections
- Clippings-llm-vlm-controlled-robotics-vulnerability — complementary: this paper uses VLMs for safety; that paper shows VLMs are themselves vulnerable
- Clippings-safety-security-cognitive-risks-world-models — the VLM prior could itself be a threat surface (VLM could be prompted to misclassify risk)
- world-models
- safety
- vlm
- robotics