SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

This note was generated by AI analysis.

Created: 2025-03-05 Source: https://arxiv.org/abs/2503.03480

Summary

SafeVLA (NeurIPS 2025 Spotlight) is the VLA-era counterpart to SafeDreamer: both come from PKU-Alignment, and both build on constrained RL in the constrained Markov decision process (CMDP) framework. SafeDreamer enforces safety constraints inside a DreamerV3 latent world model, whereas SafeVLA enforces them directly on a VLA policy. Its Integrated Safety Approach (ISA) systematically elicits unsafe VLA behaviors, then constrains the policy against them via safe RL. The reported result is an 83.58% reduction in safety violations while task performance is maintained.
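
For reference, a CMDP objective of the kind mentioned above is usually written as follows. This is the standard textbook form, not a formula quoted from the paper; the symbols (policy \pi, reward r, safety cost c, discount \gamma, cost budget d) are notation assumed here for illustration.

```latex
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \right]
\quad \text{s.t.} \quad
J_c(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, c(s_t, a_t) \right] \le d
```

The policy is rewarded for completing the task, but only policies whose expected cumulative safety cost stays within the budget d are feasible.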


Key Points

CMDP paradigm: maximize reward subject to a constraint on cumulative safety cost, the same framework SafeDreamer uses
Unsafe behavior elicitation: proactively generates failure modes the model has not yet encountered, improving constraint coverage
ISA pipeline: requirements → elicitation → constrained training → targeted evaluation
Benchmark: Safety-CHORES, a set of long-horizon mobile manipulation tasks with diverse safety requirements
Spotlight at NeurIPS 2025: suggests the community increasingly treats VLA safety alignment as an established research direction
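
The "constrained training" step can be illustrated with Lagrangian relaxation, a common way to handle a cumulative-cost constraint in safe RL. The sketch below is a minimal illustration under that assumption; this note does not state which solver SafeVLA uses, and the class name `LagrangeMultiplier`, the `cost_limit` budget, and the step size `lr` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LagrangeMultiplier:
    """Dual variable for one constraint: expected episode cost <= cost_limit.

    All names and default values here are hypothetical, chosen for illustration.
    """

    cost_limit: float   # assumed cost budget for the constraint
    lr: float = 0.05    # assumed dual step size
    value: float = 0.0  # the multiplier (lambda), kept non-negative

    def update(self, mean_episode_cost: float) -> float:
        # Dual ascent: raise lambda when the measured cost exceeds the budget,
        # lower it otherwise, and project back onto lambda >= 0.
        step = self.lr * (mean_episode_cost - self.cost_limit)
        self.value = max(0.0, self.value + step)
        return self.value

    def penalized_advantage(self, reward_adv: float, cost_adv: float) -> float:
        # The policy update maximizes reward advantage minus the
        # lambda-weighted cost advantage. Dividing by (1 + lambda) keeps
        # the magnitude bounded as lambda grows.
        return (reward_adv - self.value * cost_adv) / (1.0 + self.value)


if __name__ == "__main__":
    lam = LagrangeMultiplier(cost_limit=1.0)
    # Made-up mean episode costs from three successive training iterations.
    for cost in [3.0, 2.0, 0.5]:
        print(f"cost={cost:.1f} -> lambda={lam.update(cost):.3f}")
```

The multiplier grows while the policy exceeds its cost budget, which shifts the policy gradient toward cost reduction, and it decays once the policy is back within budget, letting task reward dominate again.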

Insights

SafeVLA is the natural synthesis of two prior lines: (1) CMDP-based safe RL for traditional RL agents (SafeDreamer), and (2) VLA foundation models as robot policies. By combining them, SafeVLA allows the VLA’s semantic understanding to inform what constitutes unsafe behavior — something a pure reward/cost function cannot capture.

The unsafe behavior elicitation step is noteworthy: instead of relying on ordinary environment rollouts to surface violations, it actively generates them in an adversarial manner. This is the VLA equivalent of Parmar 2026’s “safety probing” concept.

Connections

Clippings-safedreamer-safe-reinforcement-learning-world-models — same CMDP framework, different substrate (VLA vs. world model)
Clippings-semantic-metric-bayesian-risk-fields-vlm-robot-safety — both use VLM semantics for safety, different architectural role (external oracle vs. internal policy)
vla
safe-rl
robotics
