World Models for Robot Safety
Research Question
Which approaches use world models — learned predictive models of environment dynamics — to achieve safety in robotics? This survey covers safe RL via constraint satisfaction in latent space, uncertainty-penalized offline learning, simulation-based safety probing, and deployment-time risk analysis.
Knowledge Map
- World Models (RSSM / DreamerV3) — prerequisite because all surveyed safe-RL methods build on the Dreamer family of latent-space world models; understanding RSSM (Recurrent State-Space Model) is essential for reading SafeDreamer or RWM-U
- Safe Reinforcement Learning (SafeRL) — the dominant framing: separate reward and cost signals, constrain cumulative cost below a threshold; Lagrangian relaxation and CPO are the standard baselines that world-model methods improve upon
- Lagrangian Methods for Constrained Optimization: SafeDreamer’s core mechanism; dual-variable optimization enforces an expected cumulative-cost constraint during imagined rollouts, which is a soft constraint rather than a hard per-step one
- Epistemic Uncertainty in Deep Learning — ensembles, MC Dropout, or conformal prediction as proxies for “how much does the model know about this state?”; prerequisite for RWM-U and MOPO-style offline RL
- Offline / Batch RL — policy optimization from a fixed dataset with no online interaction; distribution shift and compounding errors are the key failure modes that uncertainty-penalization addresses
- Model Predictive Control (MPC) with Safety Filters — the classical robotics safety baseline; CBF-based safety filters and NMPC with constraints are the non-ML approaches that world-model methods compete with and complement
- Adversarial ML and Alignment — needed to understand the threat taxonomy from Parmar 2026; data poisoning, latent corruption, and reward hacking apply specifically to world model components
- Safety-Gymnasium Benchmark — the standard evaluation suite for SafeRL; point-goal, car-goal, and doggo-goal tasks with cost signals; knowing the benchmark is needed to interpret SafeDreamer’s results
Sources Gathered
New sources clipped and analyzed during this research:
- Clippings-safedreamer-safe-reinforcement-learning-world-models — SafeDreamer (ICLR 2024): Lagrangian constraints in DreamerV3 latent space, near-zero cost violations on Safety-Gymnasium
- Clippings-uncertainty-aware-robotic-world-model-offline-rl — RWM-U (ETH Zurich, 2025): ensemble uncertainty penalization for offline MBRL, deployed on real ANYmal quadruped and humanoid
- Clippings-safety-security-cognitive-risks-world-models — Parmar 2026: threat taxonomy for world models (adversarial, alignment, human factors), governance frameworks
- Clippings-world-model-robot-learning-comprehensive-survey-2605 — Comprehensive survey (2026, Abbeel/Malik/Wu): world models as simulators for offline policy evaluation, OOD testing, safety probing
Extended scope sources (added 2026-05-14):
- Clippings-semantic-metric-bayesian-risk-fields-vlm-robot-safety — Semantic-Metric Bayesian Risk Fields (Stanford, Dec 2025): VLM prior + ViT spatial grounding for pixel-dense risk maps
- Clippings-llm-vlm-controlled-robotics-vulnerability — VLM/LLM vulnerability in robotics (2024): 14–22% success drop from simple perturbations
- Clippings-safevla-safety-alignment-vla-constrained-learning — SafeVLA (NeurIPS 2025 Spotlight): CMDP safety alignment for VLA models, 83.58% violation reduction
- Clippings-vlmpc-vision-language-model-predictive-control — VLMPC (RSS 2024): VLM-as-cost-function in MPC, architectural precursor to VLM safety filters
Existing vault notes referenced:
- Clippings-130-robotics-world-model-reading-club-01 — SF Reading Club (2026-03-28): VLA → WAM paradigm shift and the FastWAM insight that inference differs from training
- Clippings-beyond-the-hype-how-i-see-world-models-evolving-in-2025-nemos-blog — Researcher perspective on world model maturity: action space heterogeneity as a fundamental barrier
- Clippings-particleformer-3d-point-cloud-world-model-robot-manipulation — ParticleFormer: Transformer on 3D point clouds for multi-material dynamics, used with MPPI for safe manipulation
- Clippings-geometry-aware-4d-video-generation-robot-manipulation — Geometry-aware 4D Video Generation (ICLR 2026): cross-view pointmap supervised video diffusion as implicit world model
Key Findings
1. Lagrangian constraints in latent space are the dominant safe-RL approach. SafeDreamer shows that imagining cost rollouts inside the world model and applying dual-variable optimization achieves near-zero constraint violations — something model-free SafeRL cannot reliably do, especially on vision-only tasks. The world model’s differentiable simulator is what makes this tractable.
2. Uncertainty penalization is the primary safety mechanism for offline learning. RWM-U (ETH Zurich) treats “being uncertain about a state” as a safety signal: policies are penalized for imagining transitions the world model doesn’t understand. This creates a data-coverage-shaped safety envelope — deployable without formal constraints and validated on real hardware.
3. World models enable simulation-based safety evaluation, not just safe exploration. The 2026 comprehensive survey frames the main safety value as offline policy evaluation and OOD probing: generate adversarial scenarios in the world model before touching hardware. This is the most practically impactful use of world models for safety in near-term deployments.
4. The world model itself is an under-examined attack surface. Parmar 2026 identifies a gap: most safety work constrains the policy, but the world model is equally vulnerable. Data poisoning, latent space corruption, and planning hallucination (confident-but-wrong predictions) can bypass all policy-level safety filters.
5. There is no unified safety framework across approaches. Safe-RL methods (SafeDreamer), offline methods (RWM-U), and simulation-based safety probing target different threat models and make different assumptions. No single method handles all three.
Open Questions
- Can Lagrangian constraints in world models provide formal guarantees, or only empirical near-zero violations? The gap between “nearly zero” and “provably zero” matters for deployment.
- How do world-model safety methods handle model inaccuracy? SafeDreamer assumes the world model is accurate enough; RWM-U uses uncertainty to flag inaccuracy — but neither fully solves model error as a safety threat.
- What happens when the world model hallucinates a safe path through an unsafe region? Parmar 2026 names this “planning hallucination” but no solution exists yet.
- How do world-model safety approaches scale to contact-rich manipulation with discontinuous dynamics? Current benchmarks (Safety-Gymnasium) focus on locomotion and navigation, not manipulation.
- Multi-agent safety: all surveyed methods are single-agent. Multi-robot or human-robot settings require modeling other agents’ policies in the world model.
Report
The Core Idea
A world model is a learned function that predicts the next latent state (and reward/cost) given the current state and action. Once you have a differentiable simulator of this form, you can do two things for safety: (1) constrain what actions the policy is allowed to take by checking their predicted consequences in the world model, and (2) flag states where the world model is unreliable and avoid them.
Both strategies are active areas of research, and together they cover the most important failure modes in robot safety: constraint violations and overconfident extrapolation.
Safe RL in Latent Space: SafeDreamer
SafeDreamer (ICLR 2024, PKU-Alignment group) is the clearest example of strategy (1). It uses DreamerV3’s RSSM world model — which maps image observations to compact latent states and predicts forward dynamics — as the substrate for safety enforcement.
The key mechanism is Lagrangian relaxation applied to imagined rollouts. During training, the actor imagines a sequence of future states and actions inside the world model. For each imagined trajectory, SafeDreamer computes not just the predicted reward but also the predicted safety cost (e.g., probability of collision or constraint violation). A Lagrange multiplier scales the cost penalty: if the recent actual cost exceeds the safety threshold, the multiplier increases, pushing the actor to find lower-cost trajectories. This outer loop adjusts the Lagrange multiplier; the inner loop trains the actor against the penalized objective.
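The multiplier update described above can be sketched as plain dual ascent. This is a minimal illustration under stated assumptions: the function names, the step size, and the cost values are hypothetical, and SafeDreamer's actual update rule may differ from this simple form.

```python
def update_multiplier(lmbda, episode_cost, cost_limit, lr=0.05):
    """Dual ascent step (assumed form): raise lambda when the observed cost
    exceeds the limit, lower it otherwise, and never let it go negative."""
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))


def penalized_return(imagined_reward, imagined_cost, lmbda):
    """Actor objective for one imagined trajectory under the current multiplier."""
    return imagined_reward - lmbda * imagined_cost


# Outer loop: the multiplier tracks how far recent real cost is from the limit.
cost_limit = 2.0
observed_costs = [5.0, 4.0, 3.0, 2.0, 1.0]  # hypothetical per-episode costs
lmbda = 0.0
history = []
for cost in observed_costs:
    lmbda = update_multiplier(lmbda, cost, cost_limit)
    history.append(lmbda)

# Inner loop (not shown): train the actor to maximize penalized_return
# over trajectories imagined inside the world model.
```

The multiplier rises while the observed cost stays above the limit, holds steady when the cost equals it, and starts to fall once the cost drops below it.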
The result is near-zero constraint violations on Safety-Gymnasium, including vision-only tasks where model-free SafeRL methods (CPO, PCPO, PPO-Lag) consistently fail. The reason vision-based SafeRL is hard: it requires predicting safety consequences from raw pixels, which has high variance. The world model handles this by compressing pixels to latents first, making cost prediction much more tractable.
SafeDreamer’s limitation: safety guarantees are only as strong as the world model’s predictive fidelity. If the world model is wrong about cost consequences in novel states, the policy may inadvertently violate constraints in real deployment.
Uncertainty as a Safety Signal: RWM-U
RWM-U (ETH Zurich, arXiv 2504.16680, 2025) addresses a different failure mode — compounding errors in offline MBRL. The problem: policies trained on imagined rollouts from offline data tend to exploit regions where the world model makes optimistic errors (hallucinating high-reward, low-cost paths into uncharted territory).
RWM-U’s solution is epistemic uncertainty estimation via ensembles: train N world models on the same offline dataset; their disagreement on a predicted next state is the epistemic uncertainty signal. This signal is used in MOPO-PPO to penalize imagined transitions where the ensemble disagrees — effectively making the policy pessimistic about unfamiliar states.
This is a data-coverage-shaped safety envelope. The offline dataset defines the region of state space the world model understands. By penalizing uncertainty, RWM-U confines the policy to this understood region — which, assuming the training data was collected safely, corresponds to safe configurations.
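A minimal sketch of the ensemble-disagreement penalty, assuming disagreement is measured as the mean per-dimension standard deviation of the members' next-state predictions. The function names, the weight `beta`, and the toy predictions are illustrative assumptions, not RWM-U's actual formulation.

```python
import statistics


def ensemble_disagreement(predictions):
    """Mean per-dimension population standard deviation across ensemble members.

    `predictions` is a list with one predicted next-state vector per member.
    """
    per_dimension = list(zip(*predictions))  # regroup values by state dimension
    return statistics.fmean([statistics.pstdev(dim) for dim in per_dimension])


def penalized_reward(reward, predictions, beta=1.0):
    """MOPO-style pessimism: subtract beta times the disagreement from the reward."""
    return reward - beta * ensemble_disagreement(predictions)


# Three hypothetical ensemble members predicting a 2-D next state.
in_distribution = [[1.0, 0.5], [1.0, 0.5], [1.0, 0.5]]          # members agree
out_of_distribution = [[1.0, 0.5], [3.0, -0.5], [-1.0, 1.5]]    # members disagree

familiar = penalized_reward(1.0, in_distribution)
unfamiliar = penalized_reward(1.0, out_of_distribution)
```

With these toy numbers the familiar transition keeps its full reward, while the unfamiliar one is pushed below zero, which is the pessimism that keeps the policy inside the region the dataset covers.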
What makes this paper particularly credible is the real-hardware evaluation on ANYmal quadruped and humanoid robots, trained entirely from offline datasets. Most offline RL papers stop at simulation; RWM-U goes further.
Simulation-Based Safety Probing
The 2026 survey by Hou et al. (including Abbeel, Malik, Wu) reframes world models’ safety role: rather than enforcing constraints at training time, use the world model as a safety-evaluation oracle. Specifically:
- Offline policy evaluation: before deploying a policy on hardware, evaluate it in the world model across thousands of randomly sampled or adversarially chosen scenarios
- OOD testing: generate states outside the training distribution (extreme sensor noise, unusual object positions) and check whether the policy fails
- Safety probing: systematically search for policy failure modes by treating the world model as a differentiable environment and running gradient-based adversarial attacks on the input space
This approach does not require any safety-specific training — it leverages any world model for post-hoc analysis. The limitation is that the world model itself must be accurate in the probe regions, which is not guaranteed for OOD inputs.
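The probing loop can be sketched on a toy problem. Everything here is a hypothetical stand-in: a 1-D "world model", a proportional policy, and random scenario sampling in place of the gradient-based adversarial search mentioned above (the toy model is not differentiable in any useful sense).

```python
import random


def toy_world_model(x, a):
    """Hypothetical stand-in for a learned dynamics model: 1-D position plus action."""
    return x + a


def policy(observation, goal=0.0, gain=0.5):
    """Hypothetical policy under test: proportional step toward the goal, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, gain * (goal - observation)))


def rollout_violates(x0, noise, limit=5.0):
    """Roll the policy out inside the model; report whether |x| ever exceeds the limit."""
    x = x0
    for step_noise in noise:
        x = toy_world_model(x, policy(x + step_noise))  # policy sees a noisy observation
        if abs(x) > limit:
            return True
    return False


def probe(n_scenarios, x0_range, noise_scale, horizon=20, seed=0):
    """Fraction of sampled scenarios in which the model predicts a constraint violation."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_scenarios):
        x0 = rng.uniform(*x0_range)
        noise = [rng.gauss(0.0, noise_scale) for _ in range(horizon)]
        failures += rollout_violates(x0, noise)
    return failures / n_scenarios


nominal_rate = probe(1000, (-4.0, 4.0), noise_scale=0.1)   # in-distribution sensor noise
ood_rate = probe(1000, (-4.0, 4.0), noise_scale=20.0)      # extreme sensor noise
```

Comparing the two rates is the point of the exercise: the policy never leaves the safe band under nominal noise but fails in a meaningful fraction of the extreme-noise scenarios, all without touching hardware. The caveat from the text still applies, since the predicted failure rate is only as trustworthy as the model in the probed region.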
The World Model as an Attack Surface: Parmar 2026
Parmar 2026 (arXiv 2604.01346) identifies a problem most safety papers ignore: the world model itself is vulnerable. If an adversary can corrupt the world model (via data poisoning during training or latent representation attacks at inference), the safety filters built on top of it are also compromised.
Three threat surfaces:
- Adversarial: data poisoning during pre-training, or latent representation attacks at inference, causing the world model to predict safe trajectories through unsafe states
- Alignment: reward hacking via model inaccuracies (the policy finds imagined high-reward paths that violate real-world constraints but look safe to the world model), deceptive alignment
- Human factors: automation bias (operators trusting world model predictions beyond their validity), planning hallucination (confident-but-wrong safe-path predictions)
Planning hallucination deserves particular attention: it is distinct from epistemic uncertainty (which RWM-U addresses). A world model can be confidently wrong — its ensemble may agree on an incorrect prediction if the training data was systematically biased. No current method fully addresses this.
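A toy numeric illustration of why disagreement-based uncertainty misses this case. All numbers are made up for the example: three ensemble members share the same systematic bias about an obstacle clearance, so they agree with each other while all being wrong.

```python
import statistics

true_clearance_m = 0.02                        # actual distance to the obstacle (hypothetical)
biased_predictions_m = [0.30, 0.30, 0.31]      # every member inherited the same data bias

disagreement = statistics.pstdev(biased_predictions_m)
error = abs(statistics.fmean(biased_predictions_m) - true_clearance_m)

uncertainty_threshold = 0.05                   # hypothetical flagging threshold
flagged_uncertain = disagreement > uncertainty_threshold
```

Here the disagreement is a few millimetres while the error is over a quarter of a metre, so an uncertainty penalty of the RWM-U kind would not fire.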
Parmar’s recommended mitigations — adversarial hardening, alignment engineering, NIST/EU AI Act compliance — are reasonable starting points but lack technical specificity. This remains an open area.
The Bigger Picture: Safety as a First-Class Concern in WAM
The field is transitioning from VLA (vision-language-action) to WAM (world-action model) architectures, as documented in the SF Robotics World Model Reading Club (2026-03). In this paradigm, the world model is not a policy component but the central learning substrate. Safety implications of this shift:
- More leverage: safety filters applied to the world model (Lagrangian constraints, uncertainty penalization) affect all downstream policies derived from it
- Higher stakes: a compromised world model propagates its errors into every policy it generates
- New failure modes: FastWAM’s insight — “training task ≠ inference task” — means a world model trained for general prediction may be over-confident in regions that were never specifically probed for safety
The convergence toward foundation-scale world models (Genie, DIAMOND, GR00T N2) raises the stakes further: a single world model serving many downstream policies becomes a high-value attack target.
Extended Scope: Implicit World Models and VLM-Based Safety Assessment
What Counts as a “World Model” for Safety?
The original survey focused on systems explicitly named “world models.” A broader framing — any system that generates a predicted 3D/4D representation of what will happen next, then uses that prediction for safety decisions — covers three additional research directions.
3D/4D Generative Models as Implicit World Models
Two vault-documented papers represent this direction:
Geometry-aware 4D Video Generation for Robot Manipulation (arXiv:2507.01099, ICLR 2026) supervises a video diffusion model with cross-view pointmap alignment, producing temporally and spatially consistent multi-view RGB-D sequences. The 4D prediction is not labeled a “world model” but functions as one: it predicts future scene states given a planned action, and a 6DoF pose tracker extracts end-effector trajectories from the generated video. Safety application: before executing an action, generate the predicted 4D future and check whether any predicted state violates constraints (collision, object drop, forbidden zone entry).
ParticleFormer (arXiv:2506.23126) uses a Transformer over 3D point clouds to predict particle-level dynamics for rigid, deformable, and granular materials, and is integrated with MPPI for downstream control. Safety application: the particle dynamics model explicitly predicts contact forces and object deformation, which are the physical quantities you need to check for safe manipulation (grip force, cloth tearing, granular spillage). Unlike video-based models, ParticleFormer’s predictions are physically interpretable.
The core pattern shared by both: generate a 3D/4D future, check it for safety violations, filter or modify the action before execution. This is structurally identical to how SafeDreamer uses a latent world model, but operating in 3D/4D explicit space rather than learned latent space.
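The "predict, check, then filter" pattern can be sketched as a single-step MPPI-style update in which predicted unsafe states receive a large cost. This is a toy 1-D illustration under stated assumptions, not ParticleFormer's controller: the dynamics, the unsafe-state test, and all parameter values are hypothetical.

```python
import math
import random


def mppi_step(x, goal, dynamics, unsafe, n_samples=256, sigma=0.5,
              temperature=0.1, unsafe_cost=1e3, seed=0):
    """One-step MPPI-style action selection with a safety cost.

    Samples candidate actions, predicts each next state with `dynamics`,
    and returns the cost-weighted average action (softmin over costs).
    """
    rng = random.Random(seed)
    actions = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
    costs = []
    for a in actions:
        predicted = dynamics(x, a)
        cost = (predicted - goal) ** 2          # task cost: distance to goal
        if unsafe(predicted):
            cost += unsafe_cost                 # safety cost on the predicted state
        costs.append(cost)
    best = min(costs)                           # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    return sum(w * a for w, a in zip(weights, actions)) / sum(weights)


# Toy setup: the goal lies at x = 2, but any predicted state beyond x = 1 is unsafe.
action = mppi_step(
    x=0.0,
    goal=2.0,
    dynamics=lambda x, a: x + a,
    unsafe=lambda predicted: predicted > 1.0,
)
```

Because candidates whose predicted state crosses the boundary get effectively zero weight, the returned action moves toward the goal but stays on the safe side of it.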
Trade-offs:
- 3D/4D explicit models (ParticleFormer, 4D video): physically interpretable, good for contact-rich manipulation, but expensive and scene-specific
- Latent world models (DreamerV3, RWM): cheaper, generalizable, but predictions are not directly interpretable as physical quantities
VLM as Safety Oracle: Semantic Risk Fields
Semantic-Metric Bayesian Risk Fields (arXiv:2512.08233, Stanford, December 2025) is the clearest example of VLM-for-safety. The architecture:
- A VLM receives the current scene image + object query → produces a semantic risk prior (“the knife is dangerous, the cutting board is not”)
- A learned ViT maps DINO features to pixel-aligned risk values, conditioned on the VLM prior
- Output: pixel-dense risk map, projectable into 3D for classical trajectory optimization
The VLM’s role is specifically the semantic layer: it understands that a knife is dangerous in a way that a pure metric safety filter (minimum distance to obstacles) cannot. The ViT grounds this semantic understanding spatially using human demonstration videos as training signal.
This pipeline can serve as a pre-action safety gate: before executing any planned action, render the predicted 3D/4D future state (from a world model or 4D generator), query the VLM risk field, and block or re-plan if predicted risk exceeds threshold.
The Circular Problem: VLMs Are Also Vulnerable
On the Vulnerability of LLM/VLM-Controlled Robotics (arXiv:2402.10340, 2024) shows that putting VLMs on the safety-critical path introduces new vulnerabilities: simple input perturbations (instruction rephrasing, image noise) reduce task success rates by 14–22%. When a VLM is used as a safety filter rather than a task executor, the implication is that a perturbed scene image could cause the VLM to misclassify a dangerous configuration as safe.
This creates a circular problem:
- World models are vulnerable (Parmar 2026: data poisoning, planning hallucination)
- VLMs used to check world model outputs are themselves vulnerable (Wu 2024: perceptual sensitivity)
No current method fully addresses both layers simultaneously.
Synthesis: The Pre-Execution Safety Pipeline
Combining these directions suggests a pre-execution safety pipeline:
Planned action
↓
3D/4D world model / generative simulator
(predict future scene state)
↓
VLM risk field query
(semantic + spatial safety check on predicted state)
↓
Accept / Modify / Reject action
This architecture appears in nascent form across separate papers but has not been built as an integrated system. The components exist (4D generators, VLM risk fields, world-model safe RL); the integration is the research gap.
Open challenges for this pipeline:
- 4D generation is too slow for real-time use (30s/10 steps in the 4D video paper)
- VLM risk assessment adds latency and inherits adversarial fragility
- The world model and VLM may disagree on what constitutes risk — no arbitration mechanism exists
- Formal guarantees are lost: the pipeline provides semantic plausibility, not provable safety
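The accept / modify / reject logic of this pipeline can be sketched as a small gate function. This is a hypothetical interface, since no integrated system exists: `predict_future` stands in for a 3D/4D world model, `risk_of` for a VLM risk-field query, and the threshold and toy stand-ins below are invented for illustration.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    MODIFY = "modify"
    REJECT = "reject"


def safety_gate(action, predict_future, risk_of, alternatives=(), threshold=0.5):
    """Check the predicted outcome of `action`; fall back to alternatives, else reject.

    Returns (decision, action_to_execute), where the action is None on rejection.
    """
    if risk_of(predict_future(action)) <= threshold:
        return Decision.ACCEPT, action
    for candidate in alternatives:
        if risk_of(predict_future(candidate)) <= threshold:
            return Decision.MODIFY, candidate
    return Decision.REJECT, None


# Toy stand-ins (hypothetical): the action is a reach distance toward a knife 0.5 m away.
def predict_future(action):
    return {"gripper_to_knife_m": 0.5 - action}


def risk_of(predicted_state):
    return 1.0 if predicted_state["gripper_to_knife_m"] < 0.1 else 0.0


accepted = safety_gate(0.2, predict_future, risk_of)
modified = safety_gate(0.45, predict_future, risk_of, alternatives=[0.3])
rejected = safety_gate(0.45, predict_future, risk_of, alternatives=[0.42])
```

The gate itself is trivial; the open challenges listed above all live inside the two callables, which is where latency, adversarial fragility, and prediction error enter.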
Extended Scope II: Semantic-Metric Risk Fields — Related Paper Cluster
Added 2026-05-14, based on follow-up survey session.
Architecture Lineage of Semantic-Metric Bayesian Risk Fields
The paper (arXiv:2512.08233, Stanford Schwager lab) sits at the intersection of four adjacent research lines:
VLM-as-cost architectural predecessor: VLMPC (arXiv:2407.09829, RSS 2024)
VLMPC is the clearest precursor: planned action → action-conditioned video prediction → VLM evaluates predicted video → select optimal action. The hierarchical cost combines pixel-level visual alignment and knowledge-level semantic evaluation. Semantic-Metric Risk Fields can be understood as VLMPC’s cost function specialized for danger rather than task completion, plus Bayesian spatial grounding via a ViT trained on human demonstration videos.
VLA safety alignment counterpart: SafeVLA (arXiv:2503.03480, NeurIPS 2025 Spotlight)
SafeVLA (PKU-Alignment, same group as SafeDreamer) applies CMDP constrained RL to VLA foundation models. While Semantic-Metric Risk Fields is an external oracle that evaluates robot trajectories, SafeVLA makes the VLA itself the safety reasoner. The ISA pipeline actively elicits unsafe VLA behaviors and constrains against them, with a reported 83.58% violation reduction and +3.85% task success.
Geometric complement: Dynamic Neural Potential Field / NPField-GPT (arXiv:2410.06819)
A Transformer predicts footprint-aware repulsive potentials (spatiotemporal collision risk) and injects them as differentiable constraints into sequential quadratic MPC. It is pure geometry with no semantics, but it is differentiable and integrable with existing MPC gradient optimization. It complements Semantic-Metric Risk Fields: geometric precision plus semantic context.
Simplified LLM variant: Semantic Risk-Aware Heuristic Planning (arXiv:2605.02862, 2025)
Uses LLM-generated cost functions to penalize geometrically cluttered or high-risk zones, injected into A* search with closed-loop replanning. Lighter-weight than the Bayesian formulation; suited for navigation but less applicable to contact-rich manipulation.
Attack surface mirror: LLM/VLM Vulnerability in Robotics (arXiv:2402.10340, 2024)
Simple input perturbations reduce task success by 14–22% in LLM/VLM-controlled robots. Directly relevant to any pipeline that puts VLMs on the safety-critical path: the VLM safety oracle is itself vulnerable to the same adversarial inputs as the system it is guarding.
Key observation: Splat-Nav (arXiv:2403.02751, Timothy Chen, Mac Schwager) and Semantic-Metric Risk Fields are from the same Stanford lab. Splat-Nav provides Gaussian Splatting safe navigation; Semantic-Metric Risk Fields provides VLM semantic risk. They have not been integrated — that integration is the research gap described below.
Extended Scope III: The 3D/4D Generator + VLM Safety Check Research Gap
Added 2026-05-14, based on follow-up survey session.
What Exists: Three “Half Combinations”
Half 1 — 3D/4D prediction + learned safety check (no VLM)
Self-Correcting Robot Manipulation via Gaussian-Splatted Foresight (AAAI 2025, Pan et al.)
The closest existing system. It uses 3D Gaussian Splatting to predict the future scene given a planned action, detects failure when the predicted future diverges from the real observation, and rolls back the action. It achieves +12% over SOTA on 10 RLBench tasks. Limitation: the safety check is a geometric/pixel deviation detector, not a VLM, so it cannot distinguish between “knife moving toward hand” and “cup tipping” as differently dangerous.
VLA-in-the-Loop (ICLR 2026 submission, Xu et al.)
Event-triggered composite world model: a discriminative evaluator detects high-stakes actions (e.g., gripper closing); a generative video model synthesizes a “successful future trajectory”; an inverse dynamics model decodes the correction. The discriminator is learned, not a VLM.
Self-Correcting VLA via Sparse World Imagination (arXiv:2602.21633, 2026)
Auxiliary predictive heads forecast task progress and future trajectory trends; an online refinement module adjusts trajectory orientation based on sparse predicted states. Reported results: 16% fewer steps, 9% higher success rate, and a 14% gain on real hardware. No VLM safety component.
Half 2 — Video prediction + VLM evaluation (for task quality, not safety)
VLMPC (RSS 2024, arXiv:2407.09829) — as described in Extended Scope II. The VLM evaluation pipeline exists; swapping the VLM query from task quality to safety is architecturally straightforward but has not been done.
Half 3 — VLM safety check on current state (not predicted future state)
Semantic-Metric Bayesian Risk Fields (arXiv:2512.08233) — risk field is computed from the current observation, not from a generated future state. The temporal gap is the missing piece.
The Research Gap
The combination — generate predicted 3D/4D future state → feed predicted future to VLM → query “is this dangerous?” → filter or re-plan — does not exist as a published system. The components are:
Current scene observation
↓
3D Gaussian Splatting / 4D neural simulator
(predict future scene given planned action)
↓
Feed predicted future scene to VLM risk field
(Bayesian prior from VLM semantics + spatial ViT)
Query: "is the predicted outcome safe?"
↓
Accept / modify / reject action before execution
Why this combination matters: Current 3D/4D prediction safety checks (Gaussian-Splatted Foresight, VLA-in-the-Loop) use learned detectors that cannot generalize to novel dangerous configurations they were not trained to recognize. A VLM safety oracle brings semantic generalization — it knows a knife near a human is dangerous even if that exact configuration was never in training data.
Why it hasn’t been done yet (identified blockers):
- Latency: 3D/4D generation is too slow for real-time use (30s/10 steps for 4D video; Gaussian Splatting rollout is faster but still costly)
- VLM input format mismatch: VLMs expect RGB images; rendered future Gaussian scenes may have distribution shift from training images
- Compounding error: if the 3D/4D generator is wrong about the predicted future, the VLM check is evaluating a hallucinated scene
- No ground-truth safety labels: unlike task completion, safety ground truth from future predictions requires real failure data
Nearest prior art to watch:
- Splat-Nav + Semantic-Metric Risk Fields integration (same Stanford lab, not yet combined)
- VLMPC’s knowledge-level cost specialized for safety
- SV-VLA’s (arXiv:2604.02965) open-loop planning + closed-loop verification pattern, with VLM as verifier
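Chinese Version (English Translation)
Research Question
Which methods use world models (learned predictive models of environment dynamics) to achieve safety in robotics? This survey covers safe reinforcement learning via constraint satisfaction in latent space, uncertainty-aware offline learning, simulation-based safety probing, and deployment-time risk analysis.
Knowledge Map
- World Models (RSSM / DreamerV3): all surveyed safe-RL methods build on the Dreamer family of latent-space world models
- Safe Reinforcement Learning (SafeRL): the dominant framing, which separates reward and cost signals and constrains cumulative cost below a threshold; Lagrangian relaxation and CPO are the standard baselines that world-model methods improve upon
- Lagrangian Constrained Optimization: SafeDreamer’s core mechanism; dual-variable optimization enforces cumulative-cost constraints during imagined rollouts
- Epistemic Uncertainty in Deep Learning: ensembles as a proxy for “how much does the model know about this state?”; prerequisite for RWM-U and MOPO-style offline RL
- Offline / Batch RL: policy optimization from a fixed dataset; distribution shift and compounding errors are the key failure modes that uncertainty penalization targets
- Model Predictive Control with Safety Filters: the classical robotics safety baseline; CBF safety filters and constrained NMPC are the non-ML approaches that world-model methods compete with and complement
Key Findings
1. Lagrangian constraints in latent space are the dominant safe-RL approach. SafeDreamer shows that imagining cost rollouts inside the world model and applying dual-variable optimization achieves near-zero constraint violations, which model-free SafeRL cannot reliably do, especially on vision-only tasks.
2. Uncertainty penalization is the primary safety mechanism for offline learning. RWM-U treats “being uncertain about a state” as a safety signal: the policy is penalized for entering transitions the world model does not understand. This creates a safety envelope shaped by data coverage, and it has been validated on real hardware.
3. World models enable simulation-based safety evaluation, not just safe exploration. The 2026 comprehensive survey frames the main safety value as offline policy evaluation and OOD probing: generate adversarial scenarios in the world model before touching hardware.
4. The world model itself is an underestimated attack surface. Parmar 2026 identifies a gap: most safety work constrains the policy, but the world model is equally vulnerable. Data poisoning, latent-space corruption, and planning hallucination (confident-but-wrong predictions) can bypass all policy-level safety filters.
5. There is no unified safety framework across approaches. Safe-RL methods (SafeDreamer), offline methods (RWM-U), and simulation-based safety probing target different threat models and make different assumptions.
Open Questions
- Can Lagrangian constraints in world models provide formal guarantees, or only empirical near-zero violations?
- How do world-model safety methods handle model inaccuracy? SafeDreamer assumes the world model is accurate enough; RWM-U uses uncertainty to flag inaccuracy, but neither fully solves model error as a safety threat.
- What happens when the world model hallucinates a safe path through an unsafe region? Parmar 2026 names this “planning hallucination,” but no solution exists yet.
- How do world-model safety methods scale to contact-rich manipulation with discontinuous dynamics?
- Multi-agent safety: all surveyed methods are single-agent.
Report Summary
Research on world models for robot safety falls into three main lines:
Line 1: constrained RL in latent space. SafeDreamer (ICLR 2024) achieves near-zero constraint violations by imagining future trajectories in DreamerV3’s latent space and applying a Lagrangian penalty. Core advantage: the world model’s differentiable simulator makes it possible to backpropagate the cost signal, which is infeasible in pixel space.
Line 2: uncertainty-aware offline learning. RWM-U (ETH Zurich, 2025) uses ensemble uncertainty as a safety signal, penalizing uncertain imagined transitions (MOPO-PPO). In essence it treats data coverage as the safety envelope, and it has been validated on real quadruped and humanoid hardware.
Line 3: simulation-based safety probing. The world model serves as a tool for offline policy evaluation and adversarial safety testing, identifying failure modes before touching real hardware.
The open problem cutting across all three lines is the safety of the world model itself. Planning hallucination (confident-but-wrong predictions), adversarial data poisoning, and latent-space corruption are currently unresolved threats, and they become more urgent with the move toward foundation-scale world models (GR00T N2, DIAMOND).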