Summary

RWM-U (ETH Zurich, 2025) extends autoregressive robotic world models with ensemble-based epistemic uncertainty estimation. The key safety insight: policy optimization (MOPO-PPO) penalizes imagined transitions where the world model is uncertain, so the policy stays away from regions of the state space that the offline dataset does not cover. Those regions are also the untested, potentially dangerous configurations. The method is demonstrated on real quadruped (ANYmal) and humanoid hardware for manipulation and locomotion tasks.

Key Points

  • RWM architecture: autoregressive world model that predicts the next state from the action and a history of past states, feeding its own predictions back in during multi-step rollouts
  • Uncertainty via ensembles: train N world models; disagreement between predictions = epistemic uncertainty signal
  • MOPO-PPO: adapts the Model-based Offline Policy Optimization framework to PPO; penalizes reward with uncertainty estimate during imagined rollouts
  • Real-robot deployment: unlike most offline RL papers, RWM-U is actually deployed on ANYmal quadruped and humanoid hardware
  • Key problem solved: compounding errors in long-horizon rollouts. The uncertainty estimate flags predictions in regions the model has never seen, so the policy is not rewarded for exploiting them
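
The ensemble-disagreement and reward-penalty ideas in the points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the ensemble members are toy linear maps, the uncertainty is taken to be the variance across member predictions, and the names `ensemble_predict`, `epistemic_uncertainty`, `penalized_reward`, `imagined_rollout`, and the weight `lam` are all hypothetical. Only the general MOPO form (reward minus a weighted uncertainty term) comes from the text.

```python
import numpy as np

def ensemble_predict(models, state, action):
    """Stack next-state predictions from every member: shape (N, state_dim)."""
    x = np.concatenate([state, action])
    return np.stack([W @ x for W in models])

def epistemic_uncertainty(preds):
    """Disagreement signal: variance across members, averaged over state dims.
    (Assumed estimator; the paper's exact choice may differ.)"""
    return float(preds.var(axis=0).mean())

def penalized_reward(reward, uncertainty, lam=1.0):
    """MOPO-style penalty: subtract lam * uncertainty from the model reward."""
    return reward - lam * uncertainty

def imagined_rollout(models, policy, reward_fn, state, horizon, lam=1.0):
    """Roll out autoregressively on the ensemble mean, collecting the
    penalized rewards that a PPO-style learner would train on."""
    rewards, uncertainties = [], []
    for _ in range(horizon):
        action = policy(state)
        preds = ensemble_predict(models, state, action)
        u = epistemic_uncertainty(preds)
        next_state = preds.mean(axis=0)
        r = reward_fn(state, action, next_state)
        rewards.append(penalized_reward(r, u, lam))
        uncertainties.append(u)
        state = next_state  # feed the prediction back in
    return np.array(rewards), np.array(uncertainties)

# Toy setup: members share a base model plus small random perturbations.
rng = np.random.default_rng(0)
state_dim, action_dim, n_members = 3, 2, 5
base = 0.1 * rng.standard_normal((state_dim, state_dim + action_dim))
models = [base + 0.05 * rng.standard_normal(base.shape) for _ in range(n_members)]

policy = lambda s: np.zeros(action_dim)   # placeholder policy
reward_fn = lambda s, a, s_next: 1.0      # placeholder constant reward

rewards, uncertainties = imagined_rollout(
    models, policy, reward_fn, np.ones(state_dim), horizon=4, lam=2.0
)
```

In this toy setup the members disagree more as the input grows in magnitude, which stands in for leaving the region the data covers. A real ensemble would be trained on the offline dataset, and its disagreement would reflect actual data coverage.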

Insights

The uncertainty penalty acts like a data-driven safety barrier: the offline dataset defines the “known safe” states, and the uncertainty signal discourages the policy from venturing beyond them. This is weaker than formal constraint satisfaction, since it gives no hard guarantees, but it is far more practical for real hardware deployment.
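
The difference between the two notions of safety can be written out. The symbols here are introduced for illustration and are not taken from the paper: $u$ is the uncertainty estimate, $\lambda$ the penalty weight, and $\mathcal{S}_{\text{safe}}$ a certified safe set.

```latex
% Soft penalty (MOPO-style): uncertainty lowers the reward
% but never forbids a transition outright.
\tilde{r}(s_t, a_t) = r(s_t, a_t) - \lambda\, u(s_t, a_t), \qquad \lambda > 0

% Hard constraint (formal safety): every visited state must
% lie inside a certified safe set.
s_t \in \mathcal{S}_{\text{safe}} \quad \forall t
```

With a finite $\lambda$, a large enough reward can still outweigh the penalty, which is why the penalized objective alone gives no guarantee that the policy stays within the covered region.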

The ETH Zurich group (Marco Hutter’s lab) is notable for consistently deploying learned policies on actual legged robots, and the paper’s real-hardware results make it especially credible.

Connections