Summary

TRO survey covering how energy-based models, diffusion models, action value maps, and GANs are used to learn robot behaviors from demonstration data. Addresses the core limitation of classical behavior cloning: averaging across multimodal demonstrations produces policies that don’t match any single human strategy. Generative models capture the full distribution of expert behaviors.

Key Points

  • Classical behavior cloning fails on multimodal demonstrations — it learns the average, which may be physically invalid
  • Diffusion models are the dominant current approach: high-quality multimodal outputs, but slow inference
  • Energy-based models: flexible distribution modeling, but training stability issues
  • GANs: fast sampling, but mode collapse limits reliability
  • Core application areas: grasp generation, trajectory generation, cost/reward learning
  • Open challenge: out-of-distribution generalization — models fail on states not in the demonstration dataset
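The averaging failure in the first bullet is easy to see in one dimension. A minimal sketch (a hypothetical obstacle-avoidance scenario, not from the survey): demonstrations split evenly between two valid modes, and mean-squared-error behavior cloning recovers their mean, an action no expert ever took.

```python
import numpy as np

# Two demonstration modes for the same state: steer left (-1.0) or
# right (+1.0) around an obstacle straight ahead. Experts split evenly.
demo_actions = np.array([-1.0, -1.0, -1.0, +1.0, +1.0, +1.0])

# MSE behavior cloning fits the conditional mean of the demonstrations...
bc_action = demo_actions.mean()

# ...which is 0.0: drive straight into the obstacle. The "ghost" policy
# matches neither mode and is the one physically invalid choice here.
print(bc_action)
```

A generative policy would instead sample from the two modes, returning -1.0 or +1.0 but never their average.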

Insights

  • The multimodality problem is fundamental and underappreciated in robot learning: when five different people demonstrate the same pick-and-place task, they use five different grasps, paths, and speeds. Averaging them produces a “ghost” policy that no one actually uses — and that often fails because it’s not mechanically valid
  • Diffusion models’ dominance in current LfD reflects a broader ML pattern: the same architecture that works well for image generation (stable multimodal distribution modeling) transfers to action generation once you discretize time appropriately
  • The diffusion model’s slow inference is the practical bottleneck for real-time control — this is exactly what Discrete Diffusion VLAs (Trend #1 in the ICLR 2026 survey) are solving with parallel generation
  • Cost/reward learning from demonstration is the connection to RL: if you can learn a reward function from demonstrations, you can then use RL to optimize it — this is the IRL (Inverse Reinforcement Learning) thread
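The EBM and diffusion points above share one sampling picture: start from noise and iteratively push samples toward high-density expert actions. A toy annealed-Langevin sketch of that idea, with a hand-coded two-mode energy standing in for a learned network (the energy, step sizes, and mode locations are all illustrative, not from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_energy(a):
    # Two-mode energy E(a) = min((a-1)^2, (a+1)^2):
    # expert actions cluster at -1 and +1, the mean 0 is a high-energy region.
    mode = 1.0 if a > 0 else -1.0
    return 2.0 * (a - mode)

def sample_action(steps=200, step_size=0.05):
    a = rng.normal()  # start from pure noise
    for t in range(steps):
        # Langevin update: gradient step on the energy plus annealed noise,
        # shrinking the noise to zero over the schedule (diffusion-style).
        noise_scale = np.sqrt(2.0 * step_size) * (1.0 - t / steps)
        a = a - step_size * grad_energy(a) + noise_scale * rng.normal()
    return a

samples = np.array([sample_action() for _ in range(20)])
# Samples concentrate near the modes at -1 and +1, never near the mean 0.
print(np.round(samples, 2))
```

The slow part is exactly this inner loop: one network evaluation per denoising step, which is the inference bottleneck the third insight refers to.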

Connections

Raw Excerpt

Classical methods have relied on models that don’t capture complex data distributions well or don’t scale to large datasets. Deep generative models address this by modeling the full distribution of expert behaviors.