This note was generated by AI analysis.
Created: 2026-03-26 Source: https://arxiv.org/abs/2603.19312
Summary
LeWorldModel (LeWM, 2026) is described by its authors as the first Joint-Embedding Predictive Architecture (JEPA) that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a Gaussian regularizer (SIGReg). Prior JEPAs relied on stop-gradients, exponential moving averages, pre-trained encoders, or six-term losses to avoid representation collapse. The model has 15M parameters and trains on a single GPU in hours. It plans 48x faster than foundation-model-based approaches while achieving competitive control performance on 2D and 3D manipulation benchmarks.
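The two-term objective described above can be written compactly. The notation is an assumption of this note, not taken from the paper: $f_\theta$ is the encoder, $g_\phi$ the predictor, $o_t$ and $a_t$ the observation and action at step $t$, and $\lambda$ a hypothetical weight on the regularizer. A squared-error prediction term and a simple weighted sum are also assumed.

```latex
z_t = f_\theta(o_t), \qquad \hat{z}_{t+1} = g_\phi(z_t, a_t)

\mathcal{L}(\theta, \phi)
  = \underbrace{\lVert \hat{z}_{t+1} - z_{t+1} \rVert_2^2}_{\text{next-embedding prediction}}
  + \lambda \, \underbrace{\mathrm{SIGReg}\bigl(\{z_t\}\bigr)}_{\text{Gaussian regularizer}}
```

Under this reading, gradients flow through both $\hat{z}_{t+1}$ and $z_{t+1}$, since the summary states that no stop-gradient or EMA target is used.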
Prerequisites
- Joint Embedding Predictive Architectures (JEPA) — Yann LeCun’s proposed world model architecture; understanding why predicting in latent space (not pixel space) is the design choice is foundational
- Representation collapse in self-supervised learning — why training an encoder end-to-end without constraints causes all representations to collapse to a constant; the core problem this paper solves
- Score-based / diffusion model intuitions — SIGReg is related to normality testing; understanding what “Gaussian-distributed embeddings” means and why it prevents collapse helps
- Model Predictive Control (MPC) — the planning algorithm used at inference; how latent-space optimization over action sequences works
- Vision Transformers (ViT) — the encoder backbone; patch tokenization and [CLS] token usage
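As a refresher on the MPC prerequisite, the sketch below plans in a toy latent space by random shooting: sample candidate action sequences, roll each one out through a predictor, keep the cheapest, execute only its first action, and replan. Everything here is hypothetical: `predict` is a stand-in for a learned latent predictor, the cost is squared distance to a goal embedding, and random shooting is swapped in for simplicity. These notes do not say which optimizer the paper uses.

```python
import random

def predict(z, a):
    """Hypothetical stand-in for a learned latent predictor: z_next = z + a."""
    return [zi + ai for zi, ai in zip(z, a)]

def rollout_cost(z0, actions, z_goal):
    """Roll the action sequence out in latent space; cost is the squared
    distance between the final latent and the goal latent."""
    z = z0
    for a in actions:
        z = predict(z, a)
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def plan(z0, z_goal, horizon=5, num_candidates=256, seed=0):
    """Random shooting: sample action sequences and keep the cheapest one."""
    rng = random.Random(seed)
    dim = len(z0)
    best_cost, best_actions = float("inf"), None
    for _ in range(num_candidates):
        actions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                   for _ in range(horizon)]
        cost = rollout_cost(z0, actions, z_goal)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost

def mpc(z0, z_goal, steps=3):
    """Receding horizon: execute only the first planned action, then replan."""
    z = z0
    for t in range(steps):
        actions, _ = plan(z, z_goal, seed=t)
        z = predict(z, actions[0])
    return z

z0, goal = [0.0, 0.0], [2.0, -1.0]
actions, best_cost = plan(z0, goal)
idle_cost = rollout_cost(z0, [[0.0, 0.0]] * 5, goal)  # cost of doing nothing
z_final = mpc(z0, goal)
```

Because each rollout only touches small latent vectors rather than images, the cost of planning scales with the size of the latent state, which is the intuition behind the planning-speed comparison later in this note.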
Core Idea
Representation collapse in JEPAs is framed as a distribution alignment problem: the encoder collapses because nothing forces it to maintain informative, diverse representations. SIGReg (Sketched-Isotropic-Gaussian Regularizer) addresses this statistically. It pushes the latent distribution toward an isotropic Gaussian by applying the Epps-Pulley normality test to one-dimensional random projections of the embeddings and penalizing the resulting test statistic. The argument is statistical rather than heuristic: a collapsed encoder maps every input to nearly the same point, so each projection has near-zero variance and fails the test, which means embeddings that pass it cannot be degenerate. This removes the need for the architectural machinery (EMA, stop-gradient, pre-trained features) that prior work depended on.
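A minimal sketch of the idea, in plain Python: project the embeddings onto random unit directions and score each one-dimensional projection with an Epps-Pulley-style statistic, the weighted squared distance between the empirical characteristic function and that of N(0, 1). Several details are assumptions of this sketch rather than the paper's implementation: the weight function is taken to be the N(0, 1) density (which gives the closed form used below), the scores are averaged over projections, and the names `epps_pulley`, `sigreg`, and `num_projections` are hypothetical.

```python
import math
import random

def epps_pulley(xs):
    """Integral of |ecf(t) - exp(-t^2/2)|^2 against an N(0, 1) density weight.
    With that weight the integral has the closed form
      (1/n^2) sum_jk exp(-(x_j - x_k)^2 / 2)
      - (sqrt(2)/n) sum_j exp(-x_j^2 / 4) + 1/sqrt(3).
    The data are NOT standardized, so zero variance is penalized."""
    n = len(xs)
    pair = sum(math.exp(-0.5 * (a - b) ** 2) for a in xs for b in xs) / n**2
    cross = math.sqrt(2.0) * sum(math.exp(-0.25 * a * a) for a in xs) / n
    return pair - cross + 1.0 / math.sqrt(3.0)

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sigreg(embeddings, num_projections=16, seed=0):
    """Average the 1D statistic over random projections of the embeddings."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_projections):
        u = random_unit_vector(dim, rng)
        proj = [sum(ui * zi for ui, zi in zip(u, z)) for z in embeddings]
        total += epps_pulley(proj)
    return total / num_projections

rng = random.Random(1)
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
collapsed = [[0.5] * 8 for _ in range(200)]  # every input maps to one point

g_score = sigreg(gaussian)   # close to zero for Gaussian embeddings
c_score = sigreg(collapsed)  # clearly larger for collapsed embeddings
print(g_score, c_score)
```

For a fully collapsed batch every projection is a constant c, and the statistic reduces to 1 - sqrt(2) * exp(-c^2 / 4) + 1/sqrt(3), which is at least about 0.16 for any c. That lower bound is why minimizing the regularizer rules out collapse in this toy setting. A training implementation would compute the same quantity with a differentiable array library so that gradients reach the encoder.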
Results
| Task | LeWM | Best Baseline | Notes |
|---|---|---|---|
| Push-T (2D manipulation) | Higher | PLDM | +18% success rate |
| OGBench-Cube (3D manipulation) | Competitive | DINO-WM | Similar performance |
| Two-Room (navigation) | Lower | PLDM, DINO-WM | Simple low-dim environment |
| Planning speed | 48x faster | DINO-WM | 200x fewer latent tokens |
| Training | Single GPU, hours | Multi-GPU, days | vs. DINO-WM |
Physical reasoning: LeWM assigns higher surprise to physically impossible events (teleporting objects) than to purely visual perturbations, which suggests that the latent space captures some physical regularities without explicit supervision.
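One way to read "surprise" in a violation-of-expectation test is as the error between the predicted and the observed next embedding. The sketch below assumes that definition, which these notes do not spell out, and uses a hypothetical toy predictor with hand-picked latent values purely for illustration. None of the numbers come from the paper.

```python
def predict(z, a):
    """Hypothetical latent predictor: the object moves by the action."""
    return [zi + ai for zi, ai in zip(z, a)]

def surprise(z_t, a_t, z_next):
    """Squared error between the predicted and observed next embedding."""
    z_hat = predict(z_t, a_t)
    return sum((p - o) ** 2 for p, o in zip(z_hat, z_next))

z_t, a_t = [0.0, 0.0], [0.1, 0.0]

# Toy observations (made up): one stays near the prediction, one jumps away.
plausible = [0.1, 0.01]
teleport = [0.9, -0.7]

s_plausible = surprise(z_t, a_t, plausible)
s_teleport = surprise(z_t, a_t, teleport)
```

In this toy setup the teleporting observation scores far higher than the plausible one, which mirrors the qualitative pattern reported above.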
Limitations
- Author-stated: short planning horizons — current MPC does not scale to long-horizon tasks
- Author-stated: relies on offline datasets with sufficient state coverage; fails in data-sparse regions
- Not stated by the authors: the Two-Room underperformance may indicate that SIGReg’s Gaussian prior fits low-intrinsic-dimensionality environments poorly, possibly because the regularizer is too strong there
Reproducibility
- Code: availability not verified for this note; check the arXiv page for a repository link
- Datasets: Push-T, OGBench-Cube, Two-Room — all standard benchmarks
- Compute: single GPU (A100 class likely); hours of training
Insights
SIGReg’s key insight: instead of preventing collapse through architectural tricks (EMA teacher networks, stop-gradients), enforce a statistical property on the output distribution. This is a principled solution rather than an empirical one. The physical understanding results (violation-of-expectation tests) are the most compelling contribution beyond the engineering — they suggest that Gaussian-regularized latent prediction naturally discovers causal structure, which is the core of LeCun’s JEPA research program.
Connections
- JEPA (Joint Embedding Predictive Architecture)
- Yann LeCun’s world model research program
- DreamerV3 (task-specific world model)
- DINO-WM (foundation-model-based world model)
- Model Predictive Control
- representation collapse in self-supervised learning
Raw Excerpt
“The first JEPA that trains stably end-to-end from raw pixels using only two loss terms… No stop-gradient, exponential moving averages, or pre-trained representations. LeWM plans 48x faster than foundation-model alternatives while maintaining competitive control performance.”