Summary

LeWorldModel (LeWM, 2026) is the first Joint-Embedding Predictive Architecture (JEPA) that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a Gaussian regularizer (SIGReg). Prior JEPAs required exponential moving averages, pre-trained encoders, or six-term losses to avoid representation collapse. LeWM has 15M parameters and trains on a single GPU in hours. It plans 48x faster than foundation-model-based approaches while achieving competitive control performance on 2D and 3D manipulation benchmarks.

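LeWorldModel is the first JEPA that trains stably end-to-end from raw pixels, using only two loss functions: a next-step embedding prediction loss and the SIGReg Gaussian regularizer. Earlier JEPAs needed exponential moving averages, pre-trained encoders, or six-term losses. 15M parameters, a single GPU, and planning 48x faster than foundation-model approaches.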

Prerequisites

  • Joint Embedding Predictive Architectures (JEPA) — Yann LeCun’s proposed world model architecture; understanding why predicting in latent space (not pixel space) is the design choice is foundational
  • Representation collapse in self-supervised learning — why training an encoder end-to-end without constraints causes all representations to collapse to a constant; the core problem this paper solves
  • Score-based / diffusion model intuitions — SIGReg is related to normality testing; understanding what “Gaussian-distributed embeddings” means and why it prevents collapse helps
  • Model Predictive Control (MPC) — the planning algorithm used at inference; how latent-space optimization over action sequences works
  • Vision Transformers (ViT) — the encoder backbone; patch tokenization and [CLS] token usage
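
For the MPC prerequisite, the following is a minimal sketch of planning in latent space, not the paper's planner. It assumes random-shooting optimization (the source does not say which optimizer LeWM uses), a squared-distance cost to a goal embedding, and a toy predictor `toy_predict` that stands in for a learned latent dynamics model. All function and parameter names are hypothetical.

```python
import numpy as np

def plan_random_shooting(z0, z_goal, predict, horizon=5,
                         num_candidates=256, action_dim=2, seed=0):
    """Sample action sequences, roll them out in latent space, and
    return the first action of the lowest-cost sequence and its cost."""
    rng = np.random.default_rng(seed)
    # Candidate action sequences, uniform in [-1, 1] (assumed action range).
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    for t in range(horizon):
        z = predict(z, actions[:, t])          # one latent prediction step
    cost = np.sum((z - z_goal) ** 2, axis=1)   # distance to the goal embedding
    best = int(np.argmin(cost))
    return actions[best, 0], float(cost[best])

def toy_predict(z, a):
    """Hypothetical stand-in for the learned predictor: z moves by 0.5 * a."""
    return z + 0.5 * a

z_start = np.zeros(2)
z_goal = np.array([1.0, -1.0])
first_action, best_cost = plan_random_shooting(z_start, z_goal, toy_predict)
```

In a receding-horizon loop, only `first_action` would be executed; the new observation is then re-encoded and the plan recomputed from the new latent state.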

Core Idea

Representation collapse in JEPAs is treated as a distribution-alignment problem: the encoder collapses because nothing forces it to keep its representations informative and diverse. SIGReg (Sketched-Isotropic-Gaussian Regularizer) addresses this statistically. It pushes the latent distribution toward a Gaussian by applying the Epps-Pulley normality test to one-dimensional random projections of the embeddings. The guarantee is statistical rather than heuristic: embeddings that have collapsed to a constant, or to any other degenerate distribution, fail the test, so satisfying the test rules out collapse. This lets LeWM drop the machinery that prior works needed (EMA, stop-gradient, pre-trained features).
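
The idea can be illustrated with a small NumPy sketch. It is an assumption-laden illustration, not the authors' implementation: the Epps-Pulley statistic is approximated by a Riemann sum over a fixed grid, the Gaussian weight function, grid range, number of projections, the weight `lam`, and the choice to regularize the target embeddings are all assumptions, and the sample-size scaling of the classical test is omitted. A training version would need an autodiff framework rather than NumPy.

```python
import numpy as np

def epps_pulley_1d(x, t_grid):
    """Weighted squared distance between the empirical characteristic
    function of x and the characteristic function of N(0, 1)."""
    ecf = np.exp(1j * np.outer(t_grid, x)).mean(axis=1)  # empirical CF
    target = np.exp(-0.5 * t_grid ** 2)                  # CF of N(0, 1)
    weight = np.exp(-0.5 * t_grid ** 2)                  # assumed weight function
    dt = t_grid[1] - t_grid[0]
    return float(np.sum(np.abs(ecf - target) ** 2 * weight) * dt)

def sigreg_sketch(z, num_projections=64, seed=0):
    """Average the 1D statistic over random unit-norm projections of z."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((z.shape[1], num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = z @ dirs                                      # (n, num_projections)
    t_grid = np.linspace(-5.0, 5.0, 101)
    stats = [epps_pulley_1d(proj[:, k], t_grid) for k in range(num_projections)]
    return float(np.mean(stats))

def two_term_loss(z_pred, z_next, lam=1.0):
    """Prediction loss plus lam * regularizer (lam is a hypothetical weight)."""
    pred = float(np.mean(np.sum((z_pred - z_next) ** 2, axis=1)))
    return pred + lam * sigreg_sketch(z_next)

rng = np.random.default_rng(1)
z_gauss = rng.standard_normal((2000, 16))   # well-spread embeddings
z_collapsed = np.zeros((2000, 16))          # fully collapsed embeddings
score_gauss = sigreg_sketch(z_gauss)
score_collapsed = sigreg_sketch(z_collapsed)
```

Under these assumptions, `score_collapsed` comes out far larger than `score_gauss`, which is the behaviour the regularizer relies on: a collapsed batch has a characteristic function of constant magnitude 1, which cannot match the Gaussian target.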

Results

Task                           | LeWM              | Best Baseline   | Notes
-------------------------------|-------------------|-----------------|---------------------------
Push-T (2D manipulation)       | Higher            | PLDM            | +18% success rate
OGBench-Cube (3D manipulation) | Competitive       | DINO-WM         | Similar performance
Two-Room (navigation)          | Lower             | PLDM, DINO-WM   | Simple low-dim environment
Planning speed                 | 48x faster        | DINO-WM         | 200x fewer latent tokens
Training                       | Single GPU, hours | Multi-GPU, days | vs. DINO-WM

Physical reasoning: LeWM assigns higher surprise to teleporting objects than to purely visual perturbations, which suggests that the latent space internalizes physical causality without explicit supervision.
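
One way to read "surprise" here is as latent prediction error. The sketch below assumes that definition (the source does not spell out the exact measure) and uses a hand-written toy trajectory and a hypothetical constant-velocity predictor in place of the trained model.

```python
import numpy as np

def surprise(z_pred, z_next):
    """Squared latent prediction error for each transition (assumed measure)."""
    return np.sum((z_pred - z_next) ** 2, axis=-1)

def toy_predict(z_t):
    """Hypothetical predictor: the latent drifts by +0.1 along the first axis."""
    return z_t + np.array([0.1, 0.0])

# Toy latent trajectory that follows the predictor exactly.
steps = np.arange(6, dtype=float)
z = np.stack([0.1 * steps, np.zeros(6)], axis=1)      # shape (6, 2)

# Same trajectory with a "teleport": a jump of 2.0 from step 4 onward.
z_teleport = z.copy()
z_teleport[4:] += np.array([2.0, 0.0])

s_normal = surprise(toy_predict(z[:-1]), z[1:])
s_teleport = surprise(toy_predict(z_teleport[:-1]), z_teleport[1:])
```

In this toy setup the surprise stays near zero along the ordinary trajectory and spikes only at the transition into the jump, which is the pattern a violation-of-expectation test looks for.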

Limitations

  • Author-stated: short planning horizons — current MPC does not scale to long-horizon tasks
  • Author-stated: relies on offline datasets with sufficient state coverage; fails in data-sparse regions
  • Unstated: Two-Room underperformance suggests SIGReg’s Gaussian prior is a poor fit for low-intrinsic-dimensionality environments — the regularizer may be too strong

Reproducibility

  • Code: likely available (paper from 2026 with open-source framing); check arXiv for repository link
  • Datasets: Push-T, OGBench-Cube, Two-Room — all standard benchmarks
  • Compute: single GPU (A100 class likely); hours of training

Insights

SIGReg’s key insight: instead of preventing collapse through architectural tricks (EMA teacher networks, stop-gradients), enforce a statistical property on the output distribution. This is a principled solution rather than an empirical one. The physical understanding results (violation-of-expectation tests) are the most compelling contribution beyond the engineering — they suggest that Gaussian-regularized latent prediction naturally discovers causal structure, which is the core of LeCun’s JEPA research program.

Connections

Raw Excerpt

“The first JEPA that trains stably end-to-end from raw pixels using only two loss terms… No stop-gradient, exponential moving averages, or pre-trained representations. LeWM plans 48x faster than foundation-model alternatives while maintaining competitive control performance.”