LeWorldModel: Stable End-to-End JEPA World Model from Pixels

This article was generated by AI analysis.

Created: 2026-03-26 Source: https://arxiv.org/abs/2603.19312

Summary

LeWorldModel (LeWM, 2026) is the first Joint-Embedding Predictive Architecture (JEPA) that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a Gaussian regularizer (SIGReg). Prior JEPAs needed exponential moving averages, pre-trained encoders, or six-term losses to avoid representation collapse. LeWM has 15M parameters and trains on a single GPU in hours. It plans 48x faster than foundation-model-based approaches while achieving competitive control performance on 2D and 3D manipulation benchmarks.
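
The two-term objective described above can be written schematically as follows. This is a sketch, not the paper's exact formulation: the symbols $f_\theta$ (encoder), $g_\phi$ (predictor), $\lambda$ (regularizer weight), and the squared-error form of the prediction term are assumptions introduced here for illustration.

```latex
\mathcal{L}(\theta, \phi)
  = \underbrace{\frac{1}{T-1} \sum_{t=1}^{T-1}
      \left\| g_\phi(z_t, a_t) - z_{t+1} \right\|_2^2}_{\text{next-embedding prediction}}
  + \lambda \,
    \underbrace{\mathrm{SIGReg}\!\left(\{z_t\}_{t=1}^{T}\right)}_{\text{Gaussian regularizer}},
  \qquad z_t = f_\theta(o_t)
```

With no stop-gradient, gradients reach the encoder through both the predicted and the target embedding. The first term alone is minimized by a constant encoder, so the second term is what rules that solution out.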

Prerequisites

Joint-Embedding Predictive Architectures (JEPA): Yann LeCun’s proposed world-model architecture. The foundational point is why prediction happens in latent space rather than pixel space.
Representation collapse in self-supervised learning: why an encoder trained end-to-end without constraints can map every input to the same constant vector. This is the core problem the paper addresses.
Gaussian latent distributions (score-based / diffusion model intuitions help): SIGReg is related to normality testing, so it helps to understand what “Gaussian-distributed embeddings” means and why enforcing it prevents collapse.
Model Predictive Control (MPC): the planning algorithm used at inference, including how latent-space optimization over action sequences works.
Vision Transformers (ViT): the encoder backbone, including patch tokenization and [CLS] token usage.
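
To make the MPC prerequisite concrete, here is a minimal sketch of planning over action sequences in latent space. It uses the cross-entropy method (CEM), which is one common optimizer for this; these notes do not say which optimizer LeWM uses. The predictor `toy_predict`, the squared-distance goal cost, and all hyperparameters are hypothetical stand-ins, not the paper's components.

```python
import numpy as np

def rollout_cost(predict, z0, z_goal, actions):
    """Roll the latent predictor forward and score the final latent against the goal."""
    z = z0
    for a in actions:
        z = predict(z, a)
    return float(np.sum((z - z_goal) ** 2))

def cem_plan(predict, z0, z_goal, horizon=5, action_dim=2,
             pop=128, elites=16, iters=5, seed=0):
    """Cross-entropy method: sample action sequences, refit a Gaussian to the best ones."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        cand = mean + std * rng.standard_normal((pop, horizon, action_dim))
        costs = np.array([rollout_cost(predict, z0, z_goal, c) for c in cand])
        best = cand[np.argsort(costs)[:elites]]
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-6
    return mean  # MPC would execute mean[0], observe, re-encode, and replan

def toy_predict(z, a):
    """Hypothetical latent dynamics used only for this illustration."""
    return z + 0.5 * a

z0 = np.zeros(2)                  # stands in for the encoded current observation
z_goal = np.array([1.0, -1.0])    # stands in for the encoded goal observation
plan = cem_plan(toy_predict, z0, z_goal)
print(rollout_cost(toy_predict, z0, z_goal, plan))
```

Every candidate is scored with rollouts of the predictor alone, with no decoding back to pixels. That is why the number of latent tokens per step bears directly on planning speed.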

Core Idea

Representation collapse in JEPAs is treated as a distribution alignment problem: the encoder collapses because nothing forces it to keep its representations informative and diverse. SIGReg (Sketched-Isotropic-Gaussian Regularizer) addresses this statistically. It pushes the latent distribution toward an isotropic Gaussian by applying the Epps-Pulley normality test to one-dimensional random projections of the embeddings. This is a formal criterion rather than a heuristic: a collapsed embedding has zero variance along every projection, so it cannot satisfy a test against a Gaussian. That lets LeWM drop the architectural machinery earlier work relied on (EMA, stop-gradient, pre-trained features).
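
The idea can be sketched in a few lines of NumPy. The version below projects embeddings onto random unit directions and evaluates an Epps-Pulley-type statistic against N(0, 1) on each projection, using the closed form that results from a standard-normal weight on the characteristic-function distance. The function names, the number of projections, the choice of weight, and the final normalization are assumptions for illustration; the paper's implementation may differ (for example, by using numerical quadrature).

```python
import numpy as np

def epps_pulley_1d(x):
    """n times the weighted L2 distance between the empirical characteristic
    function of x and that of N(0, 1), with a standard-normal weight."""
    n = x.shape[0]
    diff = x[:, None] - x[None, :]
    pair_term = np.exp(-0.5 * diff ** 2).sum() / n
    cross_term = np.sqrt(2.0) * np.exp(-0.25 * x ** 2).sum()
    return pair_term - cross_term + n / np.sqrt(3.0)

def sigreg_sketch(z, num_projections=64, rng=None):
    """Average the 1D statistic over random unit directions, divided by batch size."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = z.shape
    dirs = rng.standard_normal((d, num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit-norm directions
    proj = z @ dirs                                       # shape (n, num_projections)
    stats = [epps_pulley_1d(proj[:, k]) for k in range(num_projections)]
    return float(np.mean(stats)) / n

healthy = np.random.default_rng(1).standard_normal((256, 16))
collapsed = np.zeros((256, 16))   # every input mapped to the same vector
print(sigreg_sketch(healthy))     # close to zero
print(sigreg_sketch(collapsed))   # analytically 1 - sqrt(2) + 1/sqrt(3), about 0.163
```

The projections are deliberately not standardized before testing. Comparing against N(0, 1) directly is what makes a constant embedding score badly, whereas a standardized test would be undefined at zero variance.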

Results

Task	LeWM result	Best baseline	Notes
Push-T (2D manipulation)	Higher	PLDM	+18% success rate
OGBench-Cube (3D manipulation)	Competitive	DINO-WM	Similar performance
Two-Room (navigation)	Lower	PLDM, DINO-WM	Simple low-dimensional environment
Planning speed	48x faster	DINO-WM	200x fewer latent tokens
Training cost	Single GPU, hours	DINO-WM	Baseline needs multi-GPU, days

Physical reasoning: LeWM assigns higher surprise to teleporting objects than to purely visual perturbations, which suggests the latent space internalizes physical causality without explicit supervision.
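
One plausible way to compute such a surprise score is the prediction error in latent space. The sketch below assumes that definition; these notes do not give the paper's exact metric. The predictor and the latent vectors are hand-picked toy values chosen only to show the computation, not measured results.

```python
import numpy as np

def surprise(predict, z_t, a_t, z_next):
    """Squared distance between the predicted and the observed next embedding."""
    return float(np.sum((predict(z_t, a_t) - z_next) ** 2))

def toy_predict(z, a):
    """Hypothetical latent dynamics used only for this illustration."""
    return z + a

z_t = np.array([0.0, 0.0])
a_t = np.array([1.0, 0.0])

# Hand-picked latents: a small shift stands in for a visual perturbation,
# a large jump stands in for an object teleporting between frames.
z_perturbed = np.array([1.1, 0.0])
z_teleport = np.array([3.0, 2.0])

s_perturb = surprise(toy_predict, z_t, a_t, z_perturbed)
s_teleport = surprise(toy_predict, z_t, a_t, z_teleport)
print(s_perturb, s_teleport)
```

Under this definition, the reported result is a statement about the encoder: physically implausible events land far from the predicted embedding, while appearance changes land close to it.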

Limitations

Author-stated: planning horizons are short; the current MPC does not scale to long-horizon tasks
Author-stated: relies on offline datasets with sufficient state coverage; fails in data-sparse regions
Unstated: the Two-Room underperformance may indicate that SIGReg’s Gaussian prior fits low-intrinsic-dimensionality environments poorly, i.e. the regularizer may be too strong there

Reproducibility

Code: availability is not confirmed in these notes; check the arXiv page for a repository link
Datasets: Push-T, OGBench-Cube, Two-Room — all standard benchmarks
Compute: single GPU (A100 class likely); hours of training

Insights

SIGReg’s key insight: instead of preventing collapse through architectural tricks (EMA teacher networks, stop-gradients), enforce a statistical property on the output distribution. That makes the fix principled rather than empirical. Beyond the engineering, the physical-understanding results (violation-of-expectation tests) are the most compelling contribution: they suggest that Gaussian-regularized latent prediction can discover causal structure on its own, which is the core aim of LeCun’s JEPA research program.

Connections

Raw Excerpt

“The first JEPA that trains stably end-to-end from raw pixels using only two loss terms… No stop-gradient, exponential moving averages, or pre-trained representations. LeWM plans 48x faster than foundation-model alternatives while maintaining competitive control performance.”
