This note was generated by AI analysis.
Created: 2026-04-05 Source: https://arxiv.org/html/2604.02029
Summary
This survey positions continuous latent space as an emerging alternative computational substrate to token-level generation in language models, arguing that explicit verbal reasoning introduces unnecessary redundancy, discretization loss, and sequential bottlenecks. The authors (37+ contributors) organize the field along five dimensions (Foundation, Evolution, Mechanism, Ability, Outlook) and propose a two-dimensional Mechanism × Ability taxonomy spanning seven ability domains, including reasoning, planning, and embodiment. Coverage runs from COCONUT-era prototypes through current unified frameworks.
Prerequisites
- Autoregressive language models — the baseline system this survey critiques; understanding token-level generation is necessary to appreciate why latent computation is proposed as superior.
- Representation learning / embeddings — latent space is built on continuous vector representations; readers need to understand how hidden states encode semantic content.
- Chain-of-thought reasoning — the explicit-space counterpart being compared; the survey’s argument hinges on contrasting CoT’s verbalization overhead with latent computation’s efficiency.
- World models / model-based RL — relevant for the Modeling and Embodiment ability domains, where latent representations serve as compact environment simulators.
Core Idea
The central claim is that forcing language models to reason entirely through human-readable tokens is architecturally wasteful: tokens impose discretization loss, require sequential decoding, and introduce linguistic redundancy unrelated to the underlying computational task. Latent space — continuous, high-dimensional, machine-native — allows models to perform internal computation without these constraints. The survey’s contribution is not a new method but a unifying framework: by decomposing the field into four mechanism types (Architecture, Representation, Computation, Optimization) crossed with seven ability domains (Reasoning, Planning, Modeling, Perception, Memory, Collaboration, Embodiment), it gives researchers a shared vocabulary for a previously fragmented literature. The four-stage evolution timeline (Prototype → Formation → Expansion → Outbreak) further contextualizes where individual works sit in the broader trajectory.
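The discretization-loss argument can be made concrete with a toy sketch. This is only an illustration under stated assumptions, not code from the survey or from any cited system: the three-word vocabulary, the 2-D embeddings, and the helper names `token_step`, `latent_step`, and `discretization_error` are all hypothetical. It contrasts feeding back a re-embedded token (the token-level path) with feeding back the continuous state itself (the latent path).

```python
import math

# Hypothetical 2-D token embeddings; real models use far larger vocabularies
# and many more dimensions.
VOCAB = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
}


def dot(a, b):
    """Plain dot product of two equal-length tuples."""
    return sum(x * y for x, y in zip(a, b))


def token_step(hidden):
    """Token-level feedback: pick the best-scoring token, then re-embed it."""
    token = max(VOCAB, key=lambda t: dot(hidden, VOCAB[t]))
    return token, VOCAB[token]


def latent_step(hidden):
    """Latent feedback: pass the continuous state on unchanged."""
    return hidden


def discretization_error(hidden):
    """Distance between the state and what survives a round trip through a token."""
    _, re_embedded = token_step(hidden)
    return math.dist(hidden, re_embedded)


# A state that sits "between" the tokens right and up.
hidden = (0.6, 0.5)
token, next_input = token_step(hidden)

print(token, next_input)                       # the single token that is kept
print(discretization_error(hidden))            # information lost by committing to it
print(math.dist(hidden, latent_step(hidden)))  # the latent path loses nothing here
```

In this toy setting the token path collapses a state that was nearly as much "up" as "right" into "right" alone, while the latent path carries both components forward. Real systems add learned projections and training objectives that this sketch deliberately omits.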
Results
This is a survey paper; no new benchmarks are reported. Key representative systems cited:
- COCONUT — continuous thought loops; prototype-stage latent reasoning
- MemGen — latent memory for long-context retention
- Mirage — visual thinking via latent perception
- UniVLA — embodied action generation in latent space (relevant to VLA/robotics domain)
No aggregate benchmark table is provided in the survey itself.
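As a reading aid, the Mechanism × Ability taxonomy described under Core Idea can be treated as a small lookup grid. The sketch below is hypothetical bookkeeping, not an artifact of the survey: the `Work` class and `by_ability` helper are invented for illustration, the ability assigned to each system is inferred from its one-line description in this note, and the mechanism is left unset because the note does not state it.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Axis labels as listed in this note's Core Idea section.
MECHANISMS = ("Architecture", "Representation", "Computation", "Optimization")
ABILITIES = ("Reasoning", "Planning", "Modeling", "Perception",
             "Memory", "Collaboration", "Embodiment")


@dataclass(frozen=True)
class Work:
    """One cited system placed on the taxonomy (hypothetical helper type)."""
    name: str
    ability: str
    mechanism: Optional[str] = None  # not stated in this note, so left unset

    def __post_init__(self):
        if self.ability not in ABILITIES:
            raise ValueError(f"unknown ability: {self.ability}")
        if self.mechanism is not None and self.mechanism not in MECHANISMS:
            raise ValueError(f"unknown mechanism: {self.mechanism}")


# Ability labels are inferred from the descriptions above; they are assumptions.
WORKS = [
    Work("COCONUT", "Reasoning"),
    Work("MemGen", "Memory"),
    Work("Mirage", "Perception"),
    Work("UniVLA", "Embodiment"),
]

# Every (mechanism, ability) cell of the two-dimensional taxonomy.
GRID = list(product(MECHANISMS, ABILITIES))


def by_ability(works):
    """Group work names under each ability domain, keeping empty domains."""
    index = {ability: [] for ability in ABILITIES}
    for work in works:
        index[work.ability].append(work.name)
    return index


index = by_ability(WORKS)
uncovered = [ability for ability, names in index.items() if not names]

print(len(GRID))   # 28
print(uncovered)   # ['Planning', 'Modeling', 'Collaboration']
```

Listing the uncovered domains makes it easy to see which parts of the taxonomy this note has no representative system for, which is a property of the note rather than of the survey.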
Limitations
- Author-stated: Evaluability constraints make it hard to compare systems operating in latent space; interpretability remains an open problem; no standardized cross-modal benchmarks exist.
- Unstated: The survey spans cs.AI broadly, and its embodiment/robotics coverage (UniVLA, physical action generation) appears to be a minor subsection rather than a deep treatment. Readers primarily interested in robotics VLA models may find it thin compared with the reasoning and planning sections.
- Unstated: With 37+ authors from many institutions, the survey risks being encyclopedic rather than opinionated; the Mechanism × Ability taxonomy is useful but may flatten important distinctions between architecturally incompatible approaches.
Reproducibility
- Code: Not applicable (survey paper); authors mention a companion GitHub repository for resources
- Datasets: Survey covers works using standard NLP and vision benchmarks
- Compute: Not applicable
Insights
The four-stage evolution timeline is useful for situating individual papers: the “Outbreak” stage (Dec 2025–present) coincides with rapid architectural specialization, suggesting the field is moving from proof-of-concept toward engineering maturity. The Embodiment domain, where latent representations encode physical actions rather than tokens, is directly relevant to vision-language-action (VLA) model development and learning from demonstration (LfD) in robotics; UniVLA is flagged as a representative work worth following. The survey’s framing implies that future robotics foundation models may increasingly operate natively in latent space rather than producing language-mediated action plans, which aligns with emerging trends in VLA architectures.
Connections
- UniVLA — embodied action generation in latent space; directly relevant to VLA/LfD
- how-do-we-research-hri-age-of-llms-systematic-review-arxiv-2602.15063 — parallel survey on LLM-HRI; both map emerging fields with multi-author systematic reviews
- lerobot-open-source-robot-learning-library-arxiv — robotics learning library that may benefit from latent-space planning approaches
Raw Excerpt
“Many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces.”