Summary
A translated and annotated summary of Simon Willison’s “2025: The Year in LLMs,” covering the three defining themes of 2025: the rise of reasoning models (RLVR), the maturation of AI agents (especially coding agents), and the emergence of Claude Code as the year’s most influential product launch. The piece argues that 2025’s capability gains came primarily from longer RL training rather than larger model scale, and that coding agents and AI-assisted search are the two proven agent deployment patterns.
Key Points
- 2025 = Year of Reasoning: o1 → o3/o3-mini/o4-mini; RLVR (Reinforcement Learning from Verifiable Rewards) is the core technique
- Capability gains shifted from pre-training scale to RL training length — compute redirected from pre-training to RL
- Reasoning models’ real value: multi-step tool use (plan → execute → observe → adjust), not just logic puzzles
- AI-assisted search finally works: GPT-5 Thinking-style systems answer complex research questions reliably
- Agent definition settled: “LLM systems that accomplish useful work through multi-step tool calls” — not AGI, but genuinely useful
- Claude Code launched quietly in Feb (bundled in Claude 3.7 Sonnet announcement) — became the year’s most influential product
- Major CLI coding agents: Claude Code, Codex CLI, Gemini CLI, Qwen Code, Mistral Vibe, plus vendor-neutral options (Amp, OpenCode, OpenHands CLI)
- Two proven agent deployment patterns: coding and deep search
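The "verifiable rewards" idea behind RLVR can be illustrated with a toy scorer. This is a minimal sketch, not anything from the article: the helper names (`verifiable_reward`, `score_rollouts`) and the last-line answer-extraction convention are assumptions for illustration. The point is that the reward requires no learned judge; a deterministic checker (exact match on a known math answer, a passing unit test) scores each sampled completion, and RL training then upweights the traces that scored 1.

```python
def verifiable_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the completion's final line matches the ground-truth answer."""
    final = completion.strip().splitlines()[-1]  # convention: last line is the answer
    return 1.0 if final == expected else 0.0

def score_rollouts(rollouts: list[str], expected: str) -> list[float]:
    """Score a batch of sampled completions; RL upweights the reward-1.0 traces."""
    return [verifiable_reward(r, expected) for r in rollouts]

rollouts = [
    "Let me reason step by step...\n42",
    "Let me reason step by step...\n41",
]
print(score_rollouts(rollouts, "42"))  # → [1.0, 0.0]
```

Because the checker is binary and automatic, training length (how many rollouts get scored) becomes the lever, which is consistent with the piece's claim that compute moved from pre-training to RL.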
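The agent definition above (multi-step tool calls: plan → execute → observe → adjust) can be sketched as a small loop. The model call is stubbed out, and the tool names and action format are hypothetical, not any vendor's API; this is only meant to show the control flow shared by the coding-agent CLIs listed above.

```python
def run_agent(model, tools, task, max_steps=5):
    """Loop: ask the model to plan a step, execute the chosen tool, feed the
    observation back, and repeat until the model decides to finish."""
    observations = []
    for _ in range(max_steps):
        action = model(task, observations)                 # plan
        if action["tool"] == "finish":
            return action["answer"]
        result = tools[action["tool"]](**action["args"])   # execute
        observations.append(result)                        # observe, then adjust
    return None  # step budget exhausted

# Stub model: first read a file, then finish with what it observed.
def stub_model(task, observations):
    if not observations:
        return {"tool": "read_file", "args": {"path": "README"}}
    return {"tool": "finish", "answer": observations[-1]}

tools = {"read_file": lambda path: f"contents of {path}"}
print(run_agent(stub_model, tools, "summarize README"))  # → contents of README
```

The "not AGI, but genuinely useful" framing falls out of the structure: the loop is bounded, the tools are ordinary functions, and all intelligence sits in the planning step.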
Insights
- “Almost all capability progress in 2025 came from longer RL training, not larger model scale” — this reframes the compute narrative; the scaling law debate shifted from parameters to training regime
- Claude Code launching without a dedicated blog post and still becoming the year’s defining product is a case study in letting the product speak — the engineering community discovered it organically
- The version numbering skip (3.5 → 3.7, skipping 3.6 because the community had already used that name for the silently upgraded 3.5 Sonnet) is a minor but telling detail about how fast the field moves and how naming conventions emerge organically
- The author’s own prediction failure (“agents won’t land in 2025”), followed by updating the definition rather than quietly defending the prediction, is intellectually honest and worth noting as an approach to revising one’s mental models
- Deep research mode (15+ minutes for detailed reports) becoming obsolete within a year because better systems could match quality in seconds shows how quickly “impressive demos” can become baseline expectations
Connections
- Claude Code
- AI Agents
- Reinforcement Learning
- Reasoning Models
- Lessons from Building Claude Code: How We Use Skills
- The Claude Code You Don’t Know: Architecture, Governance, and Engineering Practice (你不知道的 Claude Code)
- AI Agents 101
Raw Excerpt
The most influential event of 2025 was Anthropic quietly releasing Claude Code in February, without even a dedicated blog post; it was simply tucked into the Claude 3.7 Sonnet announcement.