Improving Agent Systems & AI Reasoning

This article was generated by AI analysis.

Created: 2026-03-28 Source: https://medium.com/towards-data-science/improving-agent-systems-ai-reasoning-c2d91ecfdf77

Summary

Tula Masterman’s analysis of how Reasoning Language Models (RLMs) such as DeepSeek-R1 and OpenAI o1/o3 change agent system design. It covers train-time vs. test-time compute scaling, DeepSeek-R1’s multi-stage post-training pipeline (alternating RL and SFT stages), and the implications for agent architecture.

Key Points

Train-time compute scaling: pre-training + post-training (SFT, RL) — updates model parameters
Test-time compute scaling: at inference, explore multiple solution paths (Best-of-N, Beam Search, DVTS) — no parameter updates; smaller models can match larger ones with more inference compute
DeepSeek-R1 training pipeline: (1) RL on base model → R1-Zero (learned CoT but language-mixed); (2) SFT with CoT cold-start data; (3) RL with language consistency + reasoning rewards; (4) SFT for general capabilities; (5) final RL alignment → R1-671B
Distilled R1 models: SFT-only fine-tuning of Llama/Qwen (1.5B-70B) from R1 outputs, no RL needed
Impact on agents: RLMs enable streamlined single-agent workflows replacing multi-agent orchestration; tool-calling support still lagging (o3-mini first RLM with native tool calls)
Shift in UX: tasks delegated to background agents (minutes/hours) vs. immediate chat responses
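
To make the test-time compute point above concrete, here is a minimal Best-of-N sketch: sample several candidate answers, score each one, and keep the best, with no parameter updates. The names `best_of_n`, `toy_generate`, and `toy_score` are hypothetical and not from the article; the toy generator and scorer stand in for a real model call and a real verifier or reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # More candidates means more inference compute, but the model is unchanged.
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions for illustration only): a noisy "model" that
# guesses near 42, and a "verifier" that rewards answers close to 42.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return str(42 + rng.randint(-3, 3))


def toy_score(prompt: str, candidate: str) -> float:
    return float(-abs(int(candidate) - 42))


if __name__ == "__main__":
    answer, quality = best_of_n("17 + 25 = ?", toy_generate, toy_score, n=8)
    print(answer, quality)
```

Raising `n` is the knob that trades extra inference compute for answer quality; Beam Search and DVTS refine the same idea by scoring partial reasoning steps rather than only complete answers.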

Insights

The train/test-time compute distinction is the key conceptual divide: RL post-training builds reasoning into the model permanently (expensive, one-time), while test-time scaling rents compute at inference for each query (flexible, recurring cost). The DeepSeek-R1 multi-stage pipeline is notable for showing that alternating SFT and RL stages (rather than pure RL) produces more stable and capable reasoning models. The distillation finding — smaller models can learn reasoning behaviors via SFT from R1 outputs without RL — significantly lowers the barrier to capable reasoning models. The agent architecture implication (fewer, smarter agents vs. many specialized agents) aligns with the broader “simplicity” trend in agentic systems.
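
The one-time versus recurring cost trade-off described above can be illustrated with simple break-even arithmetic. This is a rough sketch under loud assumptions: the function name `break_even_queries` and every number are hypothetical, and it ignores the fact that RL-trained reasoning models also tend to spend more tokens at inference.

```python
import math


def break_even_queries(one_time_training_cost: float, extra_cost_per_query: float) -> int:
    """Smallest query count at which cumulative test-time spend reaches the
    one-time post-training spend (both in the same currency unit)."""
    if one_time_training_cost < 0:
        raise ValueError("one_time_training_cost must be non-negative")
    if extra_cost_per_query <= 0:
        raise ValueError("extra_cost_per_query must be positive")
    return math.ceil(one_time_training_cost / extra_cost_per_query)


if __name__ == "__main__":
    # Hypothetical figures: 1000 units spent once on RL post-training versus
    # 0.5 extra units of inference compute per query for test-time scaling.
    print(break_even_queries(1000.0, 0.5))
```

Below the break-even volume, renting compute per query is cheaper; above it, paying once to build the reasoning into the model wins, all else being equal.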

Connections

Raw Excerpt

Instead of relying on the developer to guide the entire reasoning and iteration process, there’s opportunities to allow the model to explore multiple solution paths, reflect on its progress, rank the best solution paths, and generally refine the overall reasoning lifecycle before sending a response to the user.
