Summary

A deep dive into DeepSeek-R1’s training methodology, explaining how the open-source model matches or beats OpenAI’s o1 through a combination of supervised fine-tuning on chain-of-thought data and GRPO (Group Relative Policy Optimization) reinforcement learning, during which an “aha moment” self-correction behavior emerges on its own. The article places this in the broader story of AI openness vs. proprietary development.

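An in-depth introduction to DeepSeek-R1's training method: how this open-source model reaches or surpasses the level of OpenAI o1 through supervised fine-tuning on chain-of-thought data, reinforcement learning, and the self-correcting behavior that emerges during training, with background analysis from the perspective of AI openness.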

Key Points

  • DeepSeek-R1 openly published its training methodology, contrasting with OpenAI’s secrecy since GPT-2
  • Uses GRPO reinforcement learning (not PPO) to train reasoning capabilities
  • Training pipeline: cold start SFT → RL with GRPO → rejection sampling → final SFT+RL
  • The “aha moment” behavior — model self-correcting mid-reasoning — emerges from RL training, not explicit supervision
  • Released as open weights, accelerating the broader AI research community
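
The rejection sampling step in the pipeline above can be illustrated with a minimal sketch: draw several completions per prompt, keep only those that pass a check, and reuse the survivors as supervised fine-tuning pairs. Everything named here (`rejection_sample`, `toy_generate`, `toy_is_acceptable`, the exact-match filter) is a hypothetical stand-in for illustration, not DeepSeek's actual code or filtering criteria.

```python
from itertools import cycle

def rejection_sample(prompt, generate, is_acceptable, num_samples=8):
    """Sample several completions and keep only the accepted ones.

    `generate` and `is_acceptable` are caller-supplied callables; in a real
    pipeline they would be a model checkpoint and a correctness/quality
    filter (both are assumptions in this sketch).
    """
    candidates = [generate(prompt) for _ in range(num_samples)]
    # Surviving (prompt, completion) pairs become SFT training examples.
    return [(prompt, c) for c in candidates if is_acceptable(prompt, c)]

# Toy stand-ins so the sketch runs without a model.
_answers = cycle(["4", "5", "4", "22"])

def toy_generate(prompt):
    # Hypothetical "model": cycles through canned answers.
    return next(_answers)

def toy_is_acceptable(prompt, completion):
    # Hypothetical filter: exact match against a known answer.
    return completion == "4"

sft_pairs = rejection_sample("What is 2 + 2?", toy_generate,
                             toy_is_acceptable, num_samples=4)
print(sft_pairs)
```

With these toy stand-ins, two of the four sampled answers survive the filter; the rejected ones are simply dropped rather than penalized.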

Insights

DeepSeek-R1 demonstrated that explicit chain-of-thought reasoning can be trained with RL even without access to a reasoning teacher model: the capability can be bootstrapped from the base model. GRPO scores each output relative to the other outputs sampled for the same prompt rather than against a separately learned critic (value model), and dropping that critic is what makes it more compute-efficient than PPO for this use case. The transparency of the training recipe has arguably accelerated reasoning-model research by 12-18 months for the broader community.
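
To make the group-relative idea concrete, here is a minimal sketch of the advantage computation only: each sampled output's reward is normalized by the mean and standard deviation of its own group, so no value model is needed. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for this illustration; the full GRPO objective (clipped policy ratio, KL penalty) is not shown.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs for the same prompt.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    `eps` (an assumption of this sketch) avoids division by zero when
    every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sketch-level choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: two rewarded, two not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group where every answer scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

Outputs that beat their group's average get a positive advantage and the rest get a negative one, which is the signal the policy update uses in place of a critic's estimate.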

Connections

Raw Excerpt

With OpenAI releasing ‘o1’ a few months ago, they have really discovered something… And then DeepSeek published their full training methodology, making it available to the world.