This note was generated by AI analysis.
Summary
A first-reaction post by Abhinav Yasaswi on the release of OpenAI o1 (September 2024), framed around the author’s graduate-course experience of testing GPT-4’s reasoning failures (counting letters in “strawberry”, physics riddles). o1 fixes several of these failures through RL-trained chain-of-thought, though it still fails on some edge cases.
Key Points
- GPT-4 failure modes: counting letters (“r’s in strawberry”), elementary logic, simple arithmetic, common-sense physics; all require reasoning that standard LLMs lack
- o1 approach: RL-trained chain-of-thought; the model generates internal reasoning and can question and correct itself (reflection)
- o1 success: correctly answers the “strawberry” r count and the physics riddle (cup upside down → microwave); thinks for a matter of seconds before responding
- o1 still fails: some trivial riddles where the answer is embedded in the question; falls back on memorized content
- Jason Wei (an author of the chain-of-thought paper while at Google) worked on o1’s chain-of-thought integration at OpenAI
- Nuanced conclusion: significant progress, but real-world tasks still reveal limitations; benchmarks don’t capture everything
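The letter-counting probe in the list above is easy to grade mechanically, because the ground truth is a single string operation. The following is a minimal sketch, not the author's test setup: the helper names `count_letter` and `grade_letter_count` are hypothetical, and the model answers are typed in by hand as integers rather than fetched from any model API.

```python
def count_letter(word: str, letter: str) -> int:
    """Ground-truth count of `letter` in `word`, ignoring case."""
    return word.lower().count(letter.lower())


def grade_letter_count(word: str, letter: str, model_answer: int) -> bool:
    """Hypothetical grader: True if the model's stated count matches the ground truth."""
    return model_answer == count_letter(word, letter)


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))            # 3
    # The answers below are illustrative inputs, not recorded model outputs.
    print(grade_letter_count("strawberry", "r", 2))   # False
    print(grade_letter_count("strawberry", "r", 3))   # True
```

A grader like this only checks the final number; it says nothing about whether the model reached it by reasoning or by recalling a memorized answer, which is the distinction the post cares about.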
Insights
This article captures the immediate reception of reasoning models: the “strawberry” letter-counting failure was a widely shared meme that o1 specifically addressed. Because its chain-of-thought is trained with RL, o1 does not just retrieve a pattern; it reasons through the problem step by step, which is why it succeeds on sequential-logic tasks where GPT-4 fails. The remaining failure on trivial riddles whose answer is stated in the question is important: it shows that better reasoning does not eliminate retrieval and pattern-matching failure modes; it only shifts where the frontier is.
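To make the contrast between recalling an answer and working through it concrete, the sketch below counts a letter one character at a time and records each intermediate step, loosely in the spirit of a written-out chain of thought. It is an illustration only and makes no claim about how o1 works internally; the function name `count_with_trace` and the wording of the steps are assumptions.

```python
def count_with_trace(word: str, letter: str) -> tuple[int, list[str]]:
    """Count `letter` in `word` character by character, recording every step."""
    total = 0
    steps = []
    for position, char in enumerate(word, start=1):
        if char.lower() == letter.lower():
            total += 1
            steps.append(f"{position}: '{char}' matches, running total {total}")
        else:
            steps.append(f"{position}: '{char}' does not match")
    return total, steps


if __name__ == "__main__":
    total, steps = count_with_trace("strawberry", "r")
    for step in steps:
        print(step)
    print(f"answer: {total}")  # answer: 3
```

Each step depends only on the current character and the running total, so any single line of the trace can be checked and corrected on its own.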
Connections
Raw Excerpt
OpenAI trained the chain of thought generation process using Reinforcement learning. In the o1 models, the engineers were able to ask the model questions as to why it was wrong in its chain-of-thought process and it could identify the mistakes and correct itself.