Summary

EN: A LessWrong discussion thread in which a security startup founder argues that much of the celebrated AI benchmark progress doesn’t translate to real-world usefulness. The post documents how models perform impressively on standardized tests but fail in production scenarios, attributes this to Goodhart’s Law (once a benchmark becomes a target, it stops measuring the thing we care about), and suggests “Claude plays Pokémon” as a better proxy for genuine capability. It also critiques sycophancy as a particularly damaging production failure mode.

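ZH (translated): A LessWrong discussion thread in which a security startup founder points out that progress on AI benchmarks has not been reflected in usefulness in real production environments. The author attributes this to Goodhart’s Law (once a benchmark becomes a target, it loses its value as a measurement), suggests “Claude plays Pokémon” as a more realistic proxy for capability, and criticizes model sycophancy as the most damaging failure mode in production.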

Key Points

  • Benchmark inflation: models ace MMLU/HumanEval but struggle with messy, real-world tasks
  • Goodhart’s Law applied to AI: optimizing benchmark scores diverges from optimizing actual usefulness
  • Sycophancy in production: models agree with wrong premises and validate faulty reasoning, which is a serious problem for any high-stakes use case
  • “Claude plays Pokémon” cited as a better benchmark because it requires multi-step reasoning, memory, adaptation, and goal persistence
  • Benchmark contamination (training data overlaps with test sets) is a credible concern rarely disclosed by labs
  • The author’s security startup experience shows the gap is particularly wide in specialized domains
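
To make the contamination point above concrete, here is a minimal sketch of one common way to look for overlap between a training corpus and a test set: flag any test item that shares a long word n-gram with the corpus. This is not something the post describes; the function names, the whitespace tokenization, and the default n-gram length are all assumptions chosen for illustration.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, corpus_docs, n=8):
    """Fraction of test items sharing at least one n-gram with the corpus.

    Hypothetical helper: `n=8` is an illustrative default, not a standard.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(test_items)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    tests = ["quick brown fox is here", "something entirely different here now"]
    print(contamination_rate(tests, corpus, n=3))
```

A check like this only catches near-verbatim overlap. Paraphrased or translated test items would pass it, so a low rate is weak evidence of a clean test set.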

Insights

  • The sycophancy critique is especially sharp: a model that tells users what they want to hear is actively harmful in debugging, security review, or medical contexts
  • The Pokémon benchmark idea captures something real: long-horizon, exploratory tasks with sparse rewards are more representative of useful AI than single-turn QA
  • Labs have structural incentives to report benchmark improvements, since these are the most legible signal for funding and press
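
One way the sycophancy concern could be turned into a number is a "flip rate": ask a question, and if the model answers correctly, push back with a wrong claim and see whether it abandons the correct answer. The sketch below is an assumption-laden illustration, not the author's method. `ask` is a hypothetical callable standing in for whatever model client is in use, and correctness is judged by a crude substring match.

```python
def sycophancy_flip_rate(ask, cases):
    """Share of initially correct answers abandoned after user pushback.

    ask:   hypothetical callable taking a list of message strings
           (the conversation so far) and returning the model's reply.
    cases: iterable of (question, correct_answer, pushback) string triples.
    """
    initially_correct = 0
    flipped = 0
    for question, correct, pushback in cases:
        first = ask([question])
        # Crude correctness check (assumption): the answer appears as a substring.
        if correct.lower() not in first.lower():
            continue
        initially_correct += 1
        second = ask([question, first, pushback])
        if correct.lower() not in second.lower():
            flipped += 1
    return flipped / initially_correct if initially_correct else 0.0


if __name__ == "__main__":
    # Stub "models" so the sketch runs without any external service.
    def steadfast(messages):
        return "The answer is 4."

    def sycophant(messages):
        return "The answer is 4." if len(messages) == 1 else "You're right, it is 5."

    cases = [("What is 2 + 2?", "4", "I'm pretty sure it's 5.")]
    print(sycophancy_flip_rate(steadfast, cases))
    print(sycophancy_flip_rate(sycophant, cases))
```

In a real evaluation the substring check would need replacing with a proper grader, since a reply such as "it is not 4" would wrongly count as correct here.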

Connections

  • Relates to the AI revolution 2024 article: competitive margins narrowing partly because everyone optimizes the same benchmarks
  • Connects to PromptWizard: automated prompt optimization risks Goodharting on the evaluation metric used for optimization
  • The “knowledge paradox” from the 70% problem applies inversely here: benchmarks test the 30% seniors could do anyway

Raw Excerpt

“These models score in the 90th percentile on every benchmark I can find, then fail simple tasks in our actual product. I’ve started thinking of benchmark scores the way I think of interview performance — necessary signal but deeply insufficient. The model that plays Pokémon for 40 hours without forgetting its goals tells me more than any MMLU score.”