This article was generated by AI analysis
Created: 2026-03-28 Source: https://towardsdatascience.com/how-to-evaluate-llm-summarization-18a040c3905d
Summary
Isaac Tham’s framework for quantitatively evaluating LLM summaries, improving on DeepEval’s built-in summarization metric. Frames summarization quality as a precision/recall problem over facts, adds coherence and conciseness metrics, and demonstrates with DeepEval’s LLM-as-judge (GEval) infrastructure.
Key Points
- Why summarization is hard to evaluate: open-ended output, subjective quality criteria, no easy synthetic gold standard (unlike RAG QA pairs)
- Traditional metrics (BLEU, ROUGE) don’t work well for abstractive LLM summaries
- 4 key qualities: Relevant, Concise, Coherent, Faithful (no hallucinations)
- Precision/recall framing: recall = “how many source facts retained?”; precision = “how many summary facts are supported by source?” (hallucination measure)
- Key insight: higher recall is better only when summary length is held constant; reaching 100% recall by copying the source text does not make a good summary
- DeepEval’s GEval: flexible LLM-as-judge for custom criteria; parallelized async evaluation at scale
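The precision/recall framing in the points above can be sketched in a few lines of Python. This is a minimal illustration, not the article's or DeepEval's implementation: the names `FactScores` and `score_summary` are hypothetical, facts are assumed to be already extracted as strings, and matching is done by exact set membership, whereas the article relies on an LLM judge to decide whether a fact is supported. The `compression` field is an added assumption to make the "holding length constant" point visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactScores:
    precision: float    # share of summary facts supported by the source
    recall: float       # share of source facts retained in the summary
    compression: float  # summary length / source length (lower is more concise)


def score_summary(source_facts, summary_facts, source_len, summary_len):
    """Score a summary from pre-extracted facts (hypothetical helper).

    Assumption: a summary fact counts as supported only if the identical
    string appears among the source facts. A real pipeline would ask an
    LLM judge instead of comparing strings.
    """
    source_set, summary_set = set(source_facts), set(summary_facts)
    supported = source_set & summary_set
    precision = len(supported) / len(summary_set) if summary_set else 0.0
    recall = len(supported) / len(source_set) if source_set else 0.0
    compression = summary_len / source_len if source_len else 0.0
    return FactScores(precision, recall, compression)


source = ["A", "B", "C", "D"]

# A short summary that keeps two facts and hallucinates one ("X").
short = score_summary(source, ["A", "B", "X"], source_len=400, summary_len=100)

# "Summarizing" by copying the source: perfect recall, but no compression.
copied = score_summary(source, source, source_len=400, summary_len=400)
```

Here `short` has recall 0.5 and precision 2/3 (the hallucinated fact lowers precision), while `copied` reaches recall 1.0 only because its compression is 1.0, which is why recall has to be compared at a fixed length.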
Insights
The precision/recall reframing is the most valuable contribution: it converts a qualitative evaluation problem (“is this a good summary?”) into a quantitative one that can be automated with LLM-as-judge. The key asymmetry sets the priority ordering: hallucination (low precision) is catastrophic, while missing some facts (lower recall) is merely imperfect. The failure of traditional BLEU/ROUGE on abstractive summaries is a known limitation; this article’s approach (LLM-as-judge for fact extraction plus coverage scoring) is closer to how humans actually evaluate summaries.
Connections
Raw Excerpt
You can formulate Relevant and Concise as a precision and recall problem — how many facts from the source text are retained in the summary (recall), and how many facts from the summary are supported by the main text (precision). Hallucinating information is really bad: precision should be close to 100%.