This note was generated by AI analysis.
Created: 2026-03-28 Source: https://www.slideshare.net/slideshow/eval-driven-development-edd-ai/271729781#5
Summary
A SlideShare presentation on Eval-Driven Development (EDD), a methodology for managing the non-determinism inherent in generative AI systems. The slides cover building evaluation datasets (both labeled and unlabeled), using LLM-as-judge scoring (G-Eval), and automated prompt optimization via TextGrad. The context is a real investment chatbot classification task (classify questions as Y/F/N for stock-specific / general finance / unrelated).
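For the labeled case, the Y/F/N task described above can be scored with a few lines of code. The following is a minimal sketch, not the presenter's harness: the example questions, the `evaluate` helper, and the `keyword_stub` classifier are all hypothetical stand-ins, with the stub taking the place of a real LLM call so the sketch runs offline.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

LABELS = ("Y", "F", "N")  # stock-specific / general finance / unrelated

# Hypothetical labeled examples, invented purely for illustration.
EVAL_SET = [
    ("What was TSMC's revenue last quarter?", "Y"),
    ("How does dollar-cost averaging work?", "F"),
    ("Recommend a good ramen shop nearby.", "N"),
    ("Is an ETF safer than a single stock?", "F"),
]


def evaluate(classify: Callable[[str], str],
             dataset: Iterable[Tuple[str, str]]):
    """Return (accuracy, confusion), where confusion counts (expected, predicted) pairs."""
    confusion = Counter()
    total = 0
    correct = 0
    for question, expected in dataset:
        predicted = classify(question).strip().upper()
        if predicted not in LABELS:
            predicted = "INVALID"  # malformed outputs are counted, not hidden
        confusion[(expected, predicted)] += 1
        total += 1
        correct += int(predicted == expected)
    return correct / total, confusion


def keyword_stub(question: str) -> str:
    """Stand-in for the real model call (assumption: the real one returns Y, F, or N)."""
    q = question.lower()
    if "tsmc" in q:
        return "Y"
    if any(word in q for word in ("etf", "stock", "averaging")):
        return "F"
    return "N"


accuracy, confusion = evaluate(keyword_stub, EVAL_SET)
print(f"accuracy={accuracy:.2f}")
for (expected, predicted), count in sorted(confusion.items()):
    print(f"expected={expected} predicted={predicted} count={count}")
```

Keeping the confusion counts alongside accuracy shows which of the three classes the prompt confuses, which is usually more actionable than a single number.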
Key Points
- EDD: treat evaluation as a first-class citizen in AI development — build eval datasets before or alongside building features
- Three evaluation scenarios: (1) labeled ground-truth available, (2) no standard answer (use LLM-as-judge / G-Eval), (3) RAG with reference documents
- LLM-as-judge prompt patterns: answer relevance, groundedness (hallucination check), context relevance, answer correctness vs ground truth
- Synthetic dataset generation: use AI to generate test questions across positive/negative/edge-case categories; use HyDE (Hypothetical Document Embeddings) for RAG eval
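- TextGrad: automatic prompt optimization driven by "textual gradients", i.e. natural-language feedback from an LLM that plays the role a numeric gradient plays in ML training; the prompt is treated as the parameter being optimized, and the loop iterates to reduce eval loss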
- Chain-of-Thought in eval prompts: put reasoning before the score/classification output to improve judge accuracy
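The judge-prompt and Chain-of-Thought points above can be sketched as a prompt template plus a score parser. This is an illustrative sketch only: it is not the G-Eval prompt or the prompt from the slides, and `JUDGE_TEMPLATE`, `build_judge_prompt`, and `parse_score` are hypothetical names. The one design choice it demonstrates is asking for the reasoning first and the score on the final line.

```python
import re

# Hypothetical judge prompt: reasoning is requested BEFORE the score,
# so the score is conditioned on the written reasoning.
JUDGE_TEMPLATE = """You are grading a chatbot answer.
Criterion: {criterion}

Question: {question}
Answer: {answer}

First, write your reasoning step by step.
Then, on the final line, write exactly: SCORE: <integer from 1 to 5>"""


def build_judge_prompt(criterion: str, question: str, answer: str) -> str:
    """Fill the template for one (question, answer) pair."""
    return JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer
    )


# Matches a line that consists only of "SCORE: n" with n in 1..5.
_SCORE_RE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_score(judge_output: str):
    """Return the integer from the last SCORE line, or None if there is none."""
    matches = _SCORE_RE.findall(judge_output)
    return int(matches[-1]) if matches else None


prompt = build_judge_prompt(
    criterion="answer relevance: does the answer address the question?",
    question="How does dollar-cost averaging work?",
    answer="You invest a fixed amount at regular intervals.",
)

# A made-up judge reply, used here only to exercise the parser.
sample_reply = "The answer directly explains the mechanism.\nSCORE: 4"
print(parse_score(sample_reply))
```

Returning `None` for an unparseable reply, rather than a default score, keeps malformed judge outputs visible in the eval results.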
Insights
The key insight is that non-determinism in LLMs makes traditional software testing approaches insufficient: exact-match unit tests are brittle against stochastic outputs. EDD addresses this by making “is this output good enough?” a measurable, automated question rather than a manual review. The TextGrad approach is particularly interesting: by treating the prompt as a parameter to be optimized rather than something to be hand-tuned, it carries the gradient-descent intuition from ML training over to prompt engineering, with LLM-written critiques standing in for numeric gradients.
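The optimization idea can be sketched as a loop: measure loss on the eval set, ask for a critique of the failures, revise the prompt, and keep the revision only if loss drops. This is a toy illustration of the intuition and does not use the TextGrad library or its API; `optimize_prompt` and the three `stub_*` functions are hypothetical, and in practice each stub would be an LLM call.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]          # (question, expected label)
Failure = Tuple[str, str, str]     # (question, expected, got)


def optimize_prompt(
    prompt: str,
    dataset: Sequence[Example],
    run_model: Callable[[str, str], str],          # (prompt, question) -> label
    critique: Callable[[str, List[Failure]], str],  # -> textual feedback
    revise: Callable[[str, str], str],             # (prompt, feedback) -> new prompt
    max_steps: int = 5,
) -> Tuple[str, float]:
    """Greedy loop: accept a revised prompt only when its error rate is lower."""

    def loss(p: str) -> Tuple[float, List[Failure]]:
        failures = []
        for question, expected in dataset:
            got = run_model(p, question)
            if got != expected:
                failures.append((question, expected, got))
        return len(failures) / len(dataset), failures

    best_prompt = prompt
    best_loss, failures = loss(prompt)
    for _ in range(max_steps):
        if best_loss == 0:
            break
        feedback = critique(best_prompt, failures)   # the "textual gradient"
        candidate = revise(best_prompt, feedback)    # the "update step"
        cand_loss, cand_failures = loss(candidate)
        if cand_loss < best_loss:
            best_prompt, best_loss, failures = candidate, cand_loss, cand_failures
    return best_prompt, best_loss


# Deterministic stand-ins so the sketch runs offline (all hypothetical).
def stub_model(prompt: str, question: str) -> str:
    if "TSMC" in question:
        return "Y"
    if "ETF" in question and "general finance" in prompt:
        return "F"
    return "N"


def stub_critique(prompt: str, failures: List[Failure]) -> str:
    return "The prompt never defines F; explain that F means general finance."


def stub_revise(prompt: str, feedback: str) -> str:
    return prompt + " Use F for general finance questions."


DATASET = [("TSMC earnings?", "Y"), ("What is an ETF?", "F"), ("Best ramen?", "N")]
best_prompt, best_loss = optimize_prompt(
    "Classify the question as Y, F, or N.",
    DATASET, stub_model, stub_critique, stub_revise,
)
print(best_loss, best_prompt)
```

Accepting a candidate only when the measured loss improves is what ties the optimization back to the eval dataset: without the dataset there is nothing to descend on.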
Connections
Raw Excerpt
Fight magic with magic: use LLM as a judge to have the AI assign the scores • G-Eval is a common way to write it