This note was generated by AI analysis.
Summary
Microsoft Data Science team article covering the full LLM system evaluation lifecycle: the difference between evaluating a base LLM vs. an LLM-based system (RAG, fine-tuned), offline vs. online evaluation strategies, golden dataset construction, and the CI/CE/CD paradigm for LLMOps.
Key Points
- Key distinction: evaluating a base LLM (standardized benchmarks like GLUE, MMLU) vs. evaluating an LLM-based system (your RAG pipeline, fine-tuned model)
- LLMOps adds Continuous Evaluation (CE) to CI/CD — evaluation is not a one-time step but an ongoing process
- Evaluation frameworks surveyed: Azure AI Studio Prompt Flow, W&B + LangChain, LangSmith, DeepEval, TruEra
- Offline evaluation: validates performance before deployment, uses static datasets, automatable in pipelines
- Online evaluation: monitors live production behavior, catches real-world distribution shifts
- Golden dataset construction: start by eyeballing outputs manually → curate a diverse set of inputs → add human annotation → scale up with LLM-assisted generation (with human oversight)
- LLM-as-judge pattern (QAEvalChain from LangChain) scales evaluation but requires validation against human labels
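The offline evaluation and judge-validation points above can be illustrated with a minimal sketch. It is not the article's code and does not use LangChain: `GoldenExample`, `keyword_judge`, `offline_eval`, and the four sample records are hypothetical names and data invented for illustration. The judge here is a simple substring check standing in for an LLM-based grader, so the only thing the sketch demonstrates is the shape of the loop: score a static golden dataset, then check how often the automated judge agrees with human labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenExample:
    """One row of a hypothetical golden dataset."""
    question: str
    reference: str       # expected answer, written or approved by a human
    system_answer: str   # answer recorded from the LLM system under test
    human_label: bool    # human verdict: is system_answer acceptable?


def keyword_judge(reference: str, answer: str) -> bool:
    """Stand-in judge (assumption): pass if the reference text appears
    in the answer, ignoring case. A real setup would call an LLM grader."""
    return reference.strip().lower() in answer.strip().lower()


def offline_eval(
    examples: List[GoldenExample],
    judge: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Score every example with the judge and compare against human labels."""
    verdicts = [judge(ex.reference, ex.system_answer) for ex in examples]
    pass_rate = sum(verdicts) / len(verdicts)
    agreement = sum(
        verdict == ex.human_label for verdict, ex in zip(verdicts, examples)
    ) / len(examples)
    return {"pass_rate": pass_rate, "judge_human_agreement": agreement}


# Illustrative data only; not taken from the article.
GOLDEN = [
    GoldenExample("What does CE stand for in LLMOps?", "Continuous Evaluation",
                  "CE stands for continuous evaluation.", True),
    GoldenExample("Name one benchmark for base LLMs.", "MMLU",
                  "One example is MMLU.", True),
    # A correct paraphrase: the human accepts it, the naive judge does not.
    GoldenExample("When does offline evaluation run?", "before deployment",
                  "It runs prior to release.", True),
    GoldenExample("What does online evaluation monitor?", "production",
                  "It monitors static datasets.", False),
]

result = offline_eval(GOLDEN, keyword_judge)
print(result)
```

The third record is a deliberate disagreement: a correct paraphrase that the naive judge rejects. A low `judge_human_agreement` value is the signal that the judge itself needs work before its scores can be trusted at scale.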
Insights
The article’s most useful framing is that once you add a RAG pipeline or fine-tune a model, evaluation responsibility shifts from the model provider to the builder. This is often underappreciated: teams assume a high-benchmark model will work well in their specific domain, but retrieval quality, prompt templates, and data preprocessing all affect output quality independent of base model capability. The CI/CE/CD framework makes evaluation a first-class engineering concern rather than an afterthought, which is the maturity jump from ad-hoc LLM apps to production-grade LLM systems.
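One way to picture Continuous Evaluation as a pipeline stage is a regression gate that compares the current run's metrics against a stored baseline. The sketch below is an assumption-laden illustration, not the article's implementation: the function `evaluation_gate`, the metric names `retrieval_hit_rate` and `answer_correctness`, the `max_drop` tolerance, and all numbers are hypothetical. Retrieval and answer quality are tracked separately because, as noted above, either can degrade independently of the base model.

```python
from typing import Dict, List


def evaluation_gate(
    current: Dict[str, float],
    baseline: Dict[str, float],
    max_drop: float = 0.02,
) -> List[str]:
    """Return a description of every metric that regressed by more than
    max_drop relative to the baseline, or that is missing from the run."""
    failures = []
    for metric, base in baseline.items():
        score = current.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from current run")
        elif base - score > max_drop:
            failures.append(f"{metric}: {score:.2f} vs baseline {base:.2f}")
    return failures


# Illustrative numbers only.
baseline = {"retrieval_hit_rate": 0.90, "answer_correctness": 0.85}
current = {"retrieval_hit_rate": 0.80, "answer_correctness": 0.86}

failures = evaluation_gate(current, baseline)
for failure in failures:
    print("REGRESSION:", failure)
```

In a real pipeline the calling script would typically exit with a non-zero status when `failures` is non-empty, so that the deployment step is blocked until the regression is investigated.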
Connections
Raw Excerpt
Does your evaluation process resemble the repetitive loop of running LLM applications on a list of prompts, manually inspecting outputs, and attempting to gauge quality based on each input? If so, it’s time to recognize that evaluation is not a one-time endeavor but a multi-step, iterative process.