Summary

An article from Microsoft's data science team covering the full evaluation lifecycle for LLM systems: how evaluating a base LLM differs from evaluating an LLM-based system (RAG pipeline, fine-tuned model), offline vs. online evaluation strategies, golden dataset construction, and the CI/CE/CD paradigm for LLMOps.

Key Points

  • Key distinction: evaluating a base LLM (standardized benchmarks like GLUE, MMLU) vs. evaluating an LLM-based system (your RAG pipeline, fine-tuned model)
  • LLMOps adds Continuous Evaluation (CE) to CI/CD: evaluation is an ongoing process, not a one-time step
  • Evaluation frameworks surveyed: Azure AI Studio Prompt Flow, W&B + LangChain, LangSmith, DeepEval, TruEra
  • Offline evaluation: validates performance before deployment, uses static datasets, automatable in pipelines
  • Online evaluation: monitors live production behavior, catches real-world distribution shifts
  • Golden dataset construction: start with eyeballing → curate diverse inputs → human annotation → LLM-assisted generation (with human oversight)
  • LLM-as-judge pattern (QAEvalChain from LangChain) scales evaluation but requires validation against human labels
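
The last point, validating an LLM judge against human labels, can be sketched as a simple agreement check. This is a minimal illustration, not the article's method: the label values, the sample data, and the function names `agreement_rate` and `cohens_kappa` are assumptions made for the example, and the judge verdicts are hard-coded rather than produced by a real chain such as QAEvalChain.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on the same eight golden-dataset examples.
human_labels = ["CORRECT", "CORRECT", "INCORRECT", "CORRECT",
                "INCORRECT", "INCORRECT", "CORRECT", "CORRECT"]
judge_labels = ["CORRECT", "CORRECT", "INCORRECT", "INCORRECT",
                "INCORRECT", "CORRECT", "CORRECT", "CORRECT"]

print(f"agreement: {agreement_rate(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

Raw agreement can look high when one label dominates, which is why the sketch also reports a chance-corrected score. What counts as "good enough" agreement is a team decision; the notes do not specify a threshold.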

Insights

The article’s most useful framing is that once you add a RAG pipeline or fine-tune a model, evaluation responsibility shifts from the model provider to the builder. This is often underappreciated: teams assume a model that scores well on benchmarks will work well in their specific domain, but retrieval quality, prompt templates, and data preprocessing all affect output quality independently of base model capability. The CI/CE/CD framework makes evaluation a first-class engineering concern rather than an afterthought, which is what separates ad-hoc LLM apps from production-grade LLM systems.
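
One way to picture evaluation as a pipeline step is an offline gate that scores the system on a golden dataset and blocks deployment below a threshold. The sketch below is an assumption-laden illustration: `toy_system`, the golden examples, the exact-match scorer, and the 0.8 threshold are all hypothetical stand-ins, and a real system would likely use a judge model or semantic metrics instead of exact match.

```python
from typing import Callable, Dict, List

GoldenExample = Dict[str, str]  # hypothetical schema: {"input": ..., "expected": ...}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.lower().split())


def run_offline_eval(system: Callable[[str], str],
                     golden: List[GoldenExample]) -> float:
    """Return the fraction of golden examples the system answers exactly."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(
        normalize(system(example["input"])) == normalize(example["expected"])
        for example in golden
    )
    return hits / len(golden)


def evaluation_gate(score: float, threshold: float = 0.8) -> bool:
    """True if the score is high enough for the pipeline to proceed."""
    return score >= threshold


def toy_system(question: str) -> str:
    """Stand-in for the real RAG pipeline or fine-tuned model."""
    canned = {
        "capital of france?": "Paris",
        "2 + 2?": "4",
        "largest planet?": "Jupiter",
    }
    return canned.get(question.lower(), "I don't know")


golden_dataset: List[GoldenExample] = [
    {"input": "Capital of France?", "expected": "paris"},
    {"input": "2 + 2?", "expected": "4"},
    {"input": "Largest planet?", "expected": "Jupiter"},
    {"input": "Boiling point of water in Celsius?", "expected": "100"},
]

score = run_offline_eval(toy_system, golden_dataset)
print(f"score={score:.2f} pass={evaluation_gate(score)}")
```

In a CI job, a failing gate would typically translate into a non-zero exit code so the deployment stage does not run. Re-running the same function on a schedule against fresh samples is one simple way to approximate the continuous part of CE.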

Connections

Raw Excerpt

Does your evaluation process resemble the repetitive loop of running LLM applications on a list of prompts, manually inspecting outputs, and attempting to gauge quality based on each input? If so, it’s time to recognize that evaluation is not a one-time endeavor but a multi-step, iterative process.