Summary

Part 1 of a 4-part series introducing MLOps for ML practitioners. Covers the evolution from DevOps to MLOps, the ML model lifecycle stages, core MLOps principles, benefits, challenges, and tooling overview. Targets readers who build ML models but lack exposure to production deployment practices. Tools covered include DVC, PyCaret, MLflow, and FastAPI.

Key Points

  • Core problem: ML code is a small fraction of real-world ML systems — the surrounding production infrastructure is the hard part (Hidden Technical Debt in ML Systems framing)
  • DevOps → MLOps: MLOps extends DevOps CI/CD principles to cover data versioning, model experimentation, evaluation, monitoring, and retraining — not just code deployment
  • ML lifecycle stages: data ingestion → validation → preprocessing → model training → evaluation → serving → monitoring → retraining loop
  • MLOps principles: reproducibility, automation, continuous training (CT), continuous monitoring (CM), versioning (code + data + model), collaboration
  • Tooling landscape mentioned: DVC (data versioning), MLflow (experiment tracking), PyCaret (AutoML), FastAPI (model serving), AWS (cloud deployment)
  • Adoption gap: a survey cited at the time of writing found that only ~25% of ML projects reached production; MLOps practices target the remaining ~75% that stall before deployment
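
The lifecycle stages listed above can be sketched as a chain of plain Python functions. This is a minimal illustration, not code from the article: the function names, the toy one-parameter model, the sample rows, and the 0.5 error threshold are all assumptions, chosen only to show how monitoring feeds the retraining loop.

```python
from statistics import mean

# All names, data, and thresholds here are hypothetical, for illustration only.

def ingest():
    """Data ingestion: stand-in for reading (feature, target) rows from a source."""
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (None, 5.0)]

def validate(rows):
    """Validation: drop rows with missing values and fail fast if none remain."""
    clean = [(x, y) for x, y in rows if x is not None and y is not None]
    if not clean:
        raise ValueError("no valid rows after validation")
    return clean

def preprocess(rows):
    """Preprocessing: split rows into a feature list and a target list."""
    return [x for x, _ in rows], [y for _, y in rows]

def train(xs, ys):
    """Training: least-squares fit of y = slope * x (a line through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def evaluate(model, xs, ys):
    """Evaluation: mean absolute error of the model on the given data."""
    return mean(abs(model * x - y) for x, y in zip(xs, ys))

def serve(model, x):
    """Serving: return a prediction for a single input."""
    return model * x

def run_pipeline(rows):
    """Run validation, preprocessing, training, and evaluation in order."""
    xs, ys = preprocess(validate(rows))
    model = train(xs, ys)
    return model, evaluate(model, xs, ys)

def run_with_monitoring(initial_rows, live_rows, threshold=0.5):
    """Monitoring and retraining: retrain when the error on live data is too high."""
    model, _ = run_pipeline(initial_rows)
    live_xs, live_ys = preprocess(validate(live_rows))
    live_error = evaluate(model, live_xs, live_ys)
    retrained = live_error > threshold
    if retrained:
        model, _ = run_pipeline(live_rows)
    return model, live_error, retrained

# Live data whose input/output relationship has drifted away from the training data.
drifted_rows = [(1.0, 3.0), (2.0, 6.1), (3.0, 8.9)]
final_model, live_error, retrained = run_with_monitoring(ingest(), drifted_rows)
print(f"live MAE={live_error:.2f}, retrained={retrained}, "
      f"prediction for x=5: {serve(final_model, 5.0):.2f}")
```

In a real system each function would be a separate, versioned pipeline step (for example, data tracked with DVC and runs logged to MLflow), and the retraining trigger would typically run on a schedule or on a monitoring alert rather than inline.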

Insights

The framing “ML code is a small fraction of real-world ML systems” (from Google’s NIPS 2015 paper, “Hidden Technical Debt in Machine Learning Systems”) remains the most useful mental model for understanding why MLOps exists. Most ML practitioners underestimate the infrastructure surface area: data pipelines, serving infrastructure, monitoring, feedback loops, and retraining automation dwarf the modeling code.

The article dates from 2021, so its tooling overview is out of date (Weights & Biases, Ray, Vertex AI, and SageMaker Pipelines have matured significantly since then), but the lifecycle and principles sections remain accurate and pedagogically useful.

Connections

Raw Excerpt

It takes far longer to deploy ML models to production than it does to create them. The actual ML code makes up just a small portion of real-world ML systems, while the surrounding infrastructure in the production environment is extensive and complicated.