Summary

A hands-on MLOps tutorial covering DVC (Data Version Control), an open-source CLI tool that brings Git-like versioning to large datasets and ML models. DVC stores lightweight .dvc pointer files in Git while actual data lives in remote storage (e.g., S3). The tutorial walks through initializing DVC, configuring an S3 remote, migrating a dataset from Git tracking to DVC tracking, and pushing versioned data — with the next step being full automation via Airflow on Kubernetes.

This is a hands-on MLOps tutorial introducing how DVC (Data Version Control) solves the problem that Git cannot version-control large datasets. DVC stores lightweight .dvc pointer files in Git, while the actual data lives in remote storage such as S3. The tutorial covers initializing DVC, configuring an S3 remote, migrating a dataset from Git tracking to DVC tracking, and pushing versioned data; the next step is integrating Airflow on Kubernetes for full automation.

Key Points

  • DVC = “Git for Data”: stores .dvc pointer files in Git, actual data in remote storage (S3)
  • Git and DVC cannot both track the same file — must git rm --cached before dvc add
  • DVC uses content-addressed storage (MD5 hash): only stores genuinely new content, no duplication
  • In production: Airflow ETL DAG runs dvc add + dvc push automatically after each pipeline run
  • Data scientists consume via git checkout + dvc pull to reproduce exact code+data state
  • git commit after dvc push is critical — without it, the dataset version is not linked to the codebase
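The manual workflow described in the points above can be sketched as a shell session (the bucket name and dataset path are illustrative, not from the tutorial):

```shell
# One-time setup: initialize DVC and register a default S3 remote
# (bucket "my-ml-bucket" is a placeholder)
dvc init
dvc remote add -d storage s3://my-ml-bucket/dvc-store

# Migrate a file from Git tracking to DVC tracking:
# Git must stop tracking the data before DVC can take over
git rm --cached data/train.csv
dvc add data/train.csv            # writes the data/train.csv.dvc pointer file

# Version the pointer, then push the actual bytes to S3;
# the commit is what links this dataset version to the codebase
git add data/train.csv.dvc .gitignore
git commit -m "Track train.csv with DVC"
dvc push

# Consumer side: reproduce an exact code+data state
git checkout <commit>
dvc pull
```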

Insights

  • The split of responsibilities is clean: Git owns code and metadata (.dvc files), DVC owns the sync layer, S3 owns the actual bytes — each tool does what it’s good at
  • Content-addressed storage means versioning is space-efficient: deduplication happens at the file level (a new hash means a new file is stored; an unchanged file is reused, never duplicated)
  • DevOps engineers own the S3 bucket, IAM permissions, and DVC CLI provisioning in Airflow workers — this is infrastructure work, not data science work
  • The git checkout + dvc pull reproducibility pattern is the key value proposition: any past experiment state can be exactly reconstructed
  • Next step (Airflow on K8s) closes the last manual loop — currently dvc push still requires a human; automating it makes the entire data pipeline fully self-versioning
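The content-addressing idea above can be illustrated with a minimal sketch. This toy store mimics the principle (identical bytes map to the same MD5 key, so nothing is duplicated); it is not DVC's actual cache layout:

```python
import hashlib

# A toy content-addressed store: files are keyed by the MD5 of their
# bytes, so identical content is stored exactly once.
store = {}

def add(content: bytes) -> str:
    """Store content under its MD5 hash; return the hash (the 'address')."""
    key = hashlib.md5(content).hexdigest()
    store.setdefault(key, content)  # unchanged content is reused, not re-stored
    return key

v1 = add(b"label,value\ncat,1\n")   # first version of a dataset
v2 = add(b"label,value\ncat,1\n")   # re-adding identical data: same address
v3 = add(b"label,value\ncat,2\n")   # genuinely new content: new entry

print(v1 == v2)    # True - deduplicated
print(len(store))  # 2    - only two distinct versions stored
```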

Connections

Raw Excerpt

DVC stores lightweight pointer files (.dvc files) in Git and the actual data resides in remote storage (e.g., Amazon S3). Simply put, it is the bridge between your Git repo and the storage where data resides.
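For reference, a .dvc pointer file is a small YAML document recording the hash, size, and path of the tracked data; a typical one looks roughly like this (hash and size values are illustrative):

```yaml
outs:
- md5: a3cca2b2aa1e3b5b3b5aad99a8529074
  size: 13667
  path: train.csv
```

Because only this pointer lives in Git, commits stay small while the hash uniquely identifies the exact dataset version held in S3.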