Summary
A hands-on MLOps tutorial covering DVC (Data Version Control), an open-source CLI tool that brings Git-like versioning to large datasets and ML models, addressing Git's inability to version large data files well. DVC stores lightweight .dvc pointer files in Git while the actual data lives in remote storage (e.g., S3). The tutorial walks through initializing DVC, configuring an S3 remote, migrating a dataset from Git tracking to DVC tracking, and pushing versioned data. The next step is full automation via Airflow on Kubernetes.
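The walkthrough described above can be sketched as an ordered list of CLI calls. This is only an illustration, not the tutorial's own script: the dataset path `data/dataset.csv`, the remote name `s3remote`, and the bucket URL `s3://example-bucket/dvc-store` are hypothetical placeholders, and the helper `migration_commands` is invented here. It only builds the commands and does not execute anything.

```python
import shlex
from pathlib import PurePosixPath


def migration_commands(dataset, remote_name, remote_url):
    """Return the CLI steps, in order, as argv lists (nothing is executed)."""
    dataset = PurePosixPath(dataset)
    pointer = f"{dataset}.dvc"                        # pointer file written by `dvc add`
    gitignore = str(dataset.parent / ".gitignore")    # `dvc add` lists the data file here
    return [
        ["dvc", "init"],                                          # set up DVC inside the Git repo
        ["dvc", "remote", "add", "-d", remote_name, remote_url],  # register the default remote
        ["git", "rm", "-r", "--cached", str(dataset)],            # stop Git tracking, keep the file on disk
        ["dvc", "add", str(dataset)],                             # start DVC tracking
        ["dvc", "push"],                                          # upload the data to the remote
        ["git", "add", ".dvc/config", pointer, gitignore],        # stage the metadata
        ["git", "commit", "-m", f"Track {dataset} with DVC"],     # link this data version to the code
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in migration_commands(
        "data/dataset.csv", "s3remote", "s3://example-bucket/dvc-store"
    ):
        print(shlex.join(cmd))
```

Running the module prints the commands so they can be reviewed before being run by hand in a repository with valid AWS credentials.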
Key Points
- DVC = “Git for Data”: stores `.dvc` pointer files in Git, actual data in remote storage (S3)
- Git and DVC cannot both track the same file: must `git rm --cached` before `dvc add`
- DVC uses content-addressed storage (MD5 hash): only stores genuinely new content, no duplication
- In production: Airflow ETL DAG runs `dvc add` + `dvc push` automatically after each pipeline run
- Data scientists consume via `git checkout` + `dvc pull` to reproduce exact code+data state
- `git commit` after `dvc push` is critical: without it, the dataset version is not linked to the codebase
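The content-addressing idea in the list above can be illustrated with a toy in-memory store. This is a minimal sketch of the general technique, not DVC's actual cache implementation; the class `ContentStore` and the sample CSV bytes are made up for the example.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: each blob is keyed by its MD5 digest."""

    def __init__(self):
        self.blobs = {}

    def add(self, data: bytes) -> str:
        """Store data if its digest is new; return the digest either way."""
        digest = hashlib.md5(data).hexdigest()
        # setdefault keeps the existing blob when the digest is already present,
        # so identical content is never stored twice.
        self.blobs.setdefault(digest, data)
        return digest


if __name__ == "__main__":
    store = ContentStore()
    first = store.add(b"a,b\n1,2\n")
    same = store.add(b"a,b\n1,2\n")           # identical content, same key
    changed = store.add(b"a,b\n1,2\n3,4\n")   # new content, new key
    print(first == same, first != changed, len(store.blobs))
```

The digest plays the role that the hash recorded in a `.dvc` pointer file plays: it identifies one exact version of the content.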
Insights
- The split of responsibilities is clean: Git owns code and metadata (`.dvc` files), DVC owns the sync layer, S3 owns the actual bytes; each tool does what it’s good at
- Content-addressed storage means versioning is space-efficient: deduplication happens at the file level, so a changed file gets a new hash and is stored in full, while an unchanged file is reused rather than copied
- DevOps engineers own the S3 bucket, IAM permissions, and DVC CLI provisioning in Airflow workers — this is infrastructure work, not data science work
- The `git checkout` + `dvc pull` reproducibility pattern is the key value proposition: any past experiment state can be exactly reconstructed
- Next step (Airflow on K8s) closes the last manual loop: currently `dvc push` still requires a human; automating it makes the entire data pipeline fully self-versioning
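The reproducibility pattern noted above can be sketched as a small helper. The function name `restore` and the tag `v1.0` are hypothetical; by default it only returns the commands, and actually running them assumes a Git repository with DVC initialized and a reachable remote.

```python
import subprocess


def restore(revision, dry_run=True):
    """Build (and optionally run) the commands that rebuild a past code+data state."""
    commands = [
        ["git", "checkout", revision],  # restores the code and the .dvc pointer files
        ["dvc", "pull"],                # downloads the data those pointers reference
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return commands


if __name__ == "__main__":
    # "v1.0" is a placeholder Git tag; a commit hash or branch name also works.
    for cmd in restore("v1.0"):
        print(" ".join(cmd))
```

The order matters: the pointer files must be at the target revision before `dvc pull` can know which data to fetch.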
Connections
Raw Excerpt
DVC stores lightweight pointer files (.dvc files) in Git and the actual data resides in a remote storage (Eg., Amazon S3). Simply put, it is the bridge between your Git repo and storage where data resides.