Summary

Guide to managing large ML artifacts (datasets, model weights) in version control. Explains why Git fails for large files (repository bloat, performance, platform limits), the integration vs. separation dilemma, and how Git LFS and DVC serve as middleware solutions — extending Git’s interface while storing large files in object storage.

Key Points

  • Git handles large ML files poorly: every version of a file stays in history, so repositories bloat and clone times balloon; GitHub warns on files over 50 MB and blocks files over 100 MB
  • Integration approach (everything in one repo): atomic commits, simple workflows, easier reproducibility — but hits Git limits
  • Separation approach (object storage + metadata): solves limits but creates version synchronization problems
  • Git LFS (released by GitHub in 2015): replaces large files in the repository with small text pointers and stores the actual contents on a separate LFS server; supported natively by GitHub and GitLab
  • DVC (Data Version Control): Git-like abstraction for data; supports S3/GCS/Azure backends; designed specifically for ML workflows with pipelines
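  • Key difference: DVC adds DAG-based pipeline tracking (stages defined in dvc.yaml and re-executed with dvc repro; the older dvc run command has been replaced by dvc stage add in recent releases); LFS is simpler: pure large-file storage with no ML-specific features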
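
To make the "Git tracks pointers" idea concrete, here is a minimal Python sketch, not the Git LFS implementation itself. It builds and parses a pointer in the Git LFS v1 text layout (a version line, a sha256 oid, and a size in bytes) and classifies a file size against the GitHub thresholds mentioned above. The function names are hypothetical, and reading the 50 MB and 100 MB limits as binary mebibytes is an assumption.

```python
import hashlib

LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"

# Assumption: the 50 MB / 100 MB limits are treated as MiB here.
WARN_BYTES = 50 * 1024 * 1024
BLOCK_BYTES = 100 * 1024 * 1024


def github_size_status(size_bytes: int) -> str:
    """Classify a file size against the warn/block thresholds."""
    if size_bytes > BLOCK_BYTES:
        return "blocked"
    if size_bytes > WARN_BYTES:
        return "warning"
    return "ok"


def make_lfs_pointer(content: bytes) -> str:
    """Return the small text pointer that Git would store instead of content."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC_URL}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_lfs_pointer(text: str) -> dict:
    """Parse a pointer back into its hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    if fields.get("version") != LFS_SPEC_URL:
        raise ValueError("not a Git LFS v1 pointer")
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


if __name__ == "__main__":
    weights = b"\x00" * 1024  # stand-in for a model weights file
    pointer = make_lfs_pointer(weights)
    print(pointer, end="")
    print(parse_lfs_pointer(pointer))
    print(github_size_status(120 * 1024 * 1024))
```

The pointer stays at three short lines regardless of how large the tracked file is, which is why Git history remains small even when the weights change on every commit.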

Insights

The “integration vs. separation dilemma” framing is the most useful conceptual contribution. Teams consistently underestimate the version synchronization problem when separating code from data — “code commit A works with dataset version B” is harder to maintain than it sounds, especially in long-running projects with multiple contributors. DVC’s approach (track data pointers in Git, store data in object storage, version pipelines as DAGs) is the most principled solution because it maintains reproducibility as a first-class concern rather than an afterthought.
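
The synchronization problem and the pointer-based answer to it can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DVC's actual file format or API: the `.ptr.json` pointer file, the `track` and `checkout` helpers, and the local directory standing in for object storage are all hypothetical. The point is that the pointer is small enough to commit alongside the code, so checking out an old commit also pins the matching dataset version.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

POINTER_SUFFIX = ".ptr.json"  # hypothetical pointer extension, not a DVC format


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def track(data_path: Path, store: Path) -> Path:
    """Copy data into the store under its hash and write a pointer beside it."""
    digest = file_sha256(data_path)
    store.mkdir(parents=True, exist_ok=True)
    target = store / digest
    if not target.exists():
        shutil.copy2(data_path, target)
    pointer = data_path.with_name(data_path.name + POINTER_SUFFIX)
    meta = {"sha256": digest, "size": data_path.stat().st_size}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def checkout(pointer: Path, store: Path) -> Path:
    """Restore the exact data version that the pointer refers to."""
    meta = json.loads(pointer.read_text())
    data_path = pointer.with_name(pointer.name[: -len(POINTER_SUFFIX)])
    shutil.copy2(store / meta["sha256"], data_path)
    if file_sha256(data_path) != meta["sha256"]:
        raise RuntimeError("store contents do not match pointer")
    return data_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store = root / "store"  # stands in for an S3/GCS/Azure bucket
        data = root / "train.csv"

        data.write_text("a,b\n1,2\n")
        ptr = track(data, store)
        pointer_at_commit_a = ptr.read_text()  # what commit A would contain

        data.write_text("a,b\n3,4\n")  # dataset changes later
        track(data, store)

        ptr.write_text(pointer_at_commit_a)  # simulate checking out commit A
        print(checkout(ptr, store).read_text())
```

Because the store is keyed by content hash, both dataset versions coexist in it, and which one a working copy sees is decided entirely by the pointer recorded in the commit.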

Connections

Raw Excerpt

Git’s architecture creates several specific problems when dealing with large files: Repository Bloat, Performance Degradation, Collaboration Friction, Platform Limitations, CI/CD Pipeline Inefficiency.