Summary

The AI team at Lifen, a French healthcare ML company, describes its migration from GitLab CI to Kubeflow Pipelines after growing from 1 to 3+ ML engineers, with a corresponding increase in training jobs. The move was driven by two needs: GPU scalability and experiment observability (comparing runs, tracking metrics). The approach wraps existing Python methods in func_to_container_op, generates typed config dataclasses, and keeps GitLab as the CI trigger that calls the Kubeflow API.

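In short: Lifen, a French healthcare ML company, moved its ML workflows from GitLab CI to Kubeflow Pipelines. The migration was motivated by the need for GPU scalability and experiment observability. GitLab was kept as the CI trigger and launches pipelines through the Kubeflow API.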
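The notes do not show what the typed config dataclasses look like. A minimal sketch of the idea, assuming a hypothetical TrainingConfig whose field names and defaults are invented here for illustration:

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical training config; field names are illustrative only."""

    algorithm: str
    epochs: int = 1
    learning_rate: float = 0.001
    use_gpu: bool = False

    def __post_init__(self) -> None:
        # Fail fast on wrongly typed values instead of deep inside a training job.
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"{field.name} must be {field.type.__name__}, "
                    f"got {type(value).__name__}"
                )


config = TrainingConfig(algorithm="toy_classifier", epochs=2)
# Flatten to plain values, e.g. to hand them over as pipeline parameters.
params = asdict(config)
```

The check is strict on purpose: an integer passed where a float is declared is rejected, which may or may not match how the original team validates its configs.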

Key Points

  • Why leave GitLab: runner scaling for GPU jobs was manual; no built-in experiment comparison or metric tracking
  • Migration pattern: wrap each Python method in func_to_container_op, compile to Argo Workflow YAML — no rewrite of core logic
  • Hybrid CI: GitLab runners installed on the Kubeflow k8s cluster; .gitlab-ci.yml reduced to ~20 lines that call pipelines.<job_name> scripts
  • Experiment organization: each branch becomes a named Kubeflow experiment (tagged with JIRA ticket number)
  • Continuous learning trigger: master branch commits auto-launch toy (short) training runs for each algorithm
  • Tradeoff acknowledged: Kubeflow documentation was weak; setup took longer than expected; MLOps ecosystem was immature at time of writing (2021)
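The migration pattern in the list above can be sketched as follows. This is an illustration, not Lifen's code: the four job names come from the article's excerpt, the function bodies are placeholders, and the pipeline name and parameter are invented. It assumes a 1.x release of the kfp SDK, where func_to_container_op is available; the import is deferred so the plain functions stay usable without kfp installed.

```python
def fetch_dataset(n_rows: int) -> str:
    """Placeholder: pretend to fetch a dataset and return a reference to it."""
    return f"dataset-with-{n_rows}-rows"


def build_and_fit(dataset_ref: str) -> str:
    """Placeholder: pretend to train a model on the dataset."""
    return f"model-fitted-on-{dataset_ref}"


def evaluate_model(model_ref: str) -> float:
    """Placeholder: return a fixed score instead of a real evaluation."""
    return 0.5


def report(model_ref: str, score: float) -> str:
    """Placeholder: format a one-line report."""
    return f"{model_ref}: score={score:.2f}"


def compile_pipeline(output_path: str = "training_pipeline.yaml") -> None:
    """Wrap the unchanged functions as components and compile the pipeline."""
    # Deferred imports: only this step needs the kfp SDK (1.x assumed).
    from kfp import compiler, dsl
    from kfp.components import func_to_container_op

    fetch_op = func_to_container_op(fetch_dataset)
    fit_op = func_to_container_op(build_and_fit)
    eval_op = func_to_container_op(evaluate_model)
    report_op = func_to_container_op(report)

    @dsl.pipeline(name="toy-training", description="Illustrative four-step pipeline")
    def training_pipeline(n_rows: int = 100):
        fetched = fetch_op(n_rows)
        fitted = fit_op(fetched.output)
        evaluated = eval_op(fitted.output)
        report_op(fitted.output, evaluated.output)

    # Produces the Argo Workflow YAML that Kubeflow Pipelines runs.
    compiler.Compiler().compile(training_pipeline, output_path)


# The core logic still runs as ordinary Python, outside any pipeline.
model = build_and_fit(fetch_dataset(100))
summary = report(model, evaluate_model(model))
```

Keeping the wrapping in a separate function is one way to leave the core logic untouched: the same functions can be unit-tested locally and only become containerized steps when compile_pipeline is called.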

Insights

The “use GitLab as orchestrator for Kubeflow” pattern avoids rebuilding CI from scratch while gaining Kubeflow’s experiment tracking. This hybrid is common in orgs that already have GitLab investment and don’t want to adopt a fully new CI system.
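A rough sketch of what a script called from .gitlab-ci.yml might do under this pattern. CI_COMMIT_REF_NAME and CI_COMMIT_SHORT_SHA are GitLab's predefined CI variables; everything else is an assumption made for illustration: the ticket format, the naming scheme, the toy_run pipeline parameter, and the idea that only master launches toy runs through this flag.

```python
import re
from typing import Dict, Mapping

# Assumed ticket format (e.g. "ML-42"); the notes only say that experiments
# are tagged with a JIRA ticket number.
JIRA_TICKET = re.compile(r"[A-Z][A-Z0-9]+-\d+")


def experiment_name(branch: str) -> str:
    """Name the experiment after the branch, prefixed by its ticket if any."""
    match = JIRA_TICKET.search(branch)
    return f"{match.group(0)} ({branch})" if match else branch


def build_run_request(env: Mapping[str, str]) -> Dict[str, object]:
    """Derive run metadata from GitLab's predefined CI variables."""
    branch = env["CI_COMMIT_REF_NAME"]
    sha = env["CI_COMMIT_SHORT_SHA"]
    return {
        "experiment_name": experiment_name(branch),
        "run_name": f"{branch}@{sha}",
        # Assumption: commits on master launch short "toy" training runs.
        "toy_run": branch == "master",
    }


def submit(request: Mapping[str, object], pipeline_file: str, host: str):
    """Send a compiled pipeline to Kubeflow; needs the kfp SDK and a cluster."""
    import kfp  # deferred so the helpers above work without kfp installed

    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(
        pipeline_file,
        # "toy_run" is a hypothetical pipeline parameter.
        arguments={"toy_run": request["toy_run"]},
        run_name=request["run_name"],
        experiment_name=request["experiment_name"],
    )


example = build_run_request(
    {"CI_COMMIT_REF_NAME": "feature/ML-42-new-tokenizer", "CI_COMMIT_SHORT_SHA": "abc1234"}
)
```

In a CI job the mapping passed to build_run_request would simply be os.environ, and the host would come from a CI/CD variable of the team's choosing.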

The 2021 timing matters: Kubeflow has matured significantly since then, and alternatives like MLflow, Weights & Biases, and Prefect have also grown. This article is a historical data point in the evolution of MLOps tooling: the pain points it describes (runner scaling, experiment comparison) are now table-stakes features in most ML platforms.

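The article's hybrid architecture (GitLab triggering Kubeflow) avoids rebuilding the CI system. In 2021 the MLOps ecosystem was still immature; the pain points described (runner scaling, experiment comparison) are now standard features of most ML platforms.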

Connections

Raw Excerpt

Most of our Gitlab pipelines were made of four consecutive jobs (fetch_dataset, build_and_fit, evaluate_model, report). In Kubeflow, things are a little bit different as everything is written in Python (a combination of pure Python and Kubeflow DSL), after which a compiler translates the Python code into an Argo Workflow.