Summary

Netflix Engineering’s description of the EVCache cache warming system, built to keep petabytes of cached data warm through large fleet changes. Two warming strategies: replica warmer (copy data from an existing warm replica to new replicas) and instance warmer (re-warm a replaced or terminated node using data from another replica). Architecture uses Controller, Dumper, and Populator components, with dumps staged in S3 and Memcached’s LRU crawler used for efficient key enumeration.

Key Points

  • Problem: cold cache after fleet scaling/replacement causes massive origin load spike; at Netflix’s scale (petabytes), this is catastrophic
  • Two warming strategies: (1) Replica warmer: copy key-value data from an existing warm replica to new replicas (fast, but requires a warm source); (2) Instance warmer: re-warm a node that was replaced or terminated, using data dumped from another replica
  • Memcached LRU crawler: used to enumerate all keys in a cache instance efficiently without disrupting serving
  • Controller/Dumper/Populator architecture: Controller orchestrates the workflow, Dumper writes data from the source cache to S3, Populator reads the dump from S3 and loads it into the new cache nodes
  • Scale: system handles petabytes of cached data across thousands of Memcached instances
  • Agility goal: enable rapid fleet changes (deploys, scaling, region migrations) without waiting hours for organic cache warm-up
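
A minimal sketch of the Controller/Dumper/Populator flow described above, written in Python with in-memory stand-ins. Everything here is an assumption for illustration: the function names (dump, populate, warm), the chunk naming, and the use of plain dicts in place of Memcached nodes and an S3 bucket are hypothetical and do not come from Netflix's implementation.

```python
from typing import Dict, List, Tuple

Cache = Dict[str, bytes]                       # stand-in for one Memcached node
Bucket = Dict[str, List[Tuple[str, bytes]]]    # stand-in for S3: object name -> chunk


def dump(source: Cache, bucket: Bucket, chunk_size: int = 2) -> List[str]:
    """Dumper (hypothetical): write the source's key-value pairs to the bucket in chunks."""
    items = sorted(source.items())
    names = []
    for start in range(0, len(items), chunk_size):
        name = f"chunk-{start // chunk_size:05d}"
        bucket[name] = items[start:start + chunk_size]
        names.append(name)
    return names


def populate(bucket: Bucket, names: List[str], destination: Cache) -> int:
    """Populator (hypothetical): load each chunk into the destination node."""
    written = 0
    for name in names:
        for key, value in bucket[name]:
            destination[key] = value
            written += 1
    return written


def warm(source: Cache, destination: Cache) -> int:
    """Controller (hypothetical): run the dump stage, then the populate stage."""
    bucket: Bucket = {}
    names = dump(source, bucket)
    return populate(bucket, names, destination)
```

Staging chunks in an intermediate store decouples the two stages: the Dumper can finish quickly on the serving node, and the Populator can consume chunks at whatever rate the destination tolerates.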

Insights

Cache warming is an often-overlooked operational requirement that only becomes visible at scale. The replica warmer / instance warmer split mirrors a general pattern: fast path (copy from existing state) vs. reliable path (rebuild from ground truth). The Memcached LRU crawler usage is clever: it walks the cache’s own LRU structures, so the dump reflects what the cache currently holds and how recently each key was accessed, which is exactly the state you want to preserve during migration. For services where cache miss rate directly translates to user-facing latency (video streaming), warm-up time is a hard operational constraint that bounds how fast you can deploy or scale.
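
To make the LRU crawler point concrete, here is a rough sketch of parsing the output of Memcached's `lru_crawler metadump all` command and ranking keys by last access time. The line layout (`key=… exp=… la=… …` terminated by `END`, with URL-encoded keys) is assumed from Memcached's metadump output and may vary by version; the helper names and sample data are hypothetical, and the source does not say how Netflix's Dumper processes this output.

```python
from typing import Dict, Iterable, List
from urllib.parse import unquote


def parse_metadump(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Turn metadump lines into dicts of field name -> value (assumed format)."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line == "END":
            continue
        # Each token looks like name=value; split only on the first "=".
        fields = dict(token.split("=", 1) for token in line.split())
        fields["key"] = unquote(fields["key"])  # keys are URL-encoded in the dump
        entries.append(fields)
    return entries


def most_recent_keys(entries: List[Dict[str, str]], limit: int) -> List[str]:
    """Return up to `limit` keys, most recently accessed first (by the `la` field)."""
    ranked = sorted(entries, key=lambda e: int(e["la"]), reverse=True)
    return [e["key"] for e in ranked[:limit]]


# Made-up sample lines for illustration only.
sample = [
    "key=user%3A1 exp=-1 la=1700000300 cas=11 fetch=yes cls=1 size=80",
    "key=user%3A2 exp=1700009999 la=1700000100 cas=12 fetch=no cls=1 size=96",
    "key=title%3A9 exp=-1 la=1700000200 cas=13 fetch=yes cls=2 size=120",
    "END",
]
```

With the sample above, `most_recent_keys(parse_metadump(sample), 2)` yields the two most recently touched keys, which is one way a warmer could prioritize what to copy first.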

Connections

Raw Excerpt

Without cache warming, a cold cache after a fleet replacement would cause a thundering herd of requests to our origin services — at our scale, this would be catastrophic. Cache warming lets us move fast without breaking things.