Summary

Kakashi's comprehensive notes on caching in distributed systems, covering when to use a cache, the types of cache (local vs. external), and three classic failure modes: cache penetration (lookups that never hit the cache fall through to the DB), cache avalanche (many entries expire at the same time), and cache stampede (also called cache breakdown or the thundering herd problem: many requests rebuild the same missing key at once). Solutions for each failure mode are documented, drawn partly from AWS's caching challenges article.

Key Points

  • Local cache: in-process memory; simple, but each node keeps its own copy, so nodes can disagree, and the cache starts cold after every restart
  • External cache (Redis/Memcached): gives all nodes one shared view, but adds operational complexity, monitoring needs, and availability concerns
  • Cache penetration: requests for keys that are never in the cache (typically keys that do not exist in the DB either) miss every time and hit the DB directly; solutions: cache empty results, bloom filter, validate key format
  • Cache avalanche: a large share of the cache expires at the same time; solutions: randomize TTL, use a circuit breaker, warm up the cache before cutover
  • Cache stampede (dog-pile effect): many requests simultaneously fetch the same missing key; solutions: mutex lock, probabilistic early expiration
  • A low hit rate can make latency worse than not caching at all, since every miss pays for the cache lookup on top of the DB read
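
Three of the mitigations listed above (caching empty results, randomized TTL, and a per-key mutex) can be combined in one cache-aside read path. The following is only a minimal sketch under stated assumptions: the class `CacheAside`, its `loader` callback, and the TTL values are hypothetical names and numbers chosen for illustration, and a plain dictionary stands in for an external cache such as Redis.

```python
import random
import threading
import time

_MISSING = object()  # sentinel stored for keys the DB does not have


class CacheAside:
    """In-process stand-in for an external cache (illustrative only)."""

    def __init__(self, loader, ttl=60.0, negative_ttl=5.0, jitter=0.2):
        self._loader = loader            # hypothetical DB lookup: key -> value or None
        self._ttl = ttl
        self._negative_ttl = negative_ttl
        self._jitter = jitter
        self._data = {}                  # key -> (value, expires_at)
        self._locks = {}                 # key -> lock guarding the reload of that key
        self._guard = threading.Lock()   # protects self._locks

    def _jittered(self, ttl):
        # Avalanche: spread expiry times so keys written together
        # do not all expire together.
        return ttl * (1 + random.uniform(-self._jitter, self._jitter))

    def _lookup(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return True, entry[0]
        return False, None

    def get(self, key):
        hit, value = self._lookup(key)
        if hit:
            return None if value is _MISSING else value

        # Stampede: one thread per key reloads; the others wait on the lock.
        # (This sketch never evicts locks, so self._locks grows with the key set.)
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            hit, value = self._lookup(key)  # re-check: another thread may have filled it
            if not hit:
                loaded = self._loader(key)
                if loaded is None:
                    # Penetration: remember "not found" for a short time.
                    value, ttl = _MISSING, self._negative_ttl
                else:
                    value, ttl = loaded, self._ttl
                expires_at = time.monotonic() + self._jittered(ttl)
                self._data[key] = (value, expires_at)
        return None if value is _MISSING else value
```

A caller would construct it as `CacheAside(loader=db_lookup)` and call `cache.get(key)`. The short negative TTL is a design choice: it keeps a key that is created later from being reported as missing for long.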

Insights

The insight that "adding a cache that ends up with a low hit rate makes latency worse, not better" is counterintuitive but important for capacity planning. Every cache layer is a system component that requires monitoring, failover planning, and management, so the decision to cache should be driven by measured access patterns, not instinct. The bloom filter is an elegant solution for penetration prevention: it is probabilistic, but its errors go in only one direction. It never reports a stored key as absent, so valid requests are never rejected, and a false positive only causes one unnecessary DB hit, not an incorrect result.
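
To make the bloom filter argument concrete, here is a minimal sketch of one placed in front of the DB. It assumes the filter is populated with every key that exists in the DB. `BloomFilter`, `fetch`, the bit-array size, and the salted SHA-256 hashing scheme are illustrative choices, not something taken from the notes.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests of the key.
        for i in range(self._num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, key):
        for pos in self._positions(key):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "present or false positive".
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def fetch(key, bloom, db):
    """db is a hypothetical dict-like store standing in for the real database."""
    if not bloom.might_contain(key):
        return None      # rejected before touching the cache or the DB
    return db.get(key)   # a false positive costs one wasted lookup, nothing more
```

Because `add` sets every bit that `might_contain` later checks, a key that was added can never be rejected. Only unknown keys can slip through, and for those `db.get` still returns the correct "not found" answer.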

Connections

Raw Excerpt

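If the cache's overall hit rate is low, adding a cache may only make latency higher. Together with the cache availability, cache coherence, and cache invalidation problems that come with it, this is key to deciding whether we need a cache at all.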
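
The latency claim in this excerpt can be checked with a back-of-the-envelope model. The sketch below assumes the cache is always consulted first and that a miss then pays the full DB read on top of the cache lookup. The function names and the latency figures are made up for illustration and are not measurements from the notes.

```python
def expected_latency_ms(hit_rate, cache_ms, db_ms):
    """Mean read latency when the cache is checked first and misses fall through."""
    return cache_ms + (1.0 - hit_rate) * db_ms


def break_even_hit_rate(cache_ms, db_ms):
    """Hit rate below which the cached path is slower on average than the DB alone."""
    return cache_ms / db_ms


if __name__ == "__main__":
    cache_ms, db_ms = 2.0, 10.0  # assumed latencies, not measurements
    for hit_rate in (0.1, 0.5, 0.9):
        mean = expected_latency_ms(hit_rate, cache_ms, db_ms)
        print(f"hit rate {hit_rate:.0%}: {mean:.1f} ms cached vs {db_ms:.1f} ms uncached")
    print(f"break-even hit rate: {break_even_hit_rate(cache_ms, db_ms):.0%}")
```

With these assumed numbers, a 10% hit rate averages about 11 ms against 10 ms with no cache, and the cache only starts to pay off above a 20% hit rate. A model that also charges misses for writing the value back to the cache would push the break-even point higher.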