Summary

Armbrust et al. (Databricks/Stanford/Berkeley, VLDB 2020) present Delta Lake, an open-source ACID table storage layer over cloud object stores. It uses a write-ahead transaction log, kept in the object store and compacted into Apache Parquet format, to provide transactions, time travel, and fast metadata operations over S3/Azure Blob without running a dedicated metadata server.

Prerequisites

  • Cloud object stores (S3, Azure Blob) and their key-value store consistency model
  • Apache Parquet columnar file format
  • ACID transactions and write-ahead logs (WAL)
  • Apache Spark data processing

Core Idea

Cloud object stores lack cross-key atomicity, so a table spread over many files cannot be updated consistently by writing the files alone. Delta Lake solves this by maintaining a transaction log (stored in the object store itself) that records which Parquet files belong to a table at each version. All mutations go through optimistic concurrency control against the log: readers always see a consistent snapshot, and writers detect and resolve conflicts when they try to append the next log record. Because all metadata is in the object store (no separate server), compute and storage scale independently.
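The commit protocol described above can be sketched in a few lines. This is a minimal toy, not Delta Lake's implementation: ToyObjectStore, put_if_absent, log_key, and the {"op", "path"} action format are hypothetical names chosen for illustration, the store is an in-memory dict, and the conflict rule (abort only if both writers remove the same file) is a deliberate simplification. It shows how one exclusive write per log record is enough to order commits, and how replaying the log up to a version yields a snapshot.

```python
import json


class ToyObjectStore:
    """In-memory stand-in for an object store with an atomic put-if-absent."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, value):
        # Refuse to overwrite, mimicking a conditional PUT.
        if key in self._objects:
            return False
        self._objects[key] = value
        return True

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


LOG_PREFIX = "_delta_log/"  # illustrative layout, assumed for this sketch


def log_key(version):
    # Zero-padded so lexicographic order matches numeric order.
    return f"{LOG_PREFIX}{version:020d}.json"


def latest_version(store):
    # Versions are dense (0, 1, 2, ...), so the count gives the latest one.
    return len(store.list_keys(LOG_PREFIX)) - 1  # -1: no commits yet


def snapshot(store, version=None):
    """Replay log records 0..version and return the set of live data files."""
    if version is None:
        version = latest_version(store)
    files = set()
    for v in range(version + 1):
        for action in json.loads(store.get(log_key(v))):
            if action["op"] == "add":
                files.add(action["path"])
            elif action["op"] == "remove":
                files.discard(action["path"])
    return files


def commit(store, read_version, actions, max_retries=5):
    """Optimistically write the next log record; re-check and retry on conflict."""
    removed = {a["path"] for a in actions if a["op"] == "remove"}
    version = read_version + 1
    for _ in range(max_retries):
        if store.put_if_absent(log_key(version), json.dumps(actions)):
            return version
        # Another writer won this version. Simplified rule: abort only if it
        # removed a file we also remove; otherwise try the next version.
        winner = json.loads(store.get(log_key(version)))
        if removed & {a["path"] for a in winner if a["op"] == "remove"}:
            raise RuntimeError("conflicting commit; transaction must be redone")
        version += 1
    raise RuntimeError("too many retries")


store = ToyObjectStore()
v0 = commit(store, -1, [{"op": "add", "path": "part-0.parquet"}])
v1 = commit(store, v0, [{"op": "add", "path": "part-1.parquet"}])
# A writer that still believes the table is at v0 collides on version 1,
# finds no conflict, and lands on version 2.
v2 = commit(store, v0, [{"op": "remove", "path": "part-0.parquet"},
                        {"op": "add", "path": "part-2.parquet"}])
current = snapshot(store)
as_of_v0 = snapshot(store, 0)  # time travel: replay only up to version 0
```

Here current holds part-1.parquet and part-2.parquet, while as_of_v0 still holds only part-0.parquet. A real implementation would also read checkpoints instead of replaying from version 0 and would apply a richer conflict check.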

Key features enabled by the log:

  • ACID transactions: multi-object updates are atomic
  • Time travel: query any historical table snapshot via the log
  • UPSERT/DELETE/MERGE: rewrite affected Parquet files transactionally
  • Streaming I/O: low-latency small writes coalesced later by compaction
  • Fast metadata: per-file min/max statistics in the log enable data skipping (pruning files that cannot match a filter) without reading every Parquet file footer
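As a rough illustration of the last bullet, the sketch below prunes files using per-file min/max values for a single column. The FileEntry type, the ts column, and the file names are hypothetical; the only idea taken from the summary is that the statistics live in the log, so the planner can decide which files to open without touching the data files themselves.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """One data file as recorded in the log, with stats for a column `ts`."""
    path: str
    min_ts: int
    max_ts: int


def files_to_scan(entries, lo, hi):
    """Keep files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [e.path for e in entries if e.max_ts >= lo and e.min_ts <= hi]


entries = [
    FileEntry("part-0.parquet", 0, 99),
    FileEntry("part-1.parquet", 100, 199),
    FileEntry("part-2.parquet", 150, 300),
]

# Query: WHERE ts BETWEEN 120 AND 160
selected = files_to_scan(entries, 120, 160)
```

With these entries, part-0.parquet is skipped because its maximum (99) is below the query's lower bound. Skipping is only as effective as the data layout: if every file spans the full value range, nothing can be pruned, which is why layout optimization matters for the reported speedups.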

Results

  • Deployed at thousands of Databricks customers processing exabytes/day
  • Reduced cloud storage–related support escalations from ~50% to nearly zero
  • Query speedups up to 100x for high-dimensional datasets (network security, bioinformatics) via data layout optimization and fast statistics access
  • Tables are readable from Apache Spark, Hive, Presto, Redshift, and Snowflake via connectors

Limitations

Author-stated:

  • Optimistic concurrency can cause write conflicts under high write contention (though mitigations exist)
  • Small-object performance depends on periodic compaction
  • Transactions are scoped to a single table, because each table has its own log; there are no cross-table transactions

Unstated:

  • The transaction log can grow large for high-churn tables; periodic checkpointing and log cleanup are needed
  • Performance depends heavily on the Spark ecosystem; non-Spark connectors are less mature

Reproducibility

  • Code: Open source at github.com/delta-io/delta
  • Data: Production workloads at Databricks; no public benchmark dataset
  • Compute: Not applicable (service-level evaluation)

Insights

The core insight, storing the transaction log inside the same object store as the data, sidesteps the need for a metadata service while enabling ACID semantics. It builds multi-object atomicity on the single primitive the log needs from the store: an atomic, mutually exclusive write of one log object (put-if-absent). The “lakehouse” framing (combining data lake cost with data warehouse features) became influential, and Apache Hudi and Apache Iceberg take a similar table-format approach over object storage. The drop in storage-related support escalations from ~50% to nearly zero is a striking operational result that supports the practical value of transactional storage.

Connections

Raw Excerpt

The core idea of Delta Lake is simple: we maintain information about which objects are part of a Delta table in an ACID manner, using a write-ahead log that is itself stored in the cloud object store. This means that no servers need to be running to maintain state for a Delta table; users only need to launch servers when running queries.