Summary

Armbrust, Ghodsi, Xin, and Zaharia (Databricks/UC Berkeley/Stanford, CIDR 2021) introduce the Lakehouse paradigm: open-format storage (Parquet), an ACID metadata layer, and direct ML/BI access. They argue that it will replace the dominant two-tier data lake + warehouse architecture. This is the paper that coined the term “Lakehouse.”

Prerequisites

  • Columnar storage formats (Parquet, ORC)
  • Data warehouse architecture (star/snowflake schema, BI queries)
  • ACID transactions and write-ahead logs
  • Apache Spark; cloud object stores (S3, ADLS, GCS)

Core Idea

The two-tier lake + warehouse architecture (dominant among Fortune 500 companies as of 2020) has four structural problems:

  1. Reliability: keeping lake and warehouse consistent requires continuous ETL; ETL bugs reduce data quality
  2. Data staleness: warehouse data always lags the lake because of ETL delay; a survey cited in the paper found that 86% of analysts use out-of-date data
  3. Limited ML support: neither lakes nor warehouses are ideal for ML training data access
  4. Total cost: duplicated data + ETL compute + engineering maintenance

Lakehouse solution: add an ACID metadata layer directly over the data lake (open-format files in cloud object storage). The metadata layer provides:

  • Transactions: atomic multi-object updates
  • Schema enforcement and evolution
  • Time travel: query any historical version
  • Data management: Z-ordering, file compaction, bloom filters for fast queries
  • Direct ML access: TensorFlow/PyTorch can read Parquet directly
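
The transaction and time-travel features listed above can be illustrated with a toy commit log. This is a minimal sketch under stated assumptions, not Delta Lake's actual implementation: a local directory stands in for the object store, and exclusive file creation stands in for a put-if-absent primitive. The names `TinyTableLog` and `validate_rows`, the action keys (`schema`, `add`, `remove`), and the file names are all hypothetical.

```python
import json
import os
import tempfile


class TinyTableLog:
    """Toy transaction log; a local directory stands in for the object store."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, read_version, actions):
        """Publish `actions` as version read_version + 1, or fail.

        Mode "x" raises FileExistsError if the entry already exists, so of
        two writers racing for the same version only one succeeds. All
        actions in one entry become visible together (a multi-object update).
        """
        version = read_version + 1
        with open(self._path(version), "x") as f:
            json.dump(actions, f)
        return version

    def snapshot(self, version=None):
        """Replay entries 0..version; return (schema, sorted live files)."""
        if version is None:
            version = self.latest_version()
        schema, files = None, set()
        for v in range(version + 1):
            with open(self._path(v)) as f:
                for action in json.load(f):
                    if "schema" in action:
                        schema = action["schema"]
                    elif "add" in action:
                        files.add(action["add"])
                    elif "remove" in action:
                        files.discard(action["remove"])
        return schema, sorted(files)


def validate_rows(schema, rows):
    """Schema enforcement: reject rows whose columns or types do not match."""
    for row in rows:
        if set(row) != set(schema):
            raise ValueError(f"columns {sorted(row)} != schema {sorted(schema)}")
        for column, type_name in schema.items():
            if type(row[column]).__name__ != type_name:
                raise TypeError(f"column {column!r} expects {type_name}")


log = TinyTableLog(tempfile.mkdtemp())
v0 = log.commit(-1, [{"schema": {"id": "int", "city": "str"}},
                     {"add": "part-0.parquet"}])
v1 = log.commit(v0, [{"add": "part-1.parquet"}])
# Compaction as one transaction: two small files out, one large file in.
v2 = log.commit(v1, [{"remove": "part-0.parquet"},
                     {"remove": "part-1.parquet"},
                     {"add": "part-2.parquet"}])

schema, current_files = log.snapshot()      # latest version: ["part-2.parquet"]
_, files_at_v1 = log.snapshot(version=v1)   # time travel: part-0 and part-1
validate_rows(schema, [{"id": 1, "city": "Oslo"}])  # passes silently
```

A writer that read version 1 and tries to commit version 2 a second time gets `FileExistsError` and has to re-read the log and retry. That optimistic-concurrency pattern is what the sketch is meant to show; a real system would also need readers never to observe a half-written log entry, which this toy does not guarantee.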

Delta Lake is the primary implementation. The headline result is TPC-DS performance competitive with commercial cloud data warehouses.
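
One reason data layout matters for query speed is data skipping: if each file records per-column min/max statistics, a query can ignore files whose range cannot contain the requested value. The sketch below is an illustrative toy rather than the paper's code. It compares a plain sort on one column with a Z-order (bit-interleaved) sort on a 16 x 16 grid of `(x, y)` rows. The function names and the 16-rows-per-file setting are assumptions made for the example.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y (a Morton code)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def write_files(rows, sort_key, rows_per_file):
    """Sort rows, cut them into files, and record per-file min/max stats."""
    ordered = sorted(rows, key=sort_key)
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        files.append({
            "min_x": min(r[0] for r in chunk), "max_x": max(r[0] for r in chunk),
            "min_y": min(r[1] for r in chunk), "max_y": max(r[1] for r in chunk),
        })
    return files


def files_to_read(files, column, value):
    """Data skipping: count files whose [min, max] range can contain value."""
    return sum(1 for f in files
               if f[f"min_{column}"] <= value <= f[f"max_{column}"])


rows = [(x, y) for x in range(16) for y in range(16)]  # 256 rows
by_x = write_files(rows, sort_key=lambda r: r, rows_per_file=16)
by_z = write_files(rows, sort_key=lambda r: z_value(r[0], r[1]), rows_per_file=16)

print(files_to_read(by_x, "x", 5), files_to_read(by_x, "y", 5))  # 1 16
print(files_to_read(by_z, "x", 5), files_to_read(by_z, "y", 5))  # 4 4
```

Sorting by `x` alone makes filters on `x` very cheap but leaves filters on `y` scanning every file. The Z-order layout groups rows into 4 x 4 blocks, so a filter on either column touches 4 of the 16 files. The trade-off is a slightly worse best case in exchange for a much better worst case.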

Results

  • TPC-DS benchmark: Lakehouse (Delta Lake on Databricks Runtime) is competitive with popular cloud data warehouses at scale factor 30,000; the paper does not name the warehouses it compares against
  • Architecture is in use at thousands of enterprises; subsequent industry adoption has confirmed the trend
  • Eliminates the two-tier architecture entirely for most workloads

Limitations

Author-stated:

  • Streaming use cases still had some latency limitations at the time of writing (2021)
  • Governance and data catalog integration were still maturing

Unstated:

  • TPC-DS comparison against commercial warehouses may favor columnar optimizations; real-world mixed workloads are harder to benchmark
  • The reference metadata layer (Delta Lake) is open source but closely tied to a single vendor, Databricks; Apache Iceberg and Apache Hudi offer open alternatives

Reproducibility

  • Code: Delta Lake is open source at github.com/delta-io/delta
  • Data: TPC-DS benchmarks are reproducible with standard tooling
  • Context: Companion VLDB 2020 paper covers Delta Lake implementation in more detail

Insights

This paper coined “Lakehouse” and defined the concept that now underpins most modern data stacks (Delta Lake on Databricks, Iceberg tables on Snowflake, BigLake, Polaris Catalog). The structural diagnosis, that ETL between lake and warehouse is the root problem, is correct and elegant. The architecture eliminates a whole category of failure modes by collapsing two systems into one. The open-format commitment (Parquet/ORC rather than proprietary storage) is strategically important: it enables vendor interoperability and prevents lock-in, which drove adoption of Apache Iceberg alongside Delta Lake.

Connections

Raw Excerpt

The key idea behind the Lakehouse architecture is simple: implement similar data structures and data management features to those in a data warehouse directly on the kind of low cost storage used for data lakes. If the data management features available on modern data lakes can match those in data warehouses, there is no longer a need for a separate warehouse copy of the data.