Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

This article was generated by AI analysis.

Created: 2026-03-28 Source: https://www.cidrdb.org/cidr2021/papers/cidr2021-paper17.pdf

Summary

Armbrust, Ghodsi, Xin, and Zaharia (Databricks/UC Berkeley/Stanford, CIDR 2021) present the Lakehouse paradigm: open-format storage (such as Parquet) in a data lake, an ACID metadata layer on top of it, and direct access for both ML and BI workloads. They argue that this design will replace the dominant two-tier data lake + warehouse architecture. This is the paper that formalized and popularized the term “Lakehouse.”

Prerequisites

Columnar storage formats (Parquet, ORC)
Data warehouse architecture (star/snowflake schema, BI queries)
ACID transactions and write-ahead logs
Apache Spark; cloud object stores (S3, ADLS, GCS)

Core Idea

The two-tier data lake + warehouse architecture (which the authors describe as dominant in industry and used at virtually all Fortune 500 enterprises) has four structural problems:

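Reliability: keeping the lake and the warehouse consistent requires continuous ETL, and every ETL step is a place where bugs can degrade data quality
Data staleness: warehouse data lags the lake by the ETL delay; the paper cites a survey in which 86% of analysts reported using out-of-date data
Limited ML support: warehouses expose data mainly through SQL, which suits ML training poorly, while reading raw lake files gives up management features such as ACID transactions and indexing
Total cost: data is stored twice, and users also pay for ETL compute and engineering maintenance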

Lakehouse solution: add an ACID metadata layer directly over the data lake (open-format files in cloud object storage). The metadata layer provides:

Transactions: atomic multi-object updates
Schema enforcement and evolution
Time travel: query any historical version
Data management: Z-ordering, file compaction, bloom filters for fast queries
Direct ML access: TensorFlow/PyTorch can read Parquet directly
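The transaction, schema-enforcement, and time-travel items above can be illustrated with a toy commit log. This is a minimal sketch, not Delta Lake's actual code or API: the class `TableLog`, the `_log` directory name, and the JSON layout are hypothetical, a local directory stands in for a cloud object store, and the schema check compares column names only. It assumes the store offers an atomic "create only if absent" write, which is what lets exactly one writer win each version number.

```python
import json
import os
import tempfile


class TableLog:
    """Toy metadata layer: an ordered log of JSON commits next to the data files."""

    def __init__(self, root, columns):
        self.log_dir = os.path.join(root, "_log")  # hypothetical directory name
        os.makedirs(self.log_dir, exist_ok=True)
        self.columns = list(columns)  # declared table schema (column names only)

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, add=(), remove=(), columns=None):
        # Schema enforcement: new files must match the declared columns.
        if add and list(columns or []) != self.columns:
            raise ValueError(f"schema mismatch: {columns} != {self.columns}")
        version = self.latest_version() + 1
        path = os.path.join(self.log_dir, f"{version:020d}.json")
        # Mode "x" fails if the file already exists, so two writers cannot both
        # claim the same version; a losing writer would re-read the log and retry.
        with open(path, "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return version

    def snapshot(self, version=None):
        """Replay commits 0..version and return the set of live data files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                entry = json.load(f)
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live


table = TableLog(tempfile.mkdtemp(), columns=["id", "amount"])
v0 = table.commit(add=["part-0.parquet", "part-1.parquet"], columns=["id", "amount"])
# One commit swaps part-0 for part-2: readers see both changes or neither.
v1 = table.commit(add=["part-2.parquet"], remove=["part-0.parquet"],
                  columns=["id", "amount"])

print(sorted(table.snapshot()))    # current version
print(sorted(table.snapshot(v0)))  # time travel to version 0
```

Data files are never modified in place in this sketch; a reader only trusts files named by the log, which is why a single log append can make a multi-file change visible atomically and why older versions stay queryable as long as their files are retained.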

Delta Lake is the primary implementation. In the paper's evaluation, its TPC-DS performance is competitive with commercial cloud data warehouses.
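Part of how a lake of plain files can approach warehouse query speed is data layout. The sketch below illustrates the Z-ordering item from the list above, under the assumption that the metadata layer also keeps per-file min/max statistics so a query can skip files whose range cannot match. The function names and the 16x16 toy table are hypothetical, and rows are kept in memory rather than in Parquet files.

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) key: interleave the bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def write_files(rows, rows_per_file, sort_key):
    """Sort rows, cut them into fixed-size files, and record min/max stats per file."""
    ordered = sorted(rows, key=sort_key)
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        files.append({
            "rows": chunk,
            "min_x": min(r[0] for r in chunk), "max_x": max(r[0] for r in chunk),
            "min_y": min(r[1] for r in chunk), "max_y": max(r[1] for r in chunk),
        })
    return files


def files_to_scan(files, column, value):
    """Count files whose [min, max] range for `column` could contain `value`."""
    return sum(1 for f in files if f[f"min_{column}"] <= value <= f[f"max_{column}"])


rows = [(x, y) for x in range(16) for y in range(16)]  # 256 rows, 16 files of 16
by_x = write_files(rows, 16, sort_key=lambda r: r)     # plain sort on (x, y)
by_z = write_files(rows, 16, sort_key=lambda r: interleave_bits(*r))

print(files_to_scan(by_x, "x", 5), files_to_scan(by_x, "y", 5))  # 1 16
print(files_to_scan(by_z, "x", 5), files_to_scan(by_z, "y", 5))  # 4 4
```

With a plain sort, a filter on x touches one file but a filter on y touches all sixteen. With the Z-order layout each file covers a 4x4 block of the grid, so a filter on either column touches four files. That trade (slightly worse on the leading column, much better on the others) is the reason to cluster on several columns at once.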

Results

TPC-DS benchmark: a Lakehouse system (Delta Lake with Delta Engine on Databricks) is competitive with four widely used cloud data warehouses, which the paper does not name, at scale factor 30,000
The authors report that the architecture is already used at thousands of enterprises; later industry adoption has confirmed the trend
The authors argue that it removes the need for a two-tier architecture for most workloads

Limitations

Author-stated:

Streaming use cases still have some latency limitations at the time of writing (2021)
Governance and data catalog integration still maturing

Unstated:

TPC-DS comparison against commercial warehouses may favor columnar optimizations; real-world mixed workloads are harder to benchmark
Metadata layer approach (Delta Lake) is proprietary-adjacent; Apache Iceberg and Hudi emerged as open alternatives

Reproducibility

Code: Delta Lake is open source at github.com/delta-io/delta
Data: TPC-DS benchmarks are reproducible with standard tooling
Context: Companion VLDB 2020 paper covers Delta Lake implementation in more detail

Insights

This paper formalized “Lakehouse” and defined the concept that now underpins most modern data stacks (Delta Lake on Databricks, Iceberg tables in Snowflake, BigLake, Polaris Catalog). The structural diagnosis, that ETL between lake and warehouse is the root problem, is correct and elegant. The architecture eliminates a whole category of failure modes by collapsing two systems into one. The open-format commitment (Parquet/ORC rather than proprietary storage) is strategically important: it enables vendor interoperability and prevents lock-in, which drove Apache Iceberg adoption alongside Delta Lake.

Connections

Raw Excerpt

The key idea behind the Lakehouse architecture is simple: implement similar data structures and data management features to those in a data warehouse directly on the kind of low cost storage used for data lakes. If the data management features available on modern data lakes can match those in data warehouses, there is no longer a need for a separate warehouse copy of the data.
