Summary

A deep dive into Apache Airflow’s data interval concept — the time range each DAG run is responsible for processing. Explains why Airflow executes after the interval ends (data completeness guarantee), the relationship between logical_date and execution time, how start_date anchors interval calculation, and how data intervals enable backfilling and idempotent reruns.

Key Points

  • Data interval = the “what to process”: each DAG run owns a specific half-open range [data_interval_start, data_interval_end); for interval-based schedules the logical_date equals the interval start, not the time the run executes
  • Runs execute AFTER the interval ends: the daily run whose interval covers Jan 1 starts just after midnight on Jan 2, so it never processes a partially written interval
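  • start_date is the interval anchor: it sets the earliest possible interval start, and for timedelta schedules every later interval is counted forward from it (cron schedules align to the cron boundaries instead); changing it on a deployed DAG can change which intervals get runs, so use a fixed, static start_date rather than a dynamic value such as datetime.now()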
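  • Idempotency: because each run is tied to a specific, immutable time range, retrying a failed run or backfilling is safe, provided the task reads only that range and overwrites (rather than appends to) its output; under those conditions the same interval always produces the same result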
  • Backfilling: in the Airflow 2.x CLI, airflow dags backfill -s <start> -e <end> <dag_id> creates historical runs for intervals that were missed or need reprocessing; this is only possible because intervals are deterministic
  • data_interval_start vs execution_date: Airflow 2.2+ formalized the distinction; tasks should use data_interval_start/data_interval_end rather than the deprecated execution_date for clarity
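
The interval arithmetic in the points above can be sketched in plain Python, without Airflow installed. The helper name daily_intervals is hypothetical and assumes a fixed 24-hour schedule in UTC; it illustrates the half-open [start, end) convention and is not Airflow's timetable implementation.

```python
from datetime import datetime, timedelta, timezone


def daily_intervals(start_date, until):
    """Yield half-open daily intervals [start, end), counted from start_date.

    Hypothetical helper for illustration. Only intervals that have fully
    ended by `until` are produced, mirroring "run after the interval ends".
    """
    step = timedelta(days=1)
    start = start_date
    while start + step <= until:
        yield start, start + step
        start += step


start_date = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)

intervals = list(daily_intervals(start_date, now))
for begin, end in intervals:
    # In this sketch the logical date is simply the interval start.
    print(f"logical_date={begin:%Y-%m-%d}  interval=[{begin:%Y-%m-%d %H:%M}, {end:%Y-%m-%d %H:%M})")
```

Because the intervals depend only on start_date and the schedule, calling the helper again with the same arguments yields the same list, which is the property backfilling relies on. The interval starting Jan 3 is not produced here because it has not ended at the assumed "now".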

Insights

The most common confusion for new Airflow users is why their daily DAG “runs a day late” — understanding that the run executes after the interval ends (not at interval start) resolves this. The design is intentional: if you’re processing “today’s logs,” you want all of today’s logs to exist before running, which means running at midnight tomorrow.
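
A minimal sketch of why the run appears "a day late", again in plain Python and assuming a daily schedule aligned to midnight UTC. The function name latest_closed_interval is hypothetical: given a wall-clock time, it returns the most recent interval that has completely ended, which is the one a run starting at that moment would process.

```python
from datetime import datetime, timedelta, timezone


def latest_closed_interval(now):
    """Return the most recent daily interval [start, end) that ended by `now`.

    Assumes a daily schedule aligned to midnight (hypothetical helper).
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=1), midnight


# A run that starts a few seconds after midnight on Jan 2 ...
run_started = datetime(2024, 1, 2, 0, 0, 5, tzinfo=timezone.utc)
start, end = latest_closed_interval(run_started)

# ... processes Jan 1, because that is the interval that just closed.
print(f"run started : {run_started:%Y-%m-%d %H:%M:%S}")
print(f"processes   : [{start:%Y-%m-%d %H:%M}, {end:%Y-%m-%d %H:%M})")
```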

The idempotency guarantee is the most important production principle: if a task fails at 2pm and you retry it at 3pm, it should produce the same result. This only holds if tasks read from data_interval_start/end rather than datetime.now().
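
The contrast with datetime.now() can be illustrated with a small sketch. The function plan_run, the column name event_time, and the output path layout are all hypothetical; in a real Airflow task the two bounds would come from the data_interval_start and data_interval_end values in the task context.

```python
from datetime import datetime, timezone


def plan_run(data_interval_start, data_interval_end):
    """Derive everything a task needs from the interval, never the wall clock.

    Hypothetical helper: returns a parameterized filter and a partition path
    that a retry can safely overwrite.
    """
    return {
        "where": "event_time >= %(start)s AND event_time < %(end)s",
        "params": {"start": data_interval_start, "end": data_interval_end},
        "output": f"events/dt={data_interval_start:%Y-%m-%d}/part.parquet",
    }


interval_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
interval_end = datetime(2024, 1, 2, tzinfo=timezone.utc)

# The first attempt (say, at 2pm) and a retry (at 3pm) receive the same
# interval, so they build the same plan regardless of when they execute.
first_attempt = plan_run(interval_start, interval_end)
retry = plan_run(interval_start, interval_end)
print(first_attempt["output"])
print(first_attempt == retry)
```

Writing to a partition path derived from the interval means a retry replaces the earlier output instead of adding a second copy, which is one common way to keep reruns idempotent.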

Connections

Raw Excerpt

Data intervals provide a structured way to ensure that DAGs process only the data they are meant to handle, enable accurate backfilling and replaying of workflows, and support idempotency by ensuring that each DAG run processes data for a clearly defined and immutable time range.