Scaling Airflow Dataset Scheduling: Lessons from Common Crawl

Thu, 13 Mar 2025 00:00:00 +0000

Since joining Common Crawl, I’ve been involved in a number of activities revolving around indexing and cataloging the integrity of our data. To coordinate these activities, and to be able to re-run them when datasets change, I decided to give Airflow a try. The dataset-based scheduling introduced in Airflow 2.4 seemed attractive, but it also comes with interesting challenges when working at our scale.
