Data Engineering on jason grey

Scaling Airflow Dataset Scheduling: Lessons from Common Crawl

Thu, 13 Mar 2025 00:00:00 +0000

Since joining Common Crawl, I’ve been involved in a number of activities revolving around indexing our data and cataloging its integrity. To coordinate these activities, and to be able to re-run them when datasets change, I decided to give Airflow a try. The dataset-based scheduling introduced in Airflow 2.4 seemed attractive, but it also comes with interesting challenges when working at our scale.

Launch of open source tool: gzinspector

Fri, 15 Nov 2024 00:00:00 +0000

I published an open source tool “gzinspector” to inspect gzip streams - specifically those encoded with many chunks.

A robust command-line tool for inspecting and analyzing GZIP/ZLIB compressed files. GZInspector provides detailed information about compression chunks, headers, and content previews with support for both human-readable and JSON output formats.

I built it because of the work I’ve been doing for Common Crawl, specifically around processing “ZipNum” format CDXJ indexes.

If you find it useful, let me know (reach out or star it on GitHub).
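
For a rough idea of what “many chunks” means here, the sketch below walks a file made of several gzip members concatenated back to back, using only Python’s standard library. It is not gzinspector’s actual implementation, and the function name `iter_gzip_members` is a hypothetical one chosen for illustration.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield (offset, compressed_length, payload) for each gzip member in data."""
    offset = 0
    while offset < len(data):
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        payload = decompressor.decompress(data[offset:])
        # unused_data holds the bytes that follow the end of this member.
        consumed = len(data) - offset - len(decompressor.unused_data)
        yield offset, consumed, payload
        offset += consumed


# Two independently compressed chunks concatenated into one stream.
sample = gzip.compress(b"first record\n") + gzip.compress(b"second record\n")
members = list(iter_gzip_members(sample))

for offset, length, payload in members:
    print(f"member at byte {offset}: {length} compressed bytes, "
          f"{len(payload)} decompressed bytes")
```

Because each member carries its own header and trailer, a reader that knows a member’s byte offset can decompress just that chunk without touching the rest of the file.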

Common Crawl Contributions

Mon, 01 Jan 2024 00:00:00 +0000

I’ve been doing public and private work on Common Crawl — the open repository of web crawl data that underpins a huge amount of research and AI training.

Two specific contributions:

cc-pyspark — Added support for file-wise processing, enabling more efficient batch operations on the crawl corpus.
webarchive-indexing — Migrated legacy mrjob tasks to modern Spark jobs to process 9PB+ of crawl data.