Common Crawl on jason grey

Scaling Airflow Dataset Scheduling: Lessons from Common Crawl

Thu, 13 Mar 2025 00:00:00 +0000

Since joining Common Crawl, I’ve been involved in a number of activities revolving around indexing our data and cataloging its integrity. To coordinate these activities, and to be able to re-run them when datasets change, I decided to give Airflow a try. The dataset-based scheduling introduced in Airflow 2.4 seemed attractive, but it also comes with interesting challenges when working at our scale.

AI and the Right To Learn on an Open Internet

Thu, 02 May 2024 00:00:00 +0000

As part of my involvement with Common Crawl Foundation, I recently attended “AI and the Right To Learn on an Open Internet: A Conversation Convened by Common Crawl Foundation and Professor Jeff Jarvis” in New York.

Jeff and Rich ended the conference by going wide and asking the entire group of attendees for next steps.

The suggestion I put forth was to pair the policy makers and lawyers with data scientists or software engineers to develop robust ways of validating the policies, whatever they turn out to be.

Common Crawl Checker

Tue, 06 Feb 2024 00:00:00 +0000

Enter a hostname, see if Common Crawl has it

This checks CC-MAIN-2023-50, the crawl from November/December 2023. I may update this in the future to check the latest crawl, but for now, that’s what we have.
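For anyone who would rather script a similar lookup, here is a minimal sketch of one way to do it. It assumes the public Common Crawl CDX index server at index.commoncrawl.org and is not necessarily how the checker on this page works; the function name `build_index_query` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public Common Crawl CDX index server.
# The checker on this page may be implemented differently.
INDEX_SERVER = "https://index.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2023-50"


def build_index_query(hostname: str, crawl_id: str = CRAWL_ID) -> str:
    """Build an index query URL for every capture under a hostname.

    Hypothetical helper: it only constructs the URL and does not
    perform the HTTP request.
    """
    params = {
        "url": f"{hostname}/*",  # wildcard: match all paths on the host
        "output": "json",        # ask for one JSON record per line
    }
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


if __name__ == "__main__":
    print(build_index_query("example.com"))
```

Fetching the printed URL with any HTTP client should return one JSON record per capture if the crawl includes the host. That covers the scripted route; the interactive checker is the quicker option.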

Give it a try here: