Common Crawl Contributions

Mon, 01 Jan 2024 00:00:00 +0000

I’ve been doing public and private work on Common Crawl — the open repository of web crawl data that underpins a huge amount of research and AI training.

Two specific contributions:

cc-pyspark — Added support for file-wise processing, enabling more efficient batch operations on the crawl corpus.
webarchive-indexing — Migrated legacy mrjob tasks to modern Spark jobs to process 9PB+ of crawl data.
