<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Common Crawl on jason grey</title><link>https://jason-grey.com/tags/common-crawl/</link><description>Recent content in Common Crawl on jason grey</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 13 Mar 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://jason-grey.com/tags/common-crawl/index.xml" rel="self" type="application/rss+xml"/><item><title>Scaling Airflow Dataset Scheduling: Lessons from Common Crawl</title><link>https://jason-grey.com/posts/2025/airflow-at-scale/</link><pubDate>Thu, 13 Mar 2025 00:00:00 +0000</pubDate><guid>https://jason-grey.com/posts/2025/airflow-at-scale/</guid><description>&lt;h1 id="scaling-airflow-dataset-scheduling-lessons-from-common-crawl"&gt;
 Scaling Airflow Dataset Scheduling: Lessons from Common Crawl
 &lt;a class="heading-link" href="#scaling-airflow-dataset-scheduling-lessons-from-common-crawl"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h1&gt;
&lt;p&gt;Since joining &lt;a href="https://www.commoncrawl.org/" class="external-link" target="_blank" rel="noopener"&gt;Common Crawl&lt;/a&gt;, I&amp;rsquo;ve been involved with a number of activities revolving around indexing and cataloging the integrity of our data. To coordinate these activities, and to be able to re-run them when datasets change, I decided to give Airflow a try. The introduction of dataset-based scheduling in Airflow 2.4 seemed attractive, but it also came with interesting challenges when working at our scale.&lt;/p&gt;</description></item><item><title>AI and the Right To Learn on an Open Internet</title><link>https://jason-grey.com/posts/2024/right-to-learn-conference/</link><pubDate>Thu, 02 May 2024 00:00:00 +0000</pubDate><guid>https://jason-grey.com/posts/2024/right-to-learn-conference/</guid><description>&lt;p&gt;As part of my involvement with the &lt;a href="https://www.commoncrawl.org" class="external-link" target="_blank" rel="noopener"&gt;Common Crawl Foundation&lt;/a&gt;, I recently attended the &amp;ldquo;&lt;a href="https://lu.ma/3g9vhzvd" class="external-link" target="_blank" rel="noopener"&gt;AI and the Right To Learn on an Open Internet: A Conversation Convened by Common Crawl Foundation and Professor Jeff Jarvis&lt;/a&gt;&amp;rdquo; conference in New York.&lt;/p&gt;
&lt;p&gt;Jeff and Rich ended the conference by going wide and asking the entire group of attendees for next steps.&lt;/p&gt;
&lt;p&gt;The suggestion I put forth was to pair the policy makers and lawyers with data scientists or software engineers to develop robust ways of validating whatever the policies might be.&lt;/p&gt;</description></item><item><title>Common Crawl Checker</title><link>https://jason-grey.com/posts/2024/common-crawl-checker/</link><pubDate>Tue, 06 Feb 2024 00:00:00 +0000</pubDate><guid>https://jason-grey.com/posts/2024/common-crawl-checker/</guid><description>&lt;h1 id="enter-a-hostname-see-if-common-crawl-has-it"&gt;
 Enter a hostname, see if Common Crawl has it
 &lt;a class="heading-link" href="#enter-a-hostname-see-if-common-crawl-has-it"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h1&gt;
&lt;p&gt;This checks &lt;a href="https://www.commoncrawl.org/blog/november-december-2023-crawl-archive-now-available" class="external-link" target="_blank" rel="noopener"&gt;CC-MAIN-2023-50&lt;/a&gt;, the crawl from November/December 2023. I may update this in the future to check the latest crawl, but for now, that&amp;rsquo;s what we have.&lt;/p&gt;
&lt;p&gt;Give it a try here:&lt;/p&gt;


&lt;script&gt;
 function checkURL() {
 document.getElementById('result').textContent = '...';
 var urlToCheck = document.getElementById('urlInput').value;
 var apiUrl = 'https://api.jason-grey.com/check_url?url=' + encodeURIComponent(urlToCheck);

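 fetch(apiUrl)
 .then(response =&gt; {
 // fetch() only rejects on network failure, so treat HTTP error statuses as errors too
 if (!response.ok) {
 throw new Error('HTTP ' + response.status);
 }
 return response.json();
 })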
 .then(data =&gt; {
 document.getElementById('result').textContent = data.result;
 })
 .catch(error =&gt; {
 console.error('Error:', error);
 document.getElementById('result').textContent = 'Error calling the service';
 });
 }
 &lt;/script&gt;

 &lt;input type="text" id="urlInput" placeholder="Enter domain name to check"&gt;
 &lt;button onclick="checkURL()"&gt;Check&lt;/button&gt;
 &lt;p id="result"&gt;&lt;/p&gt;</description></item></channel></rss>