Scraping the NYT


EDIT: This method has been patched.

AI models require a lot of data to train. As a result, the truly bright conclusion AI companies came to was to indiscriminately scrape the entire internet. /s

Evidently this came with a whole host of problems, and it especially angered publishers. I am not going to comment on the morality of what counts as “fair use”, at least not in this post. But given the numerous ongoing lawsuits the New York Times has filed over the unauthorized use of its content, you would expect it to have strong scraping protections (foreshadowing). So, as a fun weekend project, I wanted to see how difficult it would be to scrape the NYT.

Investigation

Let’s first look at https://www.nytimes.com/robots.txt for some clues.

# Sitemaps

Sitemap: https://www.nytimes.com/sitemaps/new/news.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/sitemap.xml.gz

Hmm, the only item of interest is the sitemap; we’ll come back to it when building the crawler. Let’s switch gears and look at an article and its bot protection. For those following along at home:

curl "https://www.nytimes.com/2025/07/09/technology/linda-yaccarino-x-steps-down.html"

The plot thickens…

<script>
(function(){
  if (document.cookie.indexOf('NYT-S') === -1) {
    var iframe = document.createElement('iframe');
    iframe.height = 0;
    iframe.width = 0;
    iframe.style.display = 'none';
    iframe.style.visibility = 'hidden';
    iframe.src = 'https://myaccount.nytimes.com/auth/prefetch-assets';
    document.body.appendChild(iframe);
  }
})();
</script>

The script only checks whether an NYT-S cookie exists, not what it contains. What if we pass a value for that NYT-S cookie ourselves…

curl "https://www.nytimes.com/2025/07/09/technology/linda-yaccarino-x-steps-down.html" -b "NYT-S=enroneurope"

Huh, it really was that easy.

X Chief Says She Is Leaving the Social Media Platform

Code
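
The crawler below feeds off one of the NYT’s monthly sitemaps. If you want to peek at the sitemap index we saw in robots.txt first, here is a rough sketch. Assumptions: Node 18+ with a global fetch, and that the index links out to per-month sitemaps like the sitemap-2025-07.xml.gz used further down; the gzip check is there because, depending on Content-Encoding, the body may arrive already decompressed.

// peek-sitemaps.ts (hypothetical helper, not part of the crawler)
import { gunzipSync } from 'node:zlib';

const res = await fetch('https://www.nytimes.com/sitemaps/new/sitemap.xml.gz');
const buf = Buffer.from(await res.arrayBuffer());

// Decompress only if the payload is actually gzipped (magic bytes 0x1f 0x8b).
const xml = buf[0] === 0x1f && buf[1] === 0x8b
  ? gunzipSync(buf).toString('utf-8')
  : buf.toString('utf-8');

// List the <loc> entries, which should point at the individual sitemaps.
const locs = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
console.log(locs.slice(0, 10));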

Here is a quick crawler I whipped up:

// main.ts
import { CheerioCrawler, Sitemap, ProxyConfiguration } from 'crawlee';

import { router } from './routes.js';

const MAX_CONCURRENCY = 1;
const MAX_REQUESTS_PER_MINUTE = 500;
const MAX_REQUEST_RETRIES = 10;

const curlHeaders = {
  'Accept': '*/*',
  'User-Agent': 'curl/8.14.1', // Important: match the plain curl requests that worked above
};

const crawler = new CheerioCrawler({
  // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
  requestHandler: router,
  // maxRequestsPerCrawl: 10,
  maxConcurrency: MAX_CONCURRENCY,
  maxRequestsPerMinute: MAX_REQUESTS_PER_MINUTE,
  maxRequestRetries: MAX_REQUEST_RETRIES,
  // Attach the curl-like headers and the NYT-S cookie trick to every request.
  preNavigationHooks: [
    (crawlingContext, gotOptions) => {
      gotOptions.headers = {
        ...gotOptions.headers,
        ...curlHeaders,
        Cookie: 'NYT-S=enroneurope;',
      };
    }
  ],

  async failedRequestHandler({ request, log, error }) {
    log.error(`Request ${request.url} failed too many times.`, { error });
  },
});

const { urls } = await Sitemap.load('https://www.nytimes.com/sitemaps/new/sitemap-2025-07.xml.gz');

await crawler.addRequests(urls);

await crawler.run();

And here is the route handler that does the Readability extraction and sanitization:

// router.ts
import { createCheerioRouter } from 'crawlee';
import { parseHTML } from 'linkedom';
import { Readability } from '@mozilla/readability';
import DOMPurify from 'dompurify';

export const router = createCheerioRouter();

// DOMPurify needs a window object; linkedom provides one.
const { window } = parseHTML('<html></html>');
const dompurify = DOMPurify(window);

router.addDefaultHandler(async ({ request, body, log, pushData }) => {
  // Re-parse the raw HTML with linkedom so Readability gets a proper DOM.
  const { document } = parseHTML(body.toString());

  const reader = new Readability(document);
  const article = reader.parse();

  if (article) {
    log.info(`${article.title}`, { url: request.loadedUrl });

    const clean = dompurify.sanitize(article.content);

    await pushData({
      url: request.loadedUrl,
      title: article.title,
      byline: article.byline,
      content_html: clean,
      content_text: article.textContent,
      length: article.length,
    });
  } else {
    log.info('Readability.js could not parse article', { url: request.loadedUrl });
    await pushData({
      url: request.loadedUrl,
      title: 'N/A',
      byline: 'N/A',
      content_html: 'N/A',
      content_text: 'N/A',
      length: 'N/A',
    });
  }
});
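
By default, crawlee writes each pushData item as JSON under ./storage/datasets/default, so post-processing the scraped articles is straightforward. A minimal sketch for reading the results back out, assuming the default storage location:

// read-results.ts (hypothetical post-processing script)
import { Dataset } from 'crawlee';

// Open the default dataset that the crawler pushed its items into.
const dataset = await Dataset.open();
const { items } = await dataset.getData();

console.log(`Scraped ${items.length} articles`);
for (const item of items.slice(0, 5)) {
  console.log(item.title, item.url);
}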