ia-blackout-pulse methodology

What this tracks

In May 2026, Nieman Lab published "More than 340 local news outlets are limiting the Internet Archive's access to their journalism" but provided no machine-readable list, no per-outlet robots.txt, and no timeline. ia-blackout-pulse reproduces the work and keeps it current.

Data sources (verbatim)

Source | URL | Refresh | On failure
Nieman Lab article | niemanlab.org / 2026 / 05 / more-than-340-local-news-outlets… | weekly, 0 4 * * 0 (Sun 04:00 UTC) | Log to runs with error_msg; keep last successful roster active.
Outlet robots.txt | https://{domain}/robots.txt (one HTTP fetch per outlet) | weekly, staggered ~4s per outlet to avoid hammering | Mark robots_status='unreachable', retain previous classification, increment fail_streak.
Wayback Machine CDX | http://web.archive.org/cdx/search/cdx?url={domain}&output=json&limit=-1&filter=statuscode:200&fl=timestamp,original&from=20200101 | weekly, after the robots probe (staggered ~2s) | Mark last_snapshot_at=NULL if no rows; do not overwrite previous value on 5xx.
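
The "On failure" rules for the robots.txt and CDX sources can be read as small state transitions. The sketch below is one possible reading, not the project's actual code: the OutletState record, both function names, and the handling of statuses other than 200 and 5xx are assumptions made for illustration. Only the field names robots_status, fail_streak, and last_snapshot_at come from the table.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class OutletState:
    """Hypothetical per-outlet record; only some field names come from the table."""
    classification: Optional[str]
    robots_status: str
    fail_streak: int
    last_snapshot_at: Optional[str]  # CDX timestamp string, or None for NULL


def after_failed_robots_probe(state: OutletState) -> OutletState:
    # Mark unreachable, keep the previous classification, bump the streak.
    return replace(
        state,
        robots_status="unreachable",
        fail_streak=state.fail_streak + 1,
    )


def after_cdx_response(
    state: OutletState, http_status: int, timestamps: Sequence[str]
) -> OutletState:
    # On a 5xx the previous last_snapshot_at must not be overwritten.
    if 500 <= http_status <= 599:
        return state
    if http_status == 200:
        # No rows means last_snapshot_at becomes NULL (None here).
        latest = max(timestamps) if timestamps else None
        return replace(state, last_snapshot_at=latest)
    # Assumption: any other status is treated like a 5xx and changes nothing.
    return state
```

Returning a new record instead of mutating in place keeps the "retain previous value" rules easy to check: an unchanged outcome is simply the same object.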

All three sources are public and require no API key. Every "live" claim on the site traces back to one of these rows. Every outbound fetch sends the User-Agent ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research), uses a 15-second timeout, and is not retried within a single run.
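
As a rough illustration of that fetch policy, the following sketch uses only the Python standard library. The User-Agent string and the 15-second timeout come from the text above; the function names and the (status, body) return shape are assumptions, and the real fetcher may be written differently.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple

USER_AGENT = "ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)"
TIMEOUT_SECONDS = 15


def robots_url(domain: str) -> str:
    # Mirrors the https://{domain}/robots.txt pattern from the sources table.
    return f"https://{domain}/robots.txt"


def build_request(url: str) -> urllib.request.Request:
    # Every outbound request carries the same identifying User-Agent.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_once(url: str) -> Tuple[Optional[int], Optional[bytes]]:
    """Single attempt, no retries. Returns (status, body); either may be None."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return exc.code, None
    except OSError:
        # DNS failure, refused connection, timeout, TLS error, and similar.
        return None, None
```

A caller would sleep between outlets (about 4 seconds for robots.txt, about 2 seconds for CDX, per the table) rather than retrying a failed fetch within the same run.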

How we classify

We parse each outlet's /robots.txt and look for groups whose User-agent matches one of:

Each agent's group is then evaluated:
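
The project's own agent list and evaluation rules are not reproduced here. Purely to illustrate group matching, the sketch below parses robots.txt into groups and checks whether the group for a given agent contains a blanket Disallow: /. The agent tokens in ASSUMED_AGENTS and the "blanket disallow" rule are assumptions, not the site's actual criteria, and the sketch deliberately ignores Allow overrides and the wildcard (*) group.

```python
from typing import List, Tuple

Rule = Tuple[str, str]               # (field, path), e.g. ("disallow", "/")
Group = Tuple[List[str], List[Rule]]  # (agent tokens, rules)

# Assumed placeholder tokens for illustration only.
ASSUMED_AGENTS = ("ia_archiver", "archive.org_bot")


def parse_groups(body: str) -> List[Group]:
    """Split a robots.txt body into (agents, rules) groups."""
    groups: List[Group] = []
    agents: List[str] = []
    rules: List[Rule] = []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:
                # A User-agent line after rules starts a new group.
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for_agent(body: str, agent: str) -> List[Rule]:
    # Collect rules from every group that names this agent explicitly.
    wanted = agent.lower()
    matched: List[Rule] = []
    for group_agents, group_rules in parse_groups(body):
        if wanted in group_agents:
            matched.extend(group_rules)
    return matched


def has_blanket_disallow(rules: List[Rule]) -> bool:
    # Assumed rule: any "Disallow: /" in the agent's group counts as a block.
    return any(field == "disallow" and path == "/" for field, path in rules)
```

For example, has_blanket_disallow(rules_for_agent(body, agent)) could be evaluated for each agent in ASSUMED_AGENTS, keeping the matched rule lines as evidence for the verdict.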

An outlet's last Wayback snapshot date approximates when the blackout began. If the most recent snapshot is more than a few weeks old, the outlet has likely been blocking the Internet Archive for about that long.
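
To make the dating step concrete, here is a minimal sketch that turns a CDX JSON response into a snapshot age in days. It assumes the response is a JSON array whose first row is a header naming the requested fields (timestamp, original) and that timestamps are 14-digit UTC strings; the function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def latest_snapshot(cdx_json: str) -> Optional[datetime]:
    """Return the newest snapshot time in a CDX JSON body, or None if no rows."""
    rows = json.loads(cdx_json)
    data = rows[1:]  # rows[0] is assumed to be the header row
    if not data:
        return None
    ts_index = rows[0].index("timestamp")
    # 14-digit timestamps sort correctly as plain strings.
    newest = max(row[ts_index] for row in data)
    return datetime.strptime(newest, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def days_since(snapshot: datetime, now: datetime) -> int:
    # Whole days between the last snapshot and the reference time.
    return (now - snapshot).days
```

A caller would pass datetime.now(timezone.utc) as the reference time and treat a None result as "no snapshot on record" rather than as an age.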

Why might my outlet be misclassified?

Every classification ships with its evidence. On any outlet detail page (e.g. /outlet/<slug>) you can read the raw robots.txt body we fetched, the time we fetched it, the SHA-256 of the body, and the precise lines that triggered the verdict. If you believe the call is wrong, that page tells you exactly what we saw, so you can:
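
One check that follows from the evidence listed above is recomputing the SHA-256 of a robots.txt body and comparing it with the published value. The sketch assumes the hash is taken over the raw response bytes and published as lowercase hex; both points are assumptions, and the function names are hypothetical.

```python
import hashlib


def sha256_hex(body: bytes) -> str:
    # Hash the raw bytes exactly as received, with no decoding or normalization.
    return hashlib.sha256(body).hexdigest()


def matches_published_hash(body: bytes, published_hex: str) -> bool:
    # Tolerate surrounding whitespace and uppercase hex in the published value.
    return sha256_hex(body) == published_hex.strip().lower()
```

A mismatch does not by itself mean the classification is wrong; it may only mean the outlet's robots.txt changed after the recorded fetch time.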

What this is not

We do not track non-US outlets, generalized AI-crawler policies (that's a different problem — see crawl-policy), Internet Archive infrastructure health (see archive-health), legal/policy commentary, or recommendations. We just publish the structured data Nieman didn't.

License

The dataset (CSV / JSON) is CC0 — cite it freely. The site is no-auth, no tracking, no analytics.