What this tracks

In May 2026, Nieman Lab published "More than 340 local news outlets are limiting the Internet Archive's access to their journalism" — but no machine-readable list, no per-outlet robots.txt, no timeline. ia-blackout-pulse reproduces the work and keeps it current.

Data sources (verbatim)

Source	URL	Refresh	On failure
Nieman Lab article	`niemanlab.org / 2026 / 05 / more-than-340-local-news-outlets…`	weekly — `0 4 * * 0` (Sun 04:00 UTC)	Log to `runs` with `error_msg`; keep last successful roster active.
Outlet robots.txt	`https://{domain}/robots.txt` — one HTTP fetch per outlet	weekly, staggered ~4s per outlet to avoid hammering	Mark `robots_status='unreachable'`, retain previous classification, increment `fail_streak`.
Wayback Machine CDX	`http://web.archive.org/cdx/search/cdx?url={domain}&output=json&limit=-1&filter=statuscode:200&fl=timestamp,original&from=20200101`	weekly, after the robots probe (staggered ~2s)	Mark `last_snapshot_at=NULL` if no rows; do not overwrite previous value on 5xx.

All three sources are public and require no API key. Every "live" claim on the site traces back to one of these rows. Every outbound fetch sends the User-Agent `ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)`, uses a 15-second timeout, and is never retried within a single run.
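
The fetch policy and the robots.txt failure rule above could be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the function names, the `classification` field, the `"ok"` status value, and resetting `fail_streak` to 0 on success are all assumptions. Only `robots_status`, `fail_streak`, the User-Agent string, the 15-second timeout, and the ~4s stagger come from the table and paragraph above.

```python
import time
import urllib.request

USER_AGENT = "ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)"
TIMEOUT_SECONDS = 15
STAGGER_SECONDS = 4.0  # approximate gap between outlets


def fetch_once(url):
    """Single attempt, no retries. Returns (status, body); status is None on network failure."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        return exc.code, b""
    except OSError:  # covers URLError, timeouts, DNS and connection errors
        return None, b""


def record_robots_failure(outlet):
    """Mark unreachable, leave the previous classification alone, bump fail_streak."""
    updated = dict(outlet)
    updated["robots_status"] = "unreachable"
    updated["fail_streak"] = outlet.get("fail_streak", 0) + 1
    return updated


def probe_robots(outlets, fetch=fetch_once, sleep=time.sleep, delay=STAGGER_SECONDS):
    """Probe each outlet once, pausing between outlets to avoid hammering anyone."""
    results = []
    for i, outlet in enumerate(outlets):
        if i:
            sleep(delay)
        status, body = fetch(f"https://{outlet['domain']}/robots.txt")
        if status == 200:
            # "ok" and the streak reset are assumptions, not documented values.
            results.append({**outlet, "robots_status": "ok", "fail_streak": 0, "body": body})
        else:
            results.append(record_robots_failure(outlet))
    return results
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without touching the network.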

How we classify

We parse each outlet's /robots.txt and look for groups whose User-agent matches one of:

ia_archiver — Internet Archive's classic crawler.
archive.org_bot — current IA crawler identifier.
* (wildcard) — applied only if no IA-specific group exists.
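
The group selection described above might look something like this minimal sketch. The function names are hypothetical, and it simplifies real robots.txt matching: agent names are compared by exact, case-insensitive equality, and unknown fields are ignored.

```python
IA_AGENTS = ("ia_archiver", "archive.org_bot")


def parse_groups(body):
    """Split a robots.txt body into a list of (agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def select_rules(body):
    """Return rules keyed by matched agent; '*' is used only if no IA group exists."""
    matched = {}
    wildcard = []
    has_wildcard = False
    for agents, rules in parse_groups(body):
        for agent in agents:
            if agent in IA_AGENTS:
                matched.setdefault(agent, []).extend(rules)
            elif agent == "*":
                has_wildcard = True
                wildcard.extend(rules)
    if not matched and has_wildcard:
        matched["*"] = wildcard
    return matched
```

A file that names neither an IA agent nor `*` yields an empty dict, which a caller would treat as having no relevant rules.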

Each agent's group is then evaluated:

blocked — explicit Disallow: / for that agent (entire site).
partial — Disallow rules exist but only on specific paths.
open — robots.txt fetched successfully, no relevant Disallow.
unreachable — fetch failed, timed out, or returned an HTML error page.
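
As a rough illustration, the four verdicts above can be expressed as a small function over the rules of one matched group. The names `verdict` and `looks_like_html` are hypothetical, the HTML check is a crude heuristic, and the sketch deliberately ignores `Allow` lines and wildcard patterns such as `/*`, which a production classifier might need to handle.

```python
def looks_like_html(body):
    """Heuristic: an HTML error page served in place of robots.txt."""
    head = body.lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def verdict(fetched_ok, rules):
    """Map one agent group's (field, path) rules to a classification label."""
    if not fetched_ok:
        return "unreachable"
    disallows = [path.strip() for field, path in rules if field.lower() == "disallow"]
    disallows = [p for p in disallows if p]  # an empty Disallow allows everything
    if "/" in disallows:
        return "blocked"
    if disallows:
        return "partial"
    return "open"
```

A caller would set `fetched_ok` to false when the fetch failed, timed out, or `looks_like_html` returned true for the body.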

An outlet's most recent Wayback snapshot date approximates when the blackout began. If that snapshot is more than a few weeks old, the outlet has likely been blocking the Internet Archive since about that date.
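
A sketch of how the snapshot date could be derived from a CDX response requested with `output=json` and `fl=timestamp,original`, where the first row is a header and timestamps have the form `YYYYMMDDhhmmss`. The function names are hypothetical. The source only says to keep the previous value on 5xx; treating every non-200 response the same way is an assumption made here.

```python
import json
from datetime import datetime, timezone


def last_snapshot_at(cdx_json):
    """Return the newest capture time as an aware UTC datetime, or None if no rows."""
    rows = json.loads(cdx_json) if cdx_json.strip() else []
    data = rows[1:]  # rows[0] is the field-name header
    if not data:
        return None
    newest = max(row[0] for row in data)  # fixed-width timestamps sort lexically
    return datetime.strptime(newest, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def update_snapshot(previous, status, body):
    """Keep the previous value on a failed request; otherwise take the new one (or None)."""
    if status != 200:
        return previous
    return last_snapshot_at(body)


def days_since(snapshot, now):
    """Whole days between the last snapshot and `now` (both timezone-aware)."""
    return (now - snapshot).days
```

The resulting day count is only a lower-confidence proxy for blackout length, since an outlet may simply have been crawled infrequently before it started blocking.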

Why might my outlet be misclassified?

Every classification ships with its evidence. On any outlet detail page (e.g. /outlet/<slug>) you can read the raw robots.txt body we fetched, the time at which we fetched it, the SHA-256 of the body, and the precise lines that triggered the verdict. If you believe the call is wrong, that page tells you exactly what we saw, so you can:

Compare against the live robots.txt at the same URL.
Check whether your CDN serves a different robots.txt to non-browser User-Agents (some do).
Open an issue with the SHA of our cached body so we can re-probe sooner.
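
To compare a live robots.txt against the cached copy, something like the following would do. It assumes the published SHA-256 is the lowercase hex digest of the raw response bytes, which the page does not state explicitly; the function names are hypothetical.

```python
import hashlib


def body_sha256(body):
    """Hex SHA-256 of the raw robots.txt bytes."""
    return hashlib.sha256(body).hexdigest()


def same_as_cached(live_body, cached_sha):
    """True if the live body hashes to the SHA shown on the outlet detail page."""
    return body_sha256(live_body) == cached_sha.strip().lower()
```

If the hashes differ when you fetch with a browser but match when you send the project's User-Agent, the CDN is probably varying its response by User-Agent.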

What this is not

We do not track non-US outlets, generalized AI-crawler policies (that's a different problem — see crawl-policy), Internet Archive infrastructure health (see archive-health), legal/policy commentary, or recommendations. We just publish the structured data Nieman didn't.

License

The dataset (CSV / JSON) is CC0 — cite it freely. The site is no-auth, no tracking, no analytics.