What this tracks
In May 2026, Nieman Lab published "More than 340 local news outlets are limiting the Internet Archive's access to their journalism", but released no machine-readable list, no per-outlet robots.txt, and no timeline. ia-blackout-pulse reproduces the work and keeps it current.
Data sources (verbatim)
| Source | URL | Refresh | On failure |
|---|---|---|---|
| Nieman Lab article | niemanlab.org / 2026 / 05 / more-than-340-local-news-outlets… | weekly: 0 4 * * 0 (Sun 04:00 UTC) | Log to runs with error_msg; keep last successful roster active. |
| Outlet robots.txt | https://{domain}/robots.txt (one HTTP fetch per outlet) | weekly, staggered ~4s per outlet to avoid hammering | Mark robots_status='unreachable', retain previous classification, increment fail_streak. |
| Wayback Machine CDX | http://web.archive.org/cdx/search/cdx?url={domain}&output=json&limit=-1&filter=statuscode:200&fl=timestamp,original&from=20200101 | weekly, after the robots probe (staggered ~2s) | Mark last_snapshot_at=NULL if no rows; do not overwrite previous value on 5xx. |
All three sources are public and require no API key. Every "live" claim on the site traces back to one of these rows. Every outbound fetch sends the User-Agent `ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)`, uses a 15-second timeout, and is not retried within a single run.
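The fetch policy in the table can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the outlet dict shape, the `"ok"` status value, and resetting `fail_streak` on success are all assumptions. Only the User-Agent, the 15-second timeout, the ~4s stagger, and the failure handling come from the table above.

```python
import http.client
import time
import urllib.request

USER_AGENT = "ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)"
TIMEOUT_S = 15   # per-request timeout from the table
STAGGER_S = 4    # approximate gap between robots.txt probes


def build_request(domain):
    """Build the robots.txt request with the project's User-Agent."""
    return urllib.request.Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": USER_AGENT},
    )


def fetch_robots(domain):
    """Single attempt, no retries. Returns the body text, or None on any failure."""
    try:
        with urllib.request.urlopen(build_request(domain), timeout=TIMEOUT_S) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        return None


def record_failure(outlet):
    """Mark unreachable and bump fail_streak; every other field is retained."""
    updated = dict(outlet)
    updated["robots_status"] = "unreachable"
    updated["fail_streak"] = outlet.get("fail_streak", 0) + 1
    return updated


def probe_all(outlets, fetch=fetch_robots, sleep=time.sleep):
    """Probe each outlet once, pausing STAGGER_S seconds between outlets."""
    results = []
    for index, outlet in enumerate(outlets):
        if index:
            sleep(STAGGER_S)
        body = fetch(outlet["domain"])
        if body is None:
            results.append(record_failure(outlet))
        else:
            # "ok" and the streak reset are hypothetical choices for this sketch.
            results.append({**outlet, "robots_status": "ok",
                            "robots_body": body, "fail_streak": 0})
    return results
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without touching the network or actually waiting.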
How we classify
We parse each outlet's /robots.txt and look for groups whose User-agent matches one of:
- `ia_archiver`: Internet Archive's classic crawler.
- `archive.org_bot`: current IA crawler identifier.
- `*` (wildcard): applied only if no IA-specific group exists.
Each agent's group is then evaluated:
- blocked: explicit `Disallow: /` for that agent (entire site).
- partial: `Disallow` rules exist but only on specific paths.
- open: `robots.txt` fetched successfully, no relevant Disallow.
- unreachable: fetch failed, timed out, or returned an HTML error page.
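The rules above could be implemented along these lines. This is a simplified sketch under stated assumptions, not the project's parser: `parse_groups` and `classify` are hypothetical names, `Allow` overrides and wildcard path patterns are ignored, the HTML check is a crude prefix test, and the per-agent verdicts are collapsed into a single verdict per outlet.

```python
IA_AGENTS = ("ia_archiver", "archive.org_bot")


def parse_groups(body):
    """Split a robots.txt body into (agents, disallow_paths) groups."""
    groups = []
    agents, rules, seen_rule = [], [], False
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:                     # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            seen_rule = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def classify(body):
    """Return 'blocked', 'partial', 'open', or 'unreachable' for one outlet."""
    if body is None or body.lstrip().lower().startswith(("<!doctype", "<html")):
        return "unreachable"
    groups = parse_groups(body)
    relevant = [rules for agents, rules in groups
                if any(agent in IA_AGENTS for agent in agents)]
    if not relevant:
        # The wildcard group applies only when no IA-specific group exists.
        relevant = [rules for agents, rules in groups if "*" in agents]
    disallows = [path for rules in relevant for path in rules]
    if "/" in disallows:
        return "blocked"
    if disallows:
        return "partial"
    return "open"
```

Note the precedence: a file that blocks `*` entirely but gives `archive.org_bot` its own group with an empty `Disallow:` comes out as open, because the IA-specific group wins over the wildcard.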
An outlet's most recent Wayback snapshot date approximates when the blackout began. If that snapshot is more than a few weeks old, the outlet has likely been blocking the Internet Archive since around that date.
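As a rough sketch of how the snapshot date might be derived from the CDX query listed under data sources: the helper names below are hypothetical, and the code assumes the JSON response is a list of rows whose first row holds the field names, with 14-digit `YYYYMMDDhhmmss` timestamps in UTC. The NULL-on-empty and keep-previous-on-5xx behavior follows the table.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_url(domain):
    """Build the CDX query from the data-sources table for one domain."""
    params = [
        ("url", domain),
        ("output", "json"),
        ("limit", "-1"),                 # negative limit: only the latest capture
        ("filter", "statuscode:200"),
        ("fl", "timestamp,original"),
        ("from", "20200101"),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":,")


def last_snapshot_at(cdx_body, previous=None, http_status=200):
    """Return the latest snapshot time, None if no rows, or `previous` on a 5xx."""
    if http_status >= 500:
        return previous                  # do not overwrite the stored value
    rows = json.loads(cdx_body)
    data = rows[1:]                      # assumed: first row is the header
    if not data:
        return None
    timestamp = data[-1][0]
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def days_since(snapshot, now=None):
    """Whole days between the snapshot and now (UTC)."""
    now = now or datetime.now(timezone.utc)
    return (now - snapshot).days
```

A `days_since` value well beyond the weekly refresh interval is the signal the paragraph above describes. What counts as "a few weeks" is left to the caller here.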
Why might my outlet be misclassified?
Every classification ships with its evidence. On any outlet detail page (e.g. /outlet/<slug>) you can read the raw robots.txt body we fetched, the timestamp we fetched it, the SHA-256 of the body, and the precise lines that triggered the verdict. If you believe the call is wrong, that page tells you exactly what we saw, so you can:
- Compare against the live `robots.txt` at the same URL.
- Check whether your CDN serves a different `robots.txt` to non-browser User-Agents (some do).
- Open an issue with the SHA of our cached body so we can re-probe sooner.
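To compare a live copy against the cached one, something like the following may help. It assumes the published SHA-256 is a lowercase hex digest computed over the raw response bytes, which this page does not confirm; the function names are illustrative.

```python
import hashlib


def body_sha256(body: bytes) -> str:
    """Hex SHA-256 of a robots.txt body, hashed as raw bytes."""
    return hashlib.sha256(body).hexdigest()


def matches_cached(live_body: bytes, cached_sha: str) -> bool:
    """True when the live body hashes to the SHA shown on the outlet page."""
    return body_sha256(live_body) == cached_sha.strip().lower()
```

If the hashes differ when you fetch with a browser User-Agent but match when you send the project's User-Agent, the CDN is likely serving different files to different clients.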
What this is not
We do not track non-US outlets, generalized AI-crawler policies (that's a different
problem — see crawl-policy), Internet Archive infrastructure health
(see archive-health), legal/policy commentary, or recommendations. We just
publish the structured data Nieman didn't.
License
The dataset (CSV / JSON) is CC0 — cite it freely. The site is no-auth, no tracking, no analytics.