# ia-blackout-pulse
Live tracker of US local-news outlets that have started blocking the Internet Archive — with per-outlet robots.txt evidence, a US choropleth map, a weekly diff feed, sortable table, CSV export, and shareable per-outlet "archive blackout" cards.
In May 2026 Nieman Lab reported on 340+ outlets but published no list, no timeline, and no methodology artifact. This service reproduces the work and keeps it current.
## What it does
Four real data sources, fetched at runtime, refreshed by node-cron:
| Source | URL | Refresh interval |
|---|---|---|
| Nieman Lab article (provenance) | https://www.niemanlab.org/2026/05/more-than-340-local-news-outlets-are-limiting-the-internet-archives-access-to-their-journalism/ | weekly, `0 4 * * 0` (Sun 04:00) |
| Linked methodology dataset | https://raw.githubusercontent.com/palewire/news-homepages/main/newshomepages/sources/sites.csv — cited inline by the article as the "database of 1,167 news websites" maintained by Ben Welsh at palewi.re/docs/news-homepages | weekly (alongside the Nieman fetch) |
| Outlet /robots.txt | https://{domain}/robots.txt — one fetch per US outlet | weekly, staggered ~0.4 s between fetches |
| Wayback Machine CDX API | http://web.archive.org/cdx/search/cdx?url={domain}&output=json&limit=-1&filter=statuscode:200&fl=timestamp,original&from=20200101 | weekly, ~1.1 s between fetches |
User-Agent on every outbound fetch: ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research). Timeout 15 s, no retries within a run.
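The fetch policy above (fixed User-Agent, 15 s timeout, no retries, a pause between outlets) could look roughly like the sketch below. It is an illustration, not the code in this repo: `buildRequestInit`, `probeRobots`, and the injectable `fetchImpl` parameter are hypothetical names, and the shape of the result objects is assumed.

```js
// Sketch only: helper names and result shape are assumptions, not the repo's API.
const USER_AGENT =
  'ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)';

// Request options shared by every outbound fetch: fixed UA plus a hard timeout.
function buildRequestInit(timeoutMs = 15_000) {
  return {
    headers: { 'User-Agent': USER_AGENT },
    signal: AbortSignal.timeout(timeoutMs),
  };
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Probe each domain once, in order, pausing spacingMs between fetches.
async function probeRobots(domains, { spacingMs = 400, fetchImpl = fetch } = {}) {
  const results = [];
  for (const domain of domains) {
    try {
      const res = await fetchImpl(`https://${domain}/robots.txt`, buildRequestInit());
      results.push({ domain, httpStatus: res.status, body: await res.text() });
    } catch (err) {
      // Timeouts and network failures are recorded, never retried within the run.
      results.push({ domain, httpStatus: null, error: String(err) });
    }
    await sleep(spacingMs);
  }
  return results;
}
```

Passing a fake `fetchImpl` makes the loop testable without touching the network.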
## Classification rules
Each outlet's /robots.txt is parsed (RFC 9309-ish). Verdict:
- **blocked**: explicit `Disallow: /` for `ia_archiver` or `archive.org_bot`, OR a wildcard `*` group with `Disallow: /` and no IA-specific override.
- **partial**: there is an IA-specific `User-agent` group with non-`/` `Disallow` rules (IA explicitly named, but not entirely blocked).
- **open**: `robots.txt` fetched successfully, no IA-targeted restriction.
- **unreachable**: fetch failed, timed out, or returned an HTML error page.
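The three parsed verdicts can be sketched as a small classifier. This is a simplified reading of the rules above, not the repo's parser: `parseGroups` and `classify` are hypothetical names, `Allow` lines are recorded but ignored, and any group that names an IA agent is assumed to override the wildcard group. The `unreachable` verdict is decided earlier, at fetch time, so it does not appear here.

```js
// Sketch only: a simplified classifier under the assumptions stated above.
const IA_AGENTS = ['ia_archiver', 'archive.org_bot'];

// Split a robots.txt body into groups of { agents, disallow, allow }.
function parseGroups(body) {
  const groups = [];
  let current = null;
  let sawRule = true; // forces the first User-agent line to open a new group
  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      if (sawRule) {
        // A User-agent line after rules starts a new group;
        // consecutive User-agent lines share one group.
        current = { agents: [], disallow: [], allow: [] };
        groups.push(current);
        sawRule = false;
      }
      current.agents.push(value.toLowerCase());
    } else if ((field === 'disallow' || field === 'allow') && current) {
      current[field].push(value);
      sawRule = true;
    }
  }
  return groups;
}

function classify(body) {
  const groups = parseGroups(body);
  const iaGroups = groups.filter((g) => g.agents.some((a) => IA_AGENTS.includes(a)));
  if (iaGroups.length > 0) {
    // An empty Disallow value means "nothing disallowed", so drop it.
    const disallows = iaGroups.flatMap((g) => g.disallow).filter((p) => p !== '');
    if (disallows.includes('/')) return 'blocked';
    if (disallows.length > 0) return 'partial';
    return 'open'; // IA named explicitly with no restriction: overrides any wildcard block
  }
  const wildcardBlocksAll = groups
    .filter((g) => g.agents.includes('*'))
    .some((g) => g.disallow.includes('/'));
  return wildcardBlocksAll ? 'blocked' : 'open';
}
```

For example, `classify('User-agent: ia_archiver\nDisallow: /private/')` yields `'partial'`, while a body with only `User-agent: *` and `Disallow: /` yields `'blocked'`.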
The Wayback CDX last-snapshot timestamp approximates when the blackout started.
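As a sketch of how that timestamp might be extracted: with `output=json` the CDX server returns an array of rows whose first row holds the field names, and timestamps use the 14-digit `YYYYMMDDhhmmss` form in UTC. The helper names below are hypothetical, and treating an empty or header-only response as "no snapshot" is an assumption.

```js
// Sketch only: helper names and null handling are assumptions.

// Convert a 14-digit CDX timestamp (YYYYMMDDhhmmss, UTC) into a Date.
function parseCdxTimestamp(ts) {
  const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
  if (!m) return null;
  const [, y, mo, d, h, mi, s] = m.map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi, s));
}

// rows is the parsed JSON body: [[...fieldNames], [...values], ...].
function lastSnapshotFromCdx(rows) {
  if (!Array.isArray(rows) || rows.length < 2) return null; // empty or header-only
  const tsIndex = rows[0].indexOf('timestamp');
  if (tsIndex === -1) return null;
  // With limit=-1 the query returns only the most recent capture.
  return parseCdxTimestamp(rows[rows.length - 1][tsIndex]);
}
```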
## Endpoints
Everything is mounted under /ia-blackout-pulse. No auth.
### Public pages
- GET /ia-blackout-pulse/ — map + sortable, searchable, filter-chip-driven table.
- GET /ia-blackout-pulse/feed — reverse-chrono diff feed of newly detected blackouts, with "Copy as Markdown" for newsletters.
- GET /ia-blackout-pulse/outlet/:slug — per-outlet detail page: status history, raw robots.txt body with offending lines highlighted, deep Wayback link, "get shareable card" button.
- GET /ia-blackout-pulse/card/:slug — 1200×630 shareable card with full OG / Twitter meta tags. Designed to render in link previews on Bluesky / X / Mastodon. Includes a "Copy embed" snippet.
- GET /ia-blackout-pulse/map — standalone fullscreen choropleth.
- GET /ia-blackout-pulse/about — methodology disclosure.
### JSON / CSV API
- GET /ia-blackout-pulse/health — {ok:true}.
- GET /ia-blackout-pulse/api/outlets?q=&state=&status=&sort=&dir=&limit=&offset= — outlet list with server-side filtering, sorting, pagination.
- GET /ia-blackout-pulse/api/outlets/:slug — one outlet with the last 50 status_history rows attached.
- GET /ia-blackout-pulse/api/robots/:slug — the most recent raw robots.txt body we captured for that outlet, its SHA-256, the HTTP status we saw, and the parsed verdict. Use this to audit our classification.
- GET /ia-blackout-pulse/api/feed?limit= — entries ordered by blackout_detected_at desc.
- GET /ia-blackout-pulse/api/stats — totals, per-status counts, by_state breakdown ({total, blocked, partial, open, unreachable} per state), last-run metadata.
- GET /ia-blackout-pulse/api/runs — cron-run history for transparency.
- POST /ia-blackout-pulse/api/run — manually trigger a full run (no auth; idempotent — refuses with HTTP 409 if a run is already in progress).
- GET /ia-blackout-pulse/export.csv — full table dump.
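For scripted use of the outlet list, a client might build the query string like this. The sketch uses only the parameter names listed above; `outletsUrl` is a hypothetical helper, and the accepted values for `sort`, `dir`, and `state` are not documented here, so none are assumed.

```js
// Sketch only: a hypothetical client-side helper for /api/outlets.
const OUTLET_PARAMS = ['q', 'state', 'status', 'sort', 'dir', 'limit', 'offset'];

function outletsUrl(origin, params = {}) {
  const url = new URL('/ia-blackout-pulse/api/outlets', origin);
  for (const key of OUTLET_PARAMS) {
    const value = params[key];
    // Skip unset filters so the server applies its own defaults.
    if (value !== undefined && value !== null && value !== '') {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}
```

`outletsUrl('http://localhost:4846', { status: 'blocked', limit: 10 })` produces the same URL as the `curl` example in the local-run instructions.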
## Schema
SQLite (better-sqlite3, WAL mode). See db.js for the exact CREATE statements. Tables:
- `outlets`: canonical roster, one row per unique domain. `current_status` is a derived cache of the latest `status_history` row for that outlet.
- `status_history`: one row per (outlet, run). Stores the raw `robots.txt` body (capped at 64 KB), its SHA-256, the parsed verdict, the Wayback CDX last-snapshot timestamp, HTTP status, and any error message.
- `runs`: one row per scrape run. Counters for outlets seen / added / newly-blocked / unreachable, plus error totals.
- `roster_snapshots`: raw Nieman Lab HTML preserved per fetch for provenance.
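The 64 KB cap and SHA-256 stored in `status_history` could be produced along these lines. This is a sketch: `capAndHash` is a hypothetical name, and hashing the capped bytes rather than the full body is an assumption (see `db.js` and the probe code for what is actually stored).

```js
// Sketch only: whether the hash covers the capped or the full body is assumed.
const { createHash } = require('node:crypto');

const MAX_BODY_BYTES = 64 * 1024;

function capAndHash(body) {
  const bytes = Buffer.from(body, 'utf8');
  const capped = bytes.subarray(0, MAX_BODY_BYTES);
  return {
    // Cutting at a byte boundary can split a multi-byte character at the very end.
    body: capped.toString('utf8'),
    truncated: bytes.length > MAX_BODY_BYTES,
    sha256: createHash('sha256').update(capped).digest('hex'),
  };
}
```

Comparing the SHA-256 of two consecutive rows is a cheap way to tell whether an outlet's `robots.txt` changed between runs.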
## Cron schedule
- `0 4 * * 0` (Sun 04:00 local): full run (roster fetch + `robots.txt` probe + Wayback CDX probe).
- `0 5 * * *` (daily 05:00): recompute `days_dark = now - last_snapshot_at` for every outlet with a snapshot.
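The daily `days_dark` recompute reduces to a date difference. A minimal sketch, assuming the value is stored as whole elapsed days and that outlets without a snapshot keep a null value; `daysDark` is a hypothetical helper name.

```js
// Sketch only: rounding down to whole days is an assumption.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function daysDark(lastSnapshotAt, now = new Date()) {
  if (!lastSnapshotAt) return null; // no snapshot recorded for this outlet
  const last = new Date(lastSnapshotAt);
  if (Number.isNaN(last.getTime())) return null; // unparseable timestamp
  return Math.max(0, Math.floor((now.getTime() - last.getTime()) / MS_PER_DAY));
}
```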
On boot, if outlets is empty, the server kicks off one async bootstrap run so the first deploy lands with data.
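Both the bootstrap run and the manual `POST /ia-blackout-pulse/api/run` trigger imply that only one run executes at a time. One way to get that behavior is a small in-memory guard, sketched below. `createRunGuard` and the shape of its return values are hypothetical; only the 409 refusal comes from the endpoint description.

```js
// Sketch only: a single-flight guard around a hypothetical runFn.
function createRunGuard(runFn) {
  let inProgress = false;
  return {
    isRunning: () => inProgress,
    async trigger() {
      // Refuse instead of queueing, matching the documented HTTP 409.
      if (inProgress) return { accepted: false, httpStatus: 409 };
      inProgress = true;
      try {
        await runFn();
        return { accepted: true };
      } finally {
        inProgress = false; // released even when the run throws
      }
    },
  };
}
```

Because the flag lives in process memory, this only protects a single server process, which matches a single-node SQLite deployment.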
## Run locally
```bash
npm install
PORT=4846 node server.js
# Server logs:
# ia-blackout-pulse listening on :4846/ia-blackout-pulse (development)
# [bootstrap] outlets table empty — kicking off initial full run
# Try it:
curl http://localhost:4846/ia-blackout-pulse/health
curl 'http://localhost:4846/ia-blackout-pulse/api/stats'
curl 'http://localhost:4846/ia-blackout-pulse/api/outlets?status=blocked&limit=10'
open http://localhost:4846/ia-blackout-pulse/
```
Useful env vars:
| Var | Default | Purpose |
|---|---|---|
| PORT | 4846 | HTTP port |
| BASE_PATH | /ia-blackout-pulse | URL prefix |
| BOOTSTRAP_PROBE_LIMIT | 0 (=all) | Cap probes on first run (useful in dev) |
| ROBOTS_SPACING_MS | 400 | Per-iteration sleep between robots fetches |
| WAYBACK_SPACING_MS | 1100 | Used in the per-iteration sleep alongside robots spacing (CDX is ~1 req/sec rate-limited) |
| DB_PATH | ./data/ia-blackout-pulse.db | SQLite location |
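Reading these variables with their defaults might look like the following sketch. `loadConfig` and the camelCase field names are hypothetical; falling back to the default on an unparseable number is an assumption.

```js
// Sketch only: a hypothetical config loader using the defaults from the table.
function loadConfig(env = process.env) {
  const int = (name, fallback) => {
    const parsed = Number.parseInt(env[name] ?? '', 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    port: int('PORT', 4846),
    basePath: env.BASE_PATH || '/ia-blackout-pulse',
    bootstrapProbeLimit: int('BOOTSTRAP_PROBE_LIMIT', 0), // 0 means "probe every outlet"
    robotsSpacingMs: int('ROBOTS_SPACING_MS', 400),
    waybackSpacingMs: int('WAYBACK_SPACING_MS', 1100),
    dbPath: env.DB_PATH || './data/ia-blackout-pulse.db',
  };
}
```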
## Stack
Node.js ≥ 22, Express, better-sqlite3 (WAL), helmet, compression, node-cron, cheerio (HTML scrape of the Nieman article).
No LLMs. No RAG. No external paid APIs. No keys.
## Honest limits
- The Nieman Lab piece names 340+ outlets but does not embed the list inline. The roster is assembled from the article's external links plus the linked methodology dataset (Ben Welsh's `news-homepages` `sites.csv`, 1,167 outlets globally, of which ~895 are tagged country=US). We filter to US and probe weekly.
- State assignment is best-effort. We resolve state from the `location` column of the methodology CSV, from inline article context, and from a small built-in city → state lookup. Some outlets remain `state=NULL` until enriched. There is no hardcoded outlet-to-state seed table.
- The Wayback CDX last-snapshot is an approximation of the blackout start. The last successful snapshot can be earlier than the actual blackout start if subsequent crawls happened to fail before the robots.txt change went in.
- We track `ia_archiver` and `archive.org_bot` per spec scope. Generalized AI-crawler policy (GPTBot, ClaudeBot, etc.) is intentionally out of scope; that's `crawl-policy`'s remit.
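The US filter over the methodology CSV can be sketched without a CSV dependency. The `country` and `location` column names come from the notes above; the `url` column name, the `usOutlets` helper, and the returned object shape are assumptions made for illustration.

```js
// Sketch only: a small quote-aware CSV reader plus the country=US filter.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; } // escaped quote
      else if (ch === '"') inQuotes = false;
      else field += ch;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field); field = '';
    } else if (ch === '\n') {
      row.push(field); rows.push(row); row = []; field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }
  if (field !== '' || row.length > 0) { row.push(field); rows.push(row); } // last line without newline
  return rows;
}

function usOutlets(csvText) {
  const [header, ...data] = parseCsv(csvText);
  const col = (name) => header.indexOf(name);
  return data
    .filter((r) => r[col('country')] === 'US')
    .map((r) => ({ url: r[col('url')], location: r[col('location')] || null }));
}
```

Rows with an empty `location` come back with `location: null`, which lines up with outlets staying `state=NULL` until enriched.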
## License
Dataset (CSV / JSON export): CC0. Cite freely.