# ia-blackout-pulse

Live tracker of US local-news outlets that have started blocking the Internet Archive — with per-outlet robots.txt evidence, a US choropleth map, a weekly diff feed, sortable table, CSV export, and shareable per-outlet "archive blackout" cards.

In May 2026 Nieman Lab named 340+ outlets but published no list, no timeline, no methodology artifact. This service reproduces the work and keeps it current.

## What it does

Four real data sources, fetched at runtime, refreshed by node-cron:

| Source | URL | Refresh interval |
|---|---|---|
| Nieman Lab article (provenance) | https://www.niemanlab.org/2026/05/more-than-340-local-news-outlets-are-limiting-the-internet-archives-access-to-their-journalism/ | weekly, `0 4 * * 0` (Sun 04:00) |
| Linked methodology dataset | https://raw.githubusercontent.com/palewire/news-homepages/main/newshomepages/sources/sites.csv — cited inline by the article as the "database of 1,167 news websites" maintained by Ben Welsh at palewi.re/docs/news-homepages | weekly (alongside the Nieman fetch) |
| Outlet /robots.txt | https://{domain}/robots.txt — one fetch per US outlet | weekly, staggered ~0.4 s between fetches |
| Wayback Machine CDX API | http://web.archive.org/cdx/search/cdx?url={domain}&output=json&limit=-1&filter=statuscode:200&fl=timestamp,original&from=20200101 | weekly, ~1.1 s between fetches |

User-Agent on every outbound fetch: ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research). Timeout 15 s, no retries within a run.
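
The outbound-fetch policy above (fixed User-Agent, 15 s timeout, no retries, a pause between probes) could be sketched roughly as follows. The function names `requestInit`, `fetchOnce`, and `probeAll` are hypothetical and not taken from the project's source; the sketch relies only on the global `fetch` and `AbortSignal.timeout` that ship with Node.js 22.

```javascript
// Sketch only: helper names are illustrative assumptions, not the project's code.
const USER_AGENT =
  'ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)';
const TIMEOUT_MS = 15_000;

// Options passed to fetch(): fixed User-Agent plus a 15 s abort timer.
function requestInit() {
  return {
    headers: { 'User-Agent': USER_AGENT },
    redirect: 'follow',
    signal: AbortSignal.timeout(TIMEOUT_MS),
  };
}

// Fetch one URL exactly once. Failures are reported to the caller, never retried.
async function fetchOnce(url) {
  try {
    const res = await fetch(url, requestInit());
    return { ok: true, status: res.status, body: await res.text() };
  } catch (err) {
    return { ok: false, status: null, error: String((err && err.message) || err) };
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Probe domains one at a time, sleeping spacingMs after each fetch.
// The fetcher parameter exists so the loop can be exercised without network access.
async function probeAll(domains, spacingMs = 400, fetcher = fetchOnce) {
  const results = [];
  for (const domain of domains) {
    const result = await fetcher(`https://${domain}/robots.txt`);
    results.push({ domain, ...result });
    await sleep(spacingMs);
  }
  return results;
}
```

Keeping the loop sequential makes the spacing easy to reason about: with the default 400 ms, probing roughly 900 outlets spends about six minutes in sleeps alone, on top of the fetch time itself.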

## Classification rules

Each outlet's /robots.txt is parsed (RFC 9309-ish). Verdict:

- blocked: explicit Disallow: / for ia_archiver or archive.org_bot, OR a wildcard * group with Disallow: / and no IA-specific override.
- partial: there is an IA-specific User-agent group with non-/ Disallow rules (IA explicitly named, but not entirely blocked).
- open: robots.txt fetched successfully, no IA-targeted restriction.
- unreachable: fetch failed, timed out, or returned an HTML error page.
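
A minimal sketch of how these verdicts could be derived from a robots.txt body. This is an illustration under simplifying assumptions, not the project's actual parser: `parseGroups` and `classify` are hypothetical names, Allow rules are ignored, and an HTML error page is detected only by a leading `<!doctype` or `<html` tag.

```javascript
const IA_AGENTS = ['ia_archiver', 'archive.org_bot'];

// Split a robots.txt body into groups of { agents, disallow }.
// Consecutive User-agent lines share one group, as in RFC 9309.
function parseGroups(body) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // drop comments
    const m = /^([A-Za-z-]+)\s*:\s*(.*)$/.exec(line);
    if (!m) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();
    if (field === 'user-agent') {
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      // An empty Disallow value means "allow everything", so it is skipped.
      if (field === 'disallow' && current && value !== '') {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// Map a fetched body (or null on fetch failure) to one of the four verdicts.
function classify(body) {
  if (typeof body !== 'string' || /^\s*<(!doctype|html)/i.test(body)) {
    return 'unreachable';
  }
  const groups = parseGroups(body);
  const ia = groups.filter((g) => g.agents.some((a) => IA_AGENTS.includes(a)));
  if (ia.length > 0) {
    // An IA-specific group overrides the wildcard group entirely.
    const rules = ia.flatMap((g) => g.disallow);
    if (rules.includes('/')) return 'blocked';
    return rules.length > 0 ? 'partial' : 'open';
  }
  const wildcard = groups.filter((g) => g.agents.includes('*'));
  return wildcard.some((g) => g.disallow.includes('/')) ? 'blocked' : 'open';
}
```

For example, `classify('User-agent: archive.org_bot\nDisallow: /private/')` yields `partial` under these rules, while a wildcard `Disallow: /` followed by an IA group with an empty `Disallow:` yields `open`, because the IA group acts as the override.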

The Wayback CDX last-snapshot timestamp approximates when the blackout started.
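
As a rough illustration of reading that timestamp: with `output=json`, the CDX API returns an array of rows whose first row is a header, and timestamps are 14-digit `YYYYMMDDhhmmss` strings in UTC. The helper names below are hypothetical, and the sketch assumes the response has already been parsed with `JSON.parse`.

```javascript
// Convert a 14-digit CDX timestamp (UTC) into a Date, or null if malformed.
function parseCdxTimestamp(ts) {
  const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(String(ts));
  if (!m) return null;
  const [y, mo, d, h, mi, s] = m.slice(1).map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi, s));
}

// Given parsed CDX JSON rows, return the Date of the last row, or null
// when the response has no data rows (for example an empty array).
function lastSnapshot(rows) {
  if (!Array.isArray(rows) || rows.length < 2) return null;
  const idx = rows[0].indexOf('timestamp'); // locate the column via the header row
  if (idx === -1) return null;
  return parseCdxTimestamp(rows[rows.length - 1][idx]);
}
```

With `limit=-1` in the query, the response should contain the header plus a single most-recent capture, so the last row is the one of interest.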

## Endpoints

Everything is mounted under /ia-blackout-pulse. No auth.

### Public pages
- GET /ia-blackout-pulse/ — map + sortable, searchable, filter-chip-driven table.
- GET /ia-blackout-pulse/feed — reverse-chrono diff feed of newly detected blackouts, with "Copy as Markdown" for newsletters.
- GET /ia-blackout-pulse/outlet/:slug — per-outlet detail page: status history, raw robots.txt body with offending lines highlighted, deep Wayback link, "get shareable card" button.
- GET /ia-blackout-pulse/card/:slug — 1200×630 shareable card with full OG / Twitter meta tags. Designed to render in link previews on Bluesky / X / Mastodon. Includes a "Copy embed" snippet.
- GET /ia-blackout-pulse/map — standalone fullscreen choropleth.
- GET /ia-blackout-pulse/about — methodology disclosure.

### JSON / CSV API
- GET /ia-blackout-pulse/health — {ok:true}.
- GET /ia-blackout-pulse/api/outlets?q=&state=&status=&sort=&dir=&limit=&offset= — outlet list with server-side filtering, sorting, pagination.
- GET /ia-blackout-pulse/api/outlets/:slug — one outlet with the last 50 status_history rows attached.
- GET /ia-blackout-pulse/api/robots/:slug — the most recent raw robots.txt body we captured for that outlet, its SHA-256, the HTTP status we saw, and the parsed verdict. Use this to audit our classification.
- GET /ia-blackout-pulse/api/feed?limit= — entries ordered by blackout_detected_at desc.
- GET /ia-blackout-pulse/api/stats — totals, per-status counts, by_state breakdown ({total, blocked, partial, open, unreachable} per state), last-run metadata.
- GET /ia-blackout-pulse/api/runs — cron-run history for transparency.
- POST /ia-blackout-pulse/api/run — manually trigger a full run (no auth; idempotent — refuses with HTTP 409 if a run is already in progress).
- GET /ia-blackout-pulse/export.csv — full table dump.
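
A small client-side sketch of calling two of these endpoints from Node.js. The helper names are hypothetical, the default `/ia-blackout-pulse` prefix is assumed, and the response bodies are not inspected because their exact shape is not documented here; only the HTTP 409 behaviour described above is relied on.

```javascript
// Build an /api/outlets URL, skipping empty filters.
function outletsUrl(base, params = {}) {
  const url = new URL('/ia-blackout-pulse/api/outlets', base);
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null && value !== '') {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

// Trigger a manual run. A 409 means a run is already in progress,
// which is reported as started: false rather than treated as an error.
async function triggerRun(base) {
  const res = await fetch(new URL('/ia-blackout-pulse/api/run', base), {
    method: 'POST',
  });
  if (res.status === 409) return { started: false, status: 409 };
  if (!res.ok) throw new Error(`run request failed with HTTP ${res.status}`);
  return { started: true, status: res.status };
}
```

For instance, `outletsUrl('http://localhost:4846', { status: 'blocked', limit: 10 })` produces the same URL used in the curl example under "Run locally".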

## Schema

SQLite (better-sqlite3, WAL mode). See db.js for the exact CREATE statements. Tables:

- outlets: canonical roster, one row per unique domain. current_status is a derived cache of the latest status_history row for that outlet.
- status_history: one row per (outlet, run). Stores the raw robots.txt body (capped at 64 KB), its SHA-256, the parsed verdict, the Wayback CDX last-snapshot timestamp, HTTP status, and any error message.
- runs: one row per scrape run. Counters for outlets seen / added / newly-blocked / unreachable, plus error totals.
- roster_snapshots: raw Nieman Lab HTML preserved per fetch for provenance.
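
The body cap and hash stored in status_history could be computed along these lines. This is a sketch using only `node:crypto`; the helper names are hypothetical, and whether the project hashes the body before or after capping is an assumption (the sketch hashes the capped text, so the stored hash matches the stored body).

```javascript
const { createHash } = require('node:crypto');

const MAX_BODY_BYTES = 64 * 1024;

// Truncate a body to at most 64 KB of UTF-8. If the cut lands inside a
// multi-byte character, decoding replaces the fragment with U+FFFD.
function capBody(body) {
  const buf = Buffer.from(body, 'utf8');
  if (buf.length <= MAX_BODY_BYTES) return body;
  return buf.subarray(0, MAX_BODY_BYTES).toString('utf8');
}

// Hex-encoded SHA-256 of a string.
function sha256Hex(text) {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Assumed row shape: hash what is actually stored.
function toStoredBody(body) {
  const capped = capBody(body);
  return { body: capped, sha256: sha256Hex(capped) };
}
```

Comparing the SHA-256 between consecutive runs is a cheap way to tell whether an outlet's robots.txt changed at all before re-parsing it.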

## Cron schedule

- `0 4 * * 0` (Sun 04:00 local): full run, consisting of roster fetch + robots.txt probe + Wayback CDX probe.
- `0 5 * * *` (daily 05:00): recompute days_dark = now - last_snapshot_at for every outlet with a snapshot.
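
The daily days_dark recomputation reduces to a date difference. A minimal sketch, assuming last_snapshot_at is stored as an ISO 8601 string and that partial days are rounded down (both are assumptions, as is the function name):

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed since the last successful Wayback snapshot.
// Returns null when there is no snapshot or the value cannot be parsed.
function daysDark(lastSnapshotAt, now = new Date()) {
  if (!lastSnapshotAt) return null;
  const then = new Date(lastSnapshotAt).getTime();
  if (Number.isNaN(then)) return null;
  // Clamp at zero in case of clock skew or a snapshot dated in the future.
  return Math.max(0, Math.floor((now.getTime() - then) / MS_PER_DAY));
}
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to check against fixed dates.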

On boot, if outlets is empty, the server kicks off one async bootstrap run so the first deploy lands with data.

## Run locally

```sh
npm install
PORT=4846 node server.js
# Server logs:
#   ia-blackout-pulse listening on :4846/ia-blackout-pulse (development)
#   [bootstrap] outlets table empty — kicking off initial full run

# Try it:
curl http://localhost:4846/ia-blackout-pulse/health
curl 'http://localhost:4846/ia-blackout-pulse/api/stats'
curl 'http://localhost:4846/ia-blackout-pulse/api/outlets?status=blocked&limit=10'
open http://localhost:4846/ia-blackout-pulse/
```

Useful env vars:

| Var | Default | Purpose |
|---|---|---|
| PORT | 4846 | HTTP port |
| BASE_PATH | /ia-blackout-pulse | URL prefix |
| BOOTSTRAP_PROBE_LIMIT | 0 (=all) | Cap probes on first run (useful in dev) |
| ROBOTS_SPACING_MS | 400 | Per-iteration sleep between robots fetches |
| WAYBACK_SPACING_MS | 1100 | Per-iteration sleep for Wayback CDX fetches, applied alongside ROBOTS_SPACING_MS (CDX is rate-limited to roughly 1 req/sec) |
| DB_PATH | ./data/ia-blackout-pulse.db | SQLite location |
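
One way these variables and their defaults might be read at startup. This is a sketch, not the project's config loader; the `loadConfig` name and the returned property names are hypothetical, and invalid or negative numbers are assumed to fall back to the documented default.

```javascript
// Read configuration from an env-like object, applying the defaults from the table.
function loadConfig(env = process.env) {
  const int = (name, fallback) => {
    const n = Number.parseInt(env[name] ?? '', 10);
    return Number.isFinite(n) && n >= 0 ? n : fallback;
  };
  return {
    port: int('PORT', 4846),
    basePath: env.BASE_PATH || '/ia-blackout-pulse',
    bootstrapProbeLimit: int('BOOTSTRAP_PROBE_LIMIT', 0), // 0 means probe all outlets
    robotsSpacingMs: int('ROBOTS_SPACING_MS', 400),
    waybackSpacingMs: int('WAYBACK_SPACING_MS', 1100),
    dbPath: env.DB_PATH || './data/ia-blackout-pulse.db',
  };
}
```

Accepting the environment as a parameter means the defaults can be checked with a plain object, for example `loadConfig({ PORT: '8080' }).port`.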

## Stack

Node.js ≥ 22, Express, better-sqlite3 (WAL), helmet, compression, node-cron, cheerio (HTML scrape of the Nieman article).

No LLMs. No RAG. No external paid APIs. No keys.

## Honest limits

- The Nieman Lab piece names 340+ outlets but does not embed the list inline. The roster is assembled from the article's external links plus the linked methodology dataset (Ben Welsh's news-homepages sites.csv, 1,167 outlets globally, of which ~895 are tagged country=US). We filter to US and probe weekly.
- State assignment is best-effort. We resolve state from the location column of the methodology CSV, from inline article context, and from a small built-in city → state lookup. Some outlets remain state=NULL until enriched. There is no hardcoded outlet-to-state seed table.
- The Wayback CDX last-snapshot is an approximation of the blackout start. A site may have stopped responding to IA earlier than its last successful snapshot suggests if subsequent crawls happened to fail before the robots.txt change went in.
- We track only the ia_archiver and archive.org_bot user agents; that is the full scope of this project. Generalized AI-crawler policy (GPTBot, ClaudeBot, etc.) is intentionally out of scope and is crawl-policy's remit.

## License

Dataset (CSV / JSON export): CC0. Cite freely.