
IA Blackout Pulse

Live tracker of US local-news outlets blocking the Internet Archive — map, robots.txt evidence, shareable cards.


# ia-blackout-pulse

Live tracker of US local-news outlets that have started blocking the Internet Archive — with per-outlet robots.txt evidence, a US choropleth map, a weekly diff feed, sortable table, CSV export, and shareable per-outlet "archive blackout" cards.

In May 2026 Nieman Lab reported that 340+ outlets were limiting the Internet Archive's access, but published no list, no timeline, and no methodology artifact. This service reproduces the work and keeps it current.

## What it does

Four real data sources, fetched at runtime, refreshed by node-cron:

| Source | URL | Refresh interval |
|---|---|---|
| Nieman Lab article (provenance) | https://www.niemanlab.org/2026/05/more-than-340-local-news-outlets-are-limiting-the-internet-archives-access-to-their-journalism/ | weekly: 0 4 * * 0 (Sun 04:00) |
| Linked methodology dataset | https://raw.githubusercontent.com/palewire/news-homepages/main/newshomepages/sources/sites.csv — cited inline by the article as the "database of 1,167 news websites" maintained by Ben Welsh at palewi.re/docs/news-homepages | weekly (alongside the Nieman fetch) |
| Outlet /robots.txt | https://{domain}/robots.txt — one fetch per US outlet | weekly, staggered ~0.4 s between fetches |
| Wayback Machine CDX API | http://web.archive.org/cdx/search/cdx?url={domain}&output=json&limit=-1&filter=statuscode:200&fl=timestamp,original&from=20200101 | weekly, ~1.1 s between fetches |

User-Agent on every outbound fetch: ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research). Timeout 15 s, no retries within a run.
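As a rough sketch of what one outbound fetch with these settings might look like: the helper names `buildRobotsRequest`, `fetchRobots`, and `probeAll` are hypothetical, and the use of Node's built-in `fetch` is an assumption, since the service's actual HTTP client is not shown here.

```javascript
// User-Agent, timeout, and spacing values come from the text above;
// the function names and return shapes are illustrative assumptions.
const USER_AGENT =
  'ia-blackout-pulse/1.0 (+https://holyai.me/ia-blackout-pulse; research)';
const TIMEOUT_MS = 15_000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Build the URL and fetch options for one outlet (hypothetical helper).
function buildRobotsRequest(domain) {
  return {
    url: `https://${domain}/robots.txt`,
    options: { headers: { 'User-Agent': USER_AGENT } },
  };
}

// Fetch one robots.txt with a 15 s timeout and no retry.
// Failures are returned as data rather than thrown, so a run can continue.
async function fetchRobots(domain) {
  const { url, options } = buildRobotsRequest(domain);
  try {
    const res = await fetch(url, {
      ...options,
      signal: AbortSignal.timeout(TIMEOUT_MS),
    });
    return { domain, httpStatus: res.status, body: await res.text() };
  } catch (err) {
    return { domain, httpStatus: null, body: null, error: String(err) };
  }
}

// Probe outlets one at a time, sleeping between fetches to stagger load.
async function probeAll(domains, spacingMs = 400) {
  const results = [];
  for (const domain of domains) {
    results.push(await fetchRobots(domain));
    await sleep(spacingMs);
  }
  return results;
}
```

Running the probes sequentially, rather than in parallel, keeps the spacing guarantee simple: no outlet or upstream API sees more than one request per sleep interval.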

## Classification rules

Each outlet's /robots.txt is parsed (RFC 9309-ish). Verdict:
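The sketch below is one hedged reading of how such a verdict could be computed. The crawler tokens (`ia_archiver`, `archive.org_bot`), the rule that only groups naming those tokens count, and the function names are all assumptions for illustration, not the service's actual rules. The four status names match those in the by_state breakdown of the stats endpoint.

```javascript
// ASSUMPTION: the crawler tokens actually checked are not listed in this README.
const IA_AGENTS = ['ia_archiver', 'archive.org_bot'];

// Split a robots.txt body into groups: consecutive User-agent lines
// start a group, and the Allow/Disallow lines that follow belong to it.
function parseGroups(body) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === 'user-agent') {
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((key === 'allow' || key === 'disallow') && current) {
      current.rules.push({ type: key, path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Simplified verdict: looks only at Disallow lines in groups that name an
// Internet Archive token. It ignores Allow overrides and the `*` group.
function classify(body) {
  if (body == null) return 'unreachable';
  const disallows = parseGroups(body)
    .filter((g) => g.agents.some((a) => IA_AGENTS.includes(a)))
    .flatMap((g) => g.rules)
    .filter((r) => r.type === 'disallow' && r.path !== '');
  if (disallows.some((r) => r.path === '/')) return 'blocked';
  if (disallows.length > 0) return 'partial';
  return 'open';
}
```

A stricter implementation would resolve Allow and Disallow precedence by longest match, as RFC 9309 describes, before declaring an outlet blocked.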

The Wayback CDX last-snapshot timestamp approximates when the blackout started.
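A minimal sketch of that lookup, mirroring the CDX URL template from the sources table. The helper names are hypothetical, and the parsing assumes the CDX JSON output is an array of rows whose first row names the fields, which is our understanding of the API rather than something this README states.

```javascript
// Build the CDX query for one domain (parameters copied from the sources table).
function buildCdxUrl(domain) {
  const params = new URLSearchParams({
    url: domain,
    output: 'json',
    limit: '-1', // negative limit: only the most recent capture
    filter: 'statuscode:200',
    fl: 'timestamp,original',
    from: '20200101',
  });
  return `http://web.archive.org/cdx/search/cdx?${params}`;
}

// CDX timestamps are 14-digit UTC strings: YYYYMMDDhhmmss.
function parseCdxTimestamp(ts) {
  const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
  if (!m) return null;
  const [, y, mo, d, h, mi, s] = m.map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi, s));
}

// ASSUMPTION: rows[0] is a header such as ["timestamp", "original"].
// Returns the Date of the last capture, or null when there is none.
function lastSnapshot(rows) {
  if (!Array.isArray(rows) || rows.length < 2) return null;
  const [header, ...data] = rows;
  const tsIndex = header.indexOf('timestamp');
  if (tsIndex === -1) return null;
  return parseCdxTimestamp(data[data.length - 1][tsIndex]);
}
```

Because the timestamp is only the last successful capture, it is a lower bound on when the block began: the outlet may have been crawled infrequently even before it changed its robots.txt.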

## Endpoints

Everything is mounted under /ia-blackout-pulse. No auth.

### Public pages
- GET /ia-blackout-pulse/ — map + sortable, searchable, filter-chip-driven table.
- GET /ia-blackout-pulse/feed — reverse-chrono diff feed of newly detected blackouts, with "Copy as Markdown" for newsletters.
- GET /ia-blackout-pulse/outlet/:slug — per-outlet detail page: status history, raw robots.txt body with offending lines highlighted, deep Wayback link, "get shareable card" button.
- GET /ia-blackout-pulse/card/:slug — 1200×630 shareable card with full OG / Twitter meta tags. Designed to render in link previews on Bluesky / X / Mastodon. Includes a "Copy embed" snippet.
- GET /ia-blackout-pulse/map — standalone fullscreen choropleth.
- GET /ia-blackout-pulse/about — methodology disclosure.
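For the card page, the link-preview tags could be generated along these lines. This is only a sketch: the function names, the input fields, and the assumption that each card has a separate image URL are illustrative, not taken from the service's code.

```javascript
// Escape a value for safe use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build Open Graph and Twitter card tags for a 1200x630 preview image.
// ASSUMPTION: title, description, pageUrl, and imageUrl are supplied by
// the caller; the real card route may derive them differently.
function cardMetaTags({ title, description, pageUrl, imageUrl }) {
  const tags = [
    ['property', 'og:type', 'website'],
    ['property', 'og:title', title],
    ['property', 'og:description', description],
    ['property', 'og:url', pageUrl],
    ['property', 'og:image', imageUrl],
    ['property', 'og:image:width', '1200'],
    ['property', 'og:image:height', '630'],
    ['name', 'twitter:card', 'summary_large_image'],
    ['name', 'twitter:title', title],
    ['name', 'twitter:description', description],
    ['name', 'twitter:image', imageUrl],
  ];
  return tags
    .map(([attr, key, value]) => `<meta ${attr}="${key}" content="${escapeAttr(value)}">`)
    .join('\n');
}
```

Escaping matters here because outlet names come from external data and end up inside HTML attributes.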

### JSON / CSV API
- GET /ia-blackout-pulse/health — {ok:true}.
- GET /ia-blackout-pulse/api/outlets?q=&state=&status=&sort=&dir=&limit=&offset= — outlet list with server-side filtering, sorting, pagination.
- GET /ia-blackout-pulse/api/outlets/:slug — one outlet with the last 50 status_history rows attached.
- GET /ia-blackout-pulse/api/robots/:slug — the most recent raw robots.txt body we captured for that outlet, its SHA-256, the HTTP status we saw, and the parsed verdict. Use this to audit our classification.
- GET /ia-blackout-pulse/api/feed?limit= — entries ordered by blackout_detected_at desc.
- GET /ia-blackout-pulse/api/stats — totals, per-status counts, by_state breakdown ({total, blocked, partial, open, unreachable} per state), last-run metadata.
- GET /ia-blackout-pulse/api/runs — cron-run history for transparency.
- POST /ia-blackout-pulse/api/run — manually trigger a full run (no auth; idempotent — refuses with HTTP 409 if a run is already in progress).
- GET /ia-blackout-pulse/export.csv — full table dump.
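A small client sketch for these endpoints, assuming a local instance on the default port. The helper names are hypothetical, and the `body` and `sha256` field names on the robots response are assumptions, since the exact JSON shape is not documented here.

```javascript
const { createHash } = require('node:crypto');

// ASSUMPTION: a local instance on the default port and base path.
const BASE = 'http://localhost:4846/ia-blackout-pulse';

// Build an /api/outlets URL, skipping empty filters.
function buildOutletsUrl(filters = {}, base = BASE) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(filters)) {
    if (value !== undefined && value !== null && value !== '') {
      params.set(key, String(value));
    }
  }
  const query = params.toString();
  return `${base}/api/outlets${query ? `?${query}` : ''}`;
}

// Hex-encoded SHA-256 of a string, hashed as UTF-8.
function sha256Hex(text) {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Re-hash the captured robots.txt and compare it with the reported digest.
// ASSUMPTION: the response exposes `body` and `sha256` fields, and the
// server hashed the UTF-8 bytes of the body.
async function auditRobots(slug, base = BASE) {
  const res = await fetch(`${base}/api/robots/${encodeURIComponent(slug)}`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const record = await res.json();
  return { record, hashMatches: sha256Hex(record.body) === record.sha256 };
}

// Trigger a run; HTTP 409 means one is already in progress.
async function triggerRun(base = BASE) {
  const res = await fetch(`${base}/api/run`, { method: 'POST' });
  if (res.status === 409) return 'already-running';
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return 'started';
}
```

For example, `buildOutletsUrl({ status: 'blocked', limit: 10 })` produces the same query as the curl example in the local-run section.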

## Schema

SQLite (better-sqlite3, WAL mode). See db.js for the exact CREATE statements. Tables:

## Cron schedule

On boot, if the outlets table is empty, the server kicks off one asynchronous bootstrap run so that a first deploy comes up with data.
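The scheduling and bootstrap behaviour might be wired up roughly as follows. This sketch assumes the weekly schedule is the five-field expression `0 4 * * 0` (Sunday 04:00) and that `cron` is the node-cron module. `createScheduler`, `countOutlets`, and `runFull` are hypothetical names, injected here so the logic can be exercised without a database or a real timer.

```javascript
// ASSUMPTION: `cron` is the node-cron module (cron.schedule(expr, fn)),
// `countOutlets()` returns the row count of the outlets table, and
// `runFull(trigger)` performs one full refresh. All three are injected.
function createScheduler({ cron, countOutlets, runFull }) {
  let running = false;

  // Run at most one refresh at a time; returns false if one is in progress.
  async function guardedRun(trigger) {
    if (running) return false;
    running = true;
    try {
      await runFull(trigger);
      return true;
    } finally {
      running = false;
    }
  }

  const logFailure = (label) => (err) =>
    console.error(`[${label}] run failed:`, err);

  function start() {
    // Weekly refresh: minute 0, hour 4, any day of month, any month, Sunday.
    cron.schedule('0 4 * * 0', () => {
      guardedRun('cron').catch(logFailure('cron'));
    });
    // First deploy: populate an empty table without blocking startup.
    if (countOutlets() === 0) {
      guardedRun('bootstrap').catch(logFailure('bootstrap'));
    }
  }

  return { start, guardedRun, isRunning: () => running };
}
```

The same `running` flag is a natural place to hang the HTTP 409 response of the manual run endpoint, since a manual trigger and the cron job would then share one lock.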

## Run locally

```bash
npm install
PORT=4846 node server.js
# Server logs:
#   ia-blackout-pulse listening on :4846/ia-blackout-pulse (development)
#   [bootstrap] outlets table empty — kicking off initial full run

# Try it:
curl http://localhost:4846/ia-blackout-pulse/health
curl 'http://localhost:4846/ia-blackout-pulse/api/stats'
curl 'http://localhost:4846/ia-blackout-pulse/api/outlets?status=blocked&limit=10'
open http://localhost:4846/ia-blackout-pulse/
```

Useful env vars:

| Var | Default | Purpose |
|---|---|---|
| PORT | 4846 | HTTP port |
| BASE_PATH | /ia-blackout-pulse | URL prefix |
| BOOTSTRAP_PROBE_LIMIT | 0 (=all) | Cap probes on first run (useful in dev) |
| ROBOTS_SPACING_MS | 400 | Per-iteration sleep between robots fetches |
| WAYBACK_SPACING_MS | 1100 | Combined with ROBOTS_SPACING_MS in the per-iteration sleep (the CDX API is rate-limited to roughly 1 request per second) |
| DB_PATH | ./data/ia-blackout-pulse.db | SQLite location |
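One way these variables and defaults could be read at startup is sketched below. `loadConfig` and `intFromEnv` are hypothetical names, and falling back to the default on an invalid number is an assumption about desired behaviour rather than something the table specifies.

```javascript
// Parse a non-negative integer from an env value, or use the fallback.
function intFromEnv(value, fallback) {
  const n = Number.parseInt(value ?? '', 10);
  return Number.isFinite(n) && n >= 0 ? n : fallback;
}

// Defaults mirror the table above; the property names are illustrative.
function loadConfig(env = process.env) {
  return {
    port: intFromEnv(env.PORT, 4846),
    basePath: env.BASE_PATH || '/ia-blackout-pulse',
    // 0 means "probe every outlet" on the first run.
    bootstrapProbeLimit: intFromEnv(env.BOOTSTRAP_PROBE_LIMIT, 0),
    robotsSpacingMs: intFromEnv(env.ROBOTS_SPACING_MS, 400),
    waybackSpacingMs: intFromEnv(env.WAYBACK_SPACING_MS, 1100),
    dbPath: env.DB_PATH || './data/ia-blackout-pulse.db',
  };
}
```

Passing `env` as a parameter keeps the function easy to test, for instance `loadConfig({ PORT: '8080' })` without touching the real environment.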

## Stack

Node.js ≥ 22, Express, better-sqlite3 (WAL), helmet, compression, node-cron, cheerio (HTML scrape of the Nieman article).

No LLMs. No RAG. No external paid APIs. No keys.

## Honest limits

## License

Dataset (CSV / JSON export): CC0. Cite freely.