agent-horizon

Live leaderboard + reliability-decay calculator for frontier AI agents, built on METR's Time Horizon 1.1 public benchmark feed.

Live: https://holyai.me/agent-horizon/
Source data: https://metr.org/time-horizons/

What it does

Three things METR's own page doesn't expose directly:

Sortable, per-vendor leaderboard of all tracked frontier models: release date, average score, 50%- and 80%-time horizons with confidence intervals, plus a state-of-the-art (SOTA) badge.
Reliability-decay calculator: pick a model, pick N steps, and see P(success) = accuracy^N (per-step accuracy, assuming independent steps), along with the step counts at which the agent crosses the 90% / 50% / 10% reliability cliffs.
Movers feed: models whose METR horizon estimate has moved by ≥5% over the past seven days, plus new entries and new SOTAs.
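
The reliability-decay arithmetic can be sketched in a few lines of JavaScript. This is a minimal illustration rather than the project's actual code: the names `successProbability` and `cliffStep` are hypothetical, a "cliff" is taken to mean the first step count at which P(success) drops below the threshold, and every step is assumed to succeed independently with the same per-step accuracy.

```javascript
// Hypothetical helpers illustrating P(success) = accuracy^N.
// Assumption: each step succeeds independently with the same accuracy.

// Probability that all `steps` consecutive steps succeed.
function successProbability(accuracy, steps) {
  if (!(accuracy >= 0 && accuracy <= 1)) {
    throw new RangeError('accuracy must be in [0, 1]');
  }
  if (!Number.isInteger(steps) || steps < 0) {
    throw new RangeError('steps must be a non-negative integer');
  }
  return Math.pow(accuracy, steps);
}

// Smallest whole step count N (N >= 1) with accuracy^N < threshold.
// Returns Infinity when accuracy is 1, because reliability never decays.
function cliffStep(accuracy, threshold) {
  if (!(threshold > 0 && threshold < 1)) {
    throw new RangeError('threshold must be in (0, 1)');
  }
  if (accuracy >= 1) return Infinity;
  if (accuracy <= 0) return 1;
  // accuracy^N < threshold  <=>  N > ln(threshold) / ln(accuracy)
  let n = Math.max(1, Math.ceil(Math.log(threshold) / Math.log(accuracy)));
  // Correct for floating-point rounding in either direction.
  while (Math.pow(accuracy, n) >= threshold) n += 1;
  while (n > 1 && Math.pow(accuracy, n - 1) < threshold) n -= 1;
  return n;
}

// Step counts for the three cliffs named above.
function reliabilityCliffs(accuracy) {
  return {
    p90: cliffStep(accuracy, 0.9),
    p50: cliffStep(accuracy, 0.5),
    p10: cliffStep(accuracy, 0.1),
  };
}
```

With a per-step accuracy of 0.99, for instance, `cliffStep(0.99, 0.9)` is 11, since 0.99^10 ≈ 0.904 while 0.99^11 ≈ 0.895.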

No mock data, no seed values, no random jitter: every number on the page is read from a freshly fetched METR YAML file.

Data sources

| Source | URL | Refresh |
|---|---|---|
| METR benchmark results | https://metr.org/assets/benchmark_results_1_1.yaml | every 6 hours |
| METR task results | https://metr.org/assets/task_results_1_1.yaml | every 24 hours |

Both are public and need no authentication or API key.
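
The refresh intervals in the table can be expressed as a small staleness check. This is a sketch under stated assumptions: the `SOURCES` map, its key names, and the `isStale` helper are hypothetical and only mirror the URLs and intervals listed above. The project itself may schedule refreshes differently, for example with node-cron.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical registry mirroring the table above.
const SOURCES = {
  benchmark: {
    url: 'https://metr.org/assets/benchmark_results_1_1.yaml',
    refreshMs: 6 * HOUR_MS,
  },
  tasks: {
    url: 'https://metr.org/assets/task_results_1_1.yaml',
    refreshMs: 24 * HOUR_MS,
  },
};

// True when the source was never fetched or its refresh interval has elapsed.
function isStale(sourceKey, lastFetchedMs, nowMs = Date.now()) {
  const source = SOURCES[sourceKey];
  if (!source) throw new Error(`unknown source: ${sourceKey}`);
  if (lastFetchedMs === null || lastFetchedMs === undefined) return true;
  return nowMs - lastFetchedMs >= source.refreshMs;
}
```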

Stack

Node.js 20+
Express 4
better-sqlite3 (WAL)
node-cron
helmet + compression
js-yaml, undici
Vanilla JS SPA + Chart.js (CDN)

Auth

None. Every endpoint — read or write — is public, including /api/refresh. This is by design: the operator wants to inspect any URL without typing a password.
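
Since /api/refresh is unauthenticated, the only throttle this README mentions for it is the per-IP limit listed under Endpoints (1 call / 60s / IP). A minimal sketch of such a limiter follows, assuming a fixed 60-second window per IP held in memory. The `createRateLimiter` name and its shape are hypothetical and not taken from the project's source.

```javascript
// Hypothetical in-memory limiter: at most one allowed call per key per window.
// Note: entries are never evicted here, so a long-running server would
// want periodic cleanup of old keys.
function createRateLimiter(windowMs = 60 * 1000) {
  const lastAllowed = new Map(); // key (e.g. client IP) -> time of last allowed call

  return function allow(key, nowMs = Date.now()) {
    const previous = lastAllowed.get(key);
    if (previous !== undefined && nowMs - previous < windowMs) {
      return false; // still inside the window: reject
    }
    lastAllowed.set(key, nowMs);
    return true;
  };
}
```

In an Express handler this could be called as `allow(req.ip)`, answering with HTTP 429 when it returns false. Behind a reverse proxy such as nginx, `req.ip` only reflects the real client address if Express's `trust proxy` setting is configured accordingly.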

Endpoints

All routes are under BASE_PATH=/agent-horizon.

GET / — homepage SPA
GET /model/:slug — per-model detail page
GET /about — methodology + data sources
GET /health — auth-free, 200 JSON
GET /api/leaderboard — latest snapshot, rows sorted by p50 desc
GET /api/model/:slug — meta + full snapshot history for one model
GET /api/doubling — doubling-time figures + scatter points
GET /api/movers?days=7 — diff against the snapshot ~N days ago
GET /api/snapshot/latest — latest snapshot row (debug)
GET /api/refresh — force a refresh (1 call / 60s / IP)
POST /api/decay — stateless reliability-decay math; body { slug?, steps, accuracy_override? }
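
The diff behind GET /api/movers can be pictured as a comparison between two snapshots. The sketch below is an illustration under assumptions: the row shape (`slug`, `p50`), the `diffSnapshots` name, and the reading of "≥5%" as a relative change in the 50% horizon are all hypothetical and not confirmed by the project's source.

```javascript
// Hypothetical diff between an older and a newer leaderboard snapshot.
// Each row is assumed to look like { slug: string, p50: number }.
function diffSnapshots(previousRows, currentRows, minRelativeChange = 0.05) {
  const previousBySlug = new Map(previousRows.map((row) => [row.slug, row]));
  const movers = [];
  const newEntries = [];

  for (const row of currentRows) {
    const before = previousBySlug.get(row.slug);
    if (!before) {
      newEntries.push(row.slug); // model absent from the older snapshot
      continue;
    }
    if (!(before.p50 > 0)) continue; // relative change is undefined
    const change = (row.p50 - before.p50) / before.p50;
    if (Math.abs(change) >= minRelativeChange) {
      movers.push({ slug: row.slug, from: before.p50, to: row.p50, change });
    }
  }
  return { movers, newEntries };
}
```

For example, a model whose p50 goes from 100 to 110 minutes (a 10% change) would be reported as a mover, while one going from 200 to 204 (2%) would not.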

Run locally

npm install
npm start
# open http://localhost:4720/agent-horizon/

PORT defaults to 4720; override it with the PORT environment variable if needed. No other configuration is required.
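
The port default can be sketched as follows. `resolvePort` is a hypothetical helper; how the project actually reads PORT is not shown in this README.

```javascript
// Hypothetical helper: use env.PORT when it is a valid TCP port, else 4720.
function resolvePort(env = process.env, fallback = 4720) {
  const parsed = Number.parseInt(env.PORT ?? '', 10);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return valid ? parsed : fallback;
}
```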

Production deployment

See DEPLOY_MANIFEST.json (consumed by the RNDLAB watcher: rsync → systemd → nginx → showcase POST → Playwright thumbnail).

License

MIT