agent-horizon
Live leaderboard + reliability-decay calculator for frontier AI agents, built on METR's Time Horizon 1.1 public benchmark feed.
Live: https://holyai.me/agent-horizon/
Source data: https://metr.org/time-horizons/
What it does
Three things METR's own page doesn't expose directly:
- Sortable, per-vendor leaderboard of all tracked frontier models: release date, average score, 50%- and 80%-time horizons with confidence intervals, plus a SOTA badge.
- Reliability-decay calculator: pick a model, pick `N` steps, and see `P(success) = accuracy^N` along with the step counts at which the agent crosses the 90% / 50% / 10% reliability cliffs.
- Movers feed: models whose METR horizon estimate moved by ≥5% over the past seven days, plus new entries and new SOTAs.
No mock data, no seed values, no random jitter — every number on the page is read from a freshly-fetched METR YAML.
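The reliability-decay arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the per-step accuracy of 0.99 is an assumed example value, and "crossing a cliff" is taken here to mean the smallest whole `N` at which `accuracy^N` is at or below the threshold.

```javascript
// Illustrative helpers; the names are assumptions, not the project's actual code.

// Probability that N independent steps all succeed at a fixed per-step accuracy.
function successProbability(accuracy, steps) {
  return Math.pow(accuracy, steps);
}

// Smallest whole step count N at which accuracy^N has fallen to `threshold` or below.
// Returns null when the per-step accuracy is not strictly between 0 and 1,
// because the curve then never decays (or is degenerate).
function cliffStep(accuracy, threshold) {
  if (!(accuracy > 0 && accuracy < 1)) return null;
  return Math.ceil(Math.log(threshold) / Math.log(accuracy));
}

const accuracy = 0.99; // assumed per-step accuracy, for illustration only
const cliffs = [0.9, 0.5, 0.1].map((threshold) => ({
  threshold,
  steps: cliffStep(accuracy, threshold),
}));

console.log(successProbability(accuracy, 100).toFixed(3)); // 0.366
console.log(cliffs); // steps: 11, 69, 230
```

Even at 99% per-step accuracy, the chance of finishing 100 steps without a failure is only about 37%, which is the effect the calculator is meant to make visible.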
Data sources
| Source | URL | Refresh |
|---|---|---|
| METR benchmark results | https://metr.org/assets/benchmark_results_1_1.yaml | every 6 hours |
| METR task results | https://metr.org/assets/task_results_1_1.yaml | every 24 hours |
Both are public, no auth, no API key.
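The two refresh cadences in the table can be expressed as plain interval bookkeeping. The sketch below is an assumption about how one might decide which feed is due; the project lists `node-cron` in its stack and may schedule refreshes differently. `SOURCES`, `isDue`, and `dueSources` are hypothetical names.

```javascript
// Hypothetical bookkeeping, not the project's scheduler.
const HOUR_MS = 60 * 60 * 1000;

// URLs and intervals are copied from the table above.
const SOURCES = [
  {
    name: 'benchmark_results',
    url: 'https://metr.org/assets/benchmark_results_1_1.yaml',
    everyMs: 6 * HOUR_MS,
  },
  {
    name: 'task_results',
    url: 'https://metr.org/assets/task_results_1_1.yaml',
    everyMs: 24 * HOUR_MS,
  },
];

// A source is due if it has never been fetched or its interval has elapsed.
function isDue(source, lastFetchedMs, nowMs) {
  if (lastFetchedMs == null) return true;
  return nowMs - lastFetchedMs >= source.everyMs;
}

// Names of all sources that should be re-fetched at time `nowMs`.
function dueSources(lastFetchedByName, nowMs) {
  return SOURCES
    .filter((source) => isDue(source, lastFetchedByName[source.name], nowMs))
    .map((source) => source.name);
}

// Seven hours after both feeds were fetched, only the 6-hour feed is due.
console.log(dueSources({ benchmark_results: 0, task_results: 0 }, 7 * HOUR_MS));
```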
Stack
- Node.js 20+
- Express 4
- `better-sqlite3` (WAL)
- `node-cron`
- `helmet` + `compression`
- `js-yaml`, `undici`
- Vanilla JS SPA + Chart.js (CDN)
Auth
None. Every endpoint — read or write — is public, including /api/refresh. This is by design: the operator wants to inspect any URL without typing a password.
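Because `/api/refresh` is public, the only guard the Endpoints section mentions is a limit of 1 call / 60s / IP. A minimal in-memory sketch of such a limiter is shown below; `createRateLimiter` and `allowRefresh` are hypothetical names, and the project's real limiter may work differently.

```javascript
// Hypothetical in-memory limiter: at most one accepted call per window per client IP.
// The Map grows with the number of distinct IPs; a real service would prune old entries.
function createRateLimiter(windowMs) {
  const lastAcceptedByIp = new Map();

  return function allow(ip, nowMs = Date.now()) {
    const last = lastAcceptedByIp.get(ip);
    // Reject if this IP already had a call accepted inside the current window.
    if (last !== undefined && nowMs - last < windowMs) return false;
    lastAcceptedByIp.set(ip, nowMs);
    return true;
  };
}

const allowRefresh = createRateLimiter(60_000);

console.log(allowRefresh('203.0.113.7', 0));      // true
console.log(allowRefresh('203.0.113.7', 30_000)); // false
console.log(allowRefresh('203.0.113.7', 60_000)); // true
```

In an Express handler, one could call `allowRefresh(req.ip)` and respond with status 429 when it returns `false`. Rejected calls do not extend the window in this sketch.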
Endpoints
All routes are under BASE_PATH=/agent-horizon.
- `GET /`: homepage SPA
- `GET /model/:slug`: per-model detail page
- `GET /about`: methodology + data sources
- `GET /health`: auth-free, 200 JSON
- `GET /api/leaderboard`: latest snapshot, rows sorted by p50 desc
- `GET /api/model/:slug`: meta + full snapshot history for one model
- `GET /api/doubling`: doubling-time figures + scatter points
- `GET /api/movers?days=7`: diff against the snapshot ~N days ago
- `GET /api/snapshot/latest`: latest snapshot row (debug)
- `GET /api/refresh`: force a refresh (1 call / 60s / IP)
- `POST /api/decay`: stateless reliability-decay math; body `{ slug?, steps, accuracy_override? }`
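The diff behind `/api/movers` can be illustrated as a comparison of two snapshots. The sketch below assumes each snapshot is a plain object mapping a model slug to its 50% time horizon and uses the ≥5% threshold from the feature description. The snapshot shape, the function name `diffSnapshots`, and the sample slugs are all hypothetical.

```javascript
// Hypothetical snapshot shape: { [slug]: p50Horizon }. Not the project's actual schema.
function diffSnapshots(previous, current, minRelativeChange = 0.05) {
  const movers = [];
  const newEntries = [];

  for (const [slug, now] of Object.entries(current)) {
    if (!Object.hasOwn(previous, slug)) {
      newEntries.push(slug); // model absent from the older snapshot
      continue;
    }
    const before = previous[slug];
    if (before === 0) continue; // relative change is undefined for a zero baseline
    const change = (now - before) / before;
    if (Math.abs(change) >= minRelativeChange) {
      movers.push({ slug, before, now, change });
    }
  }

  // Largest absolute moves first.
  movers.sort((a, b) => Math.abs(b.change) - Math.abs(a.change));
  return { movers, newEntries };
}

// Sample values are made up for illustration.
const result = diffSnapshots(
  { 'model-a': 100, 'model-b': 200 },
  { 'model-a': 110, 'model-b': 202, 'model-c': 50 },
);
console.log(result.movers.map((m) => m.slug)); // [ 'model-a' ]
console.log(result.newEntries);                // [ 'model-c' ]
```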
Run locally
npm install
npm start
# open http://localhost:4720/agent-horizon/
PORT defaults to 4720. Override via env if needed. No other configuration is required.
Production deployment
See DEPLOY_MANIFEST.json (consumed by the RNDLAB watcher: rsync → systemd → nginx → showcase POST → Playwright thumbnail).
License
MIT