
Agent Horizon

How many steps until your AI agent breaks? Live METR leaderboard + reliability decay calculator.


agent-horizon

Live leaderboard + reliability-decay calculator for frontier AI agents, built on METR's Time Horizon 1.1 public benchmark feed.

Live: https://holyai.me/agent-horizon/
Source data: https://metr.org/time-horizons/

What it does

Three things METR's own page doesn't expose directly:

  1. Sortable, per-vendor leaderboard of all tracked frontier models: release date, average score, 50%- and 80%-time horizons with confidence intervals, plus a SOTA badge.
  2. Reliability-decay calculator: pick a model, pick N steps, and see P(success) = accuracy^N (per-step accuracy compounded over N steps, which treats each step as independent) along with the step counts at which the agent crosses the 90% / 50% / 10% reliability cliffs.
  3. Movers feed: which models' METR horizon estimates have moved by ≥5% over the past seven days, plus new entries and new SOTAs.
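
The reliability-decay formula in item 2 can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names `successProbability` and `cliffStep` are hypothetical, and a "cliff" is assumed here to mean the first step count at which P(success) falls strictly below the threshold.

```typescript
// Probability of completing N consecutive steps, assuming every step
// succeeds independently with the same per-step accuracy.
function successProbability(accuracy: number, steps: number): number {
  if (!(accuracy >= 0 && accuracy <= 1)) {
    throw new RangeError("accuracy must be in [0, 1]");
  }
  if (!Number.isInteger(steps) || steps < 0) {
    throw new RangeError("steps must be a non-negative integer");
  }
  return Math.pow(accuracy, steps);
}

// Smallest step count N at which accuracy^N drops strictly below threshold.
// (Assumed definition of a "cliff"; the source does not spell it out.)
function cliffStep(accuracy: number, threshold: number): number {
  if (!(accuracy > 0 && accuracy <= 1)) {
    throw new RangeError("accuracy must be in (0, 1]");
  }
  if (!(threshold > 0 && threshold < 1)) {
    throw new RangeError("threshold must be in (0, 1)");
  }
  if (accuracy === 1) return Infinity; // a perfect agent never decays

  // accuracy^N < threshold  <=>  N > ln(threshold) / ln(accuracy),
  // because both logarithms are negative.
  let n = Math.max(1, Math.ceil(Math.log(threshold) / Math.log(accuracy)));

  // Nudge the estimate to absorb floating-point rounding at exact boundaries.
  while (Math.pow(accuracy, n) >= threshold) n += 1;
  while (n > 1 && Math.pow(accuracy, n - 1) < threshold) n -= 1;
  return n;
}

// Example: an agent with 99% per-step accuracy.
const cliffs = [0.9, 0.5, 0.1].map((t) => cliffStep(0.99, t));
console.log(cliffs); // [ 11, 69, 230 ]
```

Even at 99% per-step accuracy, success over a chain of steps drops below 50% after 69 steps under this independence assumption.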

No mock data, no seed values, no random jitter: every number on the page is read from a freshly fetched METR YAML file.
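
The movers rule from item 3 can be illustrated with a small comparison of two snapshots. This is a sketch under assumptions: the `HorizonSnapshot` shape, the field names, and the function `findMovers` are hypothetical, the 5% threshold is interpreted as a relative change against the older value, and new-SOTA detection is left out. The sample numbers in the usage lines are made up for illustration and are not METR data.

```typescript
// Hypothetical snapshot shape: one horizon estimate per model.
interface HorizonSnapshot {
  model: string;
  horizonMinutes: number;
}

interface Mover {
  model: string;
  kind: "new" | "moved";
  relativeChange: number | null; // null for models with no earlier estimate
}

// Compare the snapshot from seven days ago with the current one.
function findMovers(
  previous: HorizonSnapshot[],
  current: HorizonSnapshot[],
  minRelativeChange: number = 0.05, // assumed reading of ">= 5%"
): Mover[] {
  const before = new Map<string, number>();
  for (const snap of previous) before.set(snap.model, snap.horizonMinutes);

  const movers: Mover[] = [];
  for (const snap of current) {
    const old = before.get(snap.model);
    if (old === undefined) {
      movers.push({ model: snap.model, kind: "new", relativeChange: null });
      continue;
    }
    if (old <= 0) continue; // a relative change is undefined for a zero baseline
    const change = (snap.horizonMinutes - old) / old;
    if (Math.abs(change) >= minRelativeChange) {
      movers.push({ model: snap.model, kind: "moved", relativeChange: change });
    }
  }
  return movers;
}

// Illustrative input only (not real benchmark values).
const weekAgo = [
  { model: "model-a", horizonMinutes: 100 },
  { model: "model-b", horizonMinutes: 200 },
];
const today = [
  { model: "model-a", horizonMinutes: 104 }, // +4%: below threshold
  { model: "model-b", horizonMinutes: 215 }, // +7.5%: a mover
  { model: "model-c", horizonMinutes: 50 }, // no earlier estimate: new entry
];
console.log(findMovers(weekAgo, today).map((m) => `${m.model}:${m.kind}`));
```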

Data sources

| Source | URL | Refresh |
|---|---|---|
| METR benchmark results | https://metr.org/assets/benchmark_results_1_1.yaml | every 6 hours |
| METR task results | https://metr.org/assets/task_results_1_1.yaml | every 24 hours |

Both are public, no auth, no API key.
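
The refresh cadence in the table can be expressed as a simple staleness check. This is a sketch only: the `Source` type, the constant names, and `isStale` are hypothetical, and how the app actually schedules its fetches is not described here.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical description of one upstream feed.
interface Source {
  name: string;
  url: string;
  refreshMs: number;
}

// URLs and intervals are taken from the table above.
const BENCHMARK_RESULTS: Source = {
  name: "METR benchmark results",
  url: "https://metr.org/assets/benchmark_results_1_1.yaml",
  refreshMs: 6 * HOUR_MS,
};

const TASK_RESULTS: Source = {
  name: "METR task results",
  url: "https://metr.org/assets/task_results_1_1.yaml",
  refreshMs: 24 * HOUR_MS,
};

// A source needs refetching if it has never been fetched, or if its
// refresh interval has fully elapsed since the last successful fetch.
function isStale(source: Source, lastFetchedMs: number | null, nowMs: number): boolean {
  if (lastFetchedMs === null) return true;
  return nowMs - lastFetchedMs >= source.refreshMs;
}

// Seven hours after a fetch, only the 6-hour feed is due again.
const fetchedAt = 0;
const now = 7 * HOUR_MS;
console.log(isStale(BENCHMARK_RESULTS, fetchedAt, now)); // true
console.log(isStale(TASK_RESULTS, fetchedAt, now)); // false
```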

Stack

Auth

None. Every endpoint — read or write — is public, including /api/refresh. This is by design: the operator wants to inspect any URL without typing a password.

Endpoints

All routes are under BASE_PATH=/agent-horizon.
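
As an illustration of the base-path convention, a route can be resolved against `BASE_PATH` as shown below. The helper `withBasePath` is hypothetical and only demonstrates how a path such as `/api/refresh` would appear to a client. It does not reflect how the server mounts its routes internally.

```typescript
const BASE_PATH = "/agent-horizon";

// Join a route onto the base path, tolerating stray or missing slashes.
function withBasePath(route: string, basePath: string = BASE_PATH): string {
  const base = basePath.replace(/\/+$/, ""); // drop trailing slashes
  const path = route.replace(/^\/+/, ""); // drop leading slashes
  return path === "" ? `${base}/` : `${base}/${path}`;
}

console.log(withBasePath("/api/refresh")); // "/agent-horizon/api/refresh"
console.log(withBasePath("/")); // "/agent-horizon/"
```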

Run locally

```
npm install
npm start
# open http://localhost:4720/agent-horizon/
```

PORT defaults to 4720. Override via env if needed. No other configuration is required.
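
The port default can be sketched as a small resolver. This is an assumption-laden illustration: `resolvePort` is a hypothetical name, and rejecting invalid values with an error is a design choice made for this example, since the app's handling of a malformed `PORT` is not documented here.

```typescript
const DEFAULT_PORT = 4720;

// Resolve the listening port from an environment-like object.
// Falls back to the default when PORT is unset or blank.
function resolvePort(env: Record<string, string | undefined>): number {
  const raw = env["PORT"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_PORT;

  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid PORT value: ${raw}`);
  }
  return port;
}

console.log(resolvePort({})); // 4720
console.log(resolvePort({ PORT: "8080" })); // 8080
```

In a Node process this would be called as `resolvePort(process.env)`, with the override supplied on the command line, for example `PORT=8080 npm start`.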

Production deployment

See DEPLOY_MANIFEST.json (consumed by the RNDLAB watcher: rsync → systemd → nginx → showcase POST → Playwright thumbnail).

License

MIT