bench-spend

Live, public dashboard plotting score vs. cost for every major AI benchmark — find the cheapest model that hits your bar.

Live: https://holyai.me/bench-spend/
Port: 4892
Base path: /bench-spend
Auth: none on any endpoint (read or write)

---

Why this exists

The May 2026 frontier-AI landscape has fragmented into a dozen incomparable
leaderboards. eval-leaderboard, hallu-board, judge-floor, agent-horizon,
contam-gap, harness-arena all rank models on accuracy. token-floor,
reasoning-tax, cache-arena, voice-cost-clock, provider-drift all rank
models on price. Nobody plots both on the same chart.

crossover-clock answered "has AI crossed the human baseline?". bench-spend
answers the next question — "yes, but what does it cost to get there?"

For every major benchmark, bench-spend shows a real-data score-vs-cost
scatter, highlights the Pareto frontier (cheapest model at every score level),
and lets you slide a "min score" bar to find the cheapest model that still
clears it.
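A minimal sketch of how a frontier like this can be computed, assuming each row carries a per-instance cost and a score. The row shape ({ model, cost, score }) and the function name are illustrative and not taken from this repo; the server's own implementation may differ.

```javascript
// Hypothetical row shape: { model, cost, score }
// cost = USD per instance, score = fraction in [0, 1].
function paretoFrontier(rows) {
  // Sort by cost ascending; on a cost tie, put the higher score first.
  const sorted = [...rows].sort((a, b) => a.cost - b.cost || b.score - a.score);
  const frontier = [];
  let bestScore = -Infinity;
  for (const row of sorted) {
    // A row is on the frontier only if it outscores every cheaper row.
    if (row.score > bestScore) {
      frontier.push(row);
      bestScore = row.score;
    }
  }
  return frontier;
}

// Made-up illustrative rows, not real benchmark data.
const rows = [
  { model: "a", cost: 0.10, score: 0.50 },
  { model: "b", cost: 0.30, score: 0.45 }, // dominated by "a"
  { model: "c", cost: 0.50, score: 0.70 },
  { model: "d", cost: 2.00, score: 0.72 },
  { model: "e", cost: 2.50, score: 0.72 }, // dominated by "d"
];

console.log(paretoFrontier(rows).map((r) => r.model)); // [ 'a', 'c', 'd' ]
```

Sorting once and sweeping keeps this at O(n log n); every row left off the frontier belongs to the dominated set.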

What it shows

A live scatter chart per benchmark: x = cost per instance (log scale),
y = score. Points are colored by vendor. The Pareto frontier is drawn as a
stepped line through the cheapest model at every score tier.
A "cheapest at min score" widget: drag a slider; see the cheapest
Pareto-optimal model that still hits your bar. Optionally restrict to
open-weights models.
A table of every (model, agent) row, sortable by cost / score /
score-per-dollar, with the Pareto-optimal rows highlighted.
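The "cheapest at min score" lookup can be sketched as a filter followed by a minimum. The row shape ({ model, cost, score, openWeights }) and the function name are assumptions made for illustration, not names from this repo.

```javascript
// Hypothetical row shape: { model, cost, score, openWeights }.
function cheapestAtMinScore(rows, minScore, { openWeightsOnly = false } = {}) {
  const eligible = rows.filter(
    (r) => r.score >= minScore && (!openWeightsOnly || r.openWeights)
  );
  if (eligible.length === 0) return null; // nothing clears the bar
  // Lowest cost wins; on a cost tie, prefer the higher score.
  return eligible.reduce((best, r) =>
    r.cost < best.cost || (r.cost === best.cost && r.score > best.score) ? r : best
  );
}

// Made-up illustrative rows, not real benchmark data.
const rows = [
  { model: "closed-small", cost: 0.20, score: 0.65, openWeights: false },
  { model: "open-mid", cost: 0.40, score: 0.71, openWeights: true },
  { model: "closed-mid", cost: 0.35, score: 0.74, openWeights: false },
  { model: "closed-large", cost: 1.80, score: 0.80, openWeights: false },
];

console.log(cheapestAtMinScore(rows, 0.7).model); // closed-mid
console.log(cheapestAtMinScore(rows, 0.7, { openWeightsOnly: true }).model); // open-mid
console.log(cheapestAtMinScore(rows, 0.9)); // null
```

The cheapest row that clears the bar is Pareto-optimal within the filtered set: any row that dominated it would also clear the bar at a lower cost, or at the same cost with a higher score.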

Data sources (real, public, no mocks)

| Source | URL | Used for | Refresh |
|---|---|---|---|
| swebench.com | https://www.swebench.com/ | SWE-bench Verified scores + native per-instance cost | every 6 h |
| artificialanalysis.ai | https://artificialanalysis.ai/models | GPQA, MMLU-Pro, AIME, MATH-500, HLE scores | hourly |
| OpenRouter API | https://openrouter.ai/api/v1/models | per-model input / output token prices | hourly |

For benchmarks where the source doesn't publish per-instance cost, we estimate
it as input_tokens × prompt_per_token + output_tokens × completion_per_token,
using the per-benchmark token budgets in lib/cost_estimator.js. Each token
budget cites its source. For SWE-bench, we take the native instance_cost
field straight from the leaderboard JSON — no estimation.
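As a rough sketch of that estimate: multiply each side of the token budget by its per-token price and add the two. The budget numbers, the prices, and the function name below are placeholders invented for illustration; the real budgets live in lib/cost_estimator.js. The pricing object assumes per-token USD prices that may arrive as strings, so the sketch coerces them to numbers.

```javascript
// Placeholder token budget for one benchmark instance (NOT a real budget).
const budget = { inputTokens: 1500, outputTokens: 4000 };

// Placeholder per-token prices in USD (assumed shape, made-up values).
const pricing = { prompt: "0.000003", completion: "0.000015" };

function estimateInstanceCost(budget, pricing) {
  const promptPerToken = Number(pricing.prompt);
  const completionPerToken = Number(pricing.completion);
  if (!Number.isFinite(promptPerToken) || !Number.isFinite(completionPerToken)) {
    return null; // no usable price: return no estimate rather than a guess
  }
  // input_tokens × prompt_per_token + output_tokens × completion_per_token
  return (
    budget.inputTokens * promptPerToken +
    budget.outputTokens * completionPerToken
  );
}

const cost = estimateInstanceCost(budget, pricing);
console.log(cost === null ? "no data yet" : `$${cost.toFixed(4)} per instance`);
```

With these placeholder numbers the two terms are 0.0045 and 0.06, so the estimate is about $0.0645 per instance.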

There is no seed data, no Math.random(), no hardcoded scores or prices.
The models, scores, and prices tables start empty and only get rows from
the live fetchers. If a fetcher fails, the UI surfaces "no data yet" for that
benchmark rather than a fabricated number. The exact fetch log is exposed at
/bench-spend/api/fetch-log.

API

All endpoints public, no auth, JSON.

| Method | Path | Purpose |
|---|---|---|
| GET | /bench-spend/health | liveness + last fetch per source |
| GET | /bench-spend/api/benchmarks | benchmark catalog + per-benchmark row counts |
| GET | /bench-spend/api/benchmark/:slug | every (model, score, cost) row for the benchmark |
| GET | /bench-spend/api/pareto/:slug | precomputed Pareto frontier + dominated set |
| GET | /bench-spend/api/cheapest?slug=…&min_score=0.7&open_weights=true | cheapest model meeting min_score |
| GET | /bench-spend/api/budgets | token-budget defaults per benchmark with citations |
| GET | /bench-spend/api/fetch-log | last 100 fetch attempts |
| GET | /bench-spend/api/refresh | trigger a manual refresh (rate-limited per IP) |
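A hedged example of calling the cheapest endpoint from Node.js 20+ (native fetch). The helper names and the slug value are hypothetical, and the response shape is not documented in this README, so the sketch returns the parsed JSON unchanged. It also assumes that omitting open_weights means no restriction.

```javascript
// Public deployment named at the top of this README.
const BASE = "https://holyai.me/bench-spend";

// Build the query string for /api/cheapest.
function cheapestUrl(slug, minScore, openWeights = false) {
  const params = new URLSearchParams({ slug, min_score: String(minScore) });
  // Assumption: leaving open_weights out means "do not restrict".
  if (openWeights) params.set("open_weights", "true");
  return `${BASE}/api/cheapest?${params}`;
}

async function fetchCheapest(slug, minScore, openWeights = false) {
  const res = await fetch(cheapestUrl(slug, minScore, openWeights));
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json(); // shape not specified here; inspect before relying on fields
}

// "swe-bench-verified" is a guessed slug; list real ones via /api/benchmarks.
console.log(cheapestUrl("swe-bench-verified", 0.7, true));
```

Valid slugs come from /bench-spend/api/benchmarks, so a client would normally fetch the catalog first and pass one of its slugs to fetchCheapest.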

Local dev

cp .env.example .env
npm install
node server.js
# open http://localhost:4892/bench-spend/

better-sqlite3 is a native module. On macOS arm64 and common Linux arm64/x64
targets, npm fetches a prebuilt binary, so the package still works in sandboxed
environments where compiling from source can fail (we don't rebuild it).

Stack

Node.js 20+ / Express 4
better-sqlite3 (WAL mode), node-cron
helmet, compression
cheerio + native fetch
vanilla JS SPA, Chart.js via CDN, dark theme

Layout

See CLAUDE.md for the file map and extension points.

License

This repository is part of the Holy AI / Cowork R&D gallery. Code is internal;
the running dashboard is publicly readable at the URL above.