
Bench-Spend

Score-vs-cost Pareto for every major AI benchmark, live.

Live, public dashboard plotting score vs. cost for every major AI benchmark — find the cheapest model that hits your bar.

Live: https://holyai.me/bench-spend/
Port: 4892
Base path: /bench-spend
Auth: none on any endpoint (read or write)

---

Why this exists

The May 2026 frontier-AI landscape has fragmented into a dozen incomparable
leaderboards. eval-leaderboard, hallu-board, judge-floor, agent-horizon,
contam-gap, harness-arena all rank models on accuracy. token-floor,
reasoning-tax, cache-arena, voice-cost-clock, provider-drift all rank
models on price. None of them plots both on the same chart.

crossover-clock answered "has AI crossed the human baseline?". bench-spend
answers the next question — "yes, but what does it cost to get there?"

For every major benchmark, bench-spend shows a real-data score-vs-cost
scatter, highlights the Pareto frontier (cheapest model at every score level),
and lets you slide a "min score" bar to find the cheapest model that still
clears it.
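The frontier and the "min score" lookup described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the row shape `{ model, score, cost }` and the function names `paretoFrontier` and `cheapestAbove` are assumptions, and the sample numbers are invented.

```javascript
// Hypothetical row shape: { model: string, score: number, cost: number }.
// A row is on the frontier when no other row is at least as cheap and at
// least as accurate while being strictly better on one of the two.
function paretoFrontier(rows) {
  // Sort by cost ascending; break cost ties by higher score first.
  const sorted = [...rows].sort((a, b) => a.cost - b.cost || b.score - a.score);
  const frontier = [];
  let bestScore = -Infinity;
  for (const row of sorted) {
    // Keep a row only if it beats the score of every cheaper row.
    if (row.score > bestScore) {
      frontier.push(row);
      bestScore = row.score;
    }
  }
  return frontier;
}

// Cheapest row whose score clears the bar, or null if none does.
function cheapestAbove(rows, minScore) {
  const eligible = rows.filter((r) => r.score >= minScore);
  if (eligible.length === 0) return null;
  return eligible.reduce((best, r) => (r.cost < best.cost ? r : best));
}

// Made-up numbers for illustration only, not real benchmark data.
const sample = [
  { model: "a", score: 0.6, cost: 0.1 },
  { model: "b", score: 0.72, cost: 0.4 },
  { model: "c", score: 0.7, cost: 0.9 }, // dominated by "b"
  { model: "d", score: 0.85, cost: 2.0 },
];

console.log(paretoFrontier(sample).map((r) => r.model)); // [ 'a', 'b', 'd' ]
console.log(cheapestAbove(sample, 0.7).model); // b
```

Sorting by cost first means a single pass is enough: any row that fails to raise the best score seen so far is dominated by a cheaper row.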

What it shows

Data sources (real, public, no mocks)

| Source | URL | Used for | Refresh |
|---|---|---|---|
| swebench.com | https://www.swebench.com/ | SWE-bench Verified scores + native per-instance cost | every 6 h |
| artificialanalysis.ai | https://artificialanalysis.ai/models | GPQA, MMLU-Pro, AIME, MATH-500, HLE scores | hourly |
| OpenRouter API | https://openrouter.ai/api/v1/models | per-model input / output token prices | hourly |

For benchmarks where the source doesn't publish per-instance cost, we estimate
it as input_tokens × prompt_per_token + output_tokens × completion_per_token,
using the per-benchmark token budgets in lib/cost_estimator.js. Each token
budget cites its source. For SWE-bench, we take the native instance_cost
field straight from the leaderboard JSON — no estimation.
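As a rough sketch of that estimate: the helper below is hypothetical and only mirrors the formula in the text. The actual logic lives in lib/cost_estimator.js and may differ. The object shapes and the example budget and prices are assumptions; prices are coerced with `Number` in case the upstream API delivers them as strings.

```javascript
// Hypothetical helper for illustration; not the code in lib/cost_estimator.js.
// budget: { input_tokens, output_tokens } per benchmark instance (assumed shape)
// price:  { prompt_per_token, completion_per_token } in USD per token (assumed shape)
function toNumber(value) {
  // Treat null, undefined and empty strings as missing instead of as 0.
  if (value === null || value === undefined || value === "") return NaN;
  return Number(value);
}

function estimateInstanceCost(budget, price) {
  const inputTokens = toNumber(budget.input_tokens);
  const outputTokens = toNumber(budget.output_tokens);
  const promptPrice = toNumber(price.prompt_per_token);
  const completionPrice = toNumber(price.completion_per_token);
  const all = [inputTokens, outputTokens, promptPrice, completionPrice];
  // Return null rather than a fabricated number when any input is missing.
  if (all.some((v) => !Number.isFinite(v) || v < 0)) return null;
  return inputTokens * promptPrice + outputTokens * completionPrice;
}

// Made-up budget and prices, for illustration only.
const cost = estimateInstanceCost(
  { input_tokens: 2000, output_tokens: 500 },
  { prompt_per_token: "0.000003", completion_per_token: "0.000015" }
);
console.log(cost); // roughly 0.0135 USD per instance
```

Returning `null` for incomplete inputs matches the "no data yet" behaviour described in this section: a missing price or budget yields no estimate instead of a guessed one.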

There is no seed data, no Math.random(), no hardcoded scores or prices.
The models, scores, and prices tables start empty and only get rows from
the live fetchers. If a fetcher fails, the UI surfaces "no data yet" for that
benchmark rather than a fabricated number. The exact fetch log is exposed at
/bench-spend/api/fetch-log.

API

All endpoints are public, require no auth, and return JSON.

| Method | Path | Purpose |
|---|---|---|
| GET | /bench-spend/health | liveness + last fetch per source |
| GET | /bench-spend/api/benchmarks | benchmark catalog + per-benchmark row counts |
| GET | /bench-spend/api/benchmark/:slug | every (model, score, cost) row for the benchmark |
| GET | /bench-spend/api/pareto/:slug | precomputed Pareto frontier + dominated set |
| GET | /bench-spend/api/cheapest?slug=…&min_score=0.7&open_weights=true | cheapest model meeting min_score |
| GET | /bench-spend/api/budgets | token-budget defaults per benchmark with citations |
| GET | /bench-spend/api/fetch-log | last 100 fetch attempts |
| GET | /bench-spend/api/refresh | trigger a manual refresh (rate-limited per IP) |
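A minimal client sketch for the `cheapest` endpoint, assuming Node 18+ (for the global `fetch`). The query parameter names come from the table above; the helper names and the slug `swe-bench-verified` are hypothetical (real slugs should be taken from /bench-spend/api/benchmarks), and the response shape is not documented here, so the JSON is returned as-is.

```javascript
const BASE = "https://holyai.me/bench-spend";

// Build the query URL for /api/cheapest. open_weights is optional.
function cheapestUrl(slug, minScore, openWeights) {
  const url = new URL(`${BASE}/api/cheapest`);
  url.searchParams.set("slug", slug);
  url.searchParams.set("min_score", String(minScore));
  if (openWeights !== undefined) {
    url.searchParams.set("open_weights", String(openWeights));
  }
  return url.toString();
}

// Fetch the cheapest model for a benchmark; throws on a non-2xx status.
async function fetchCheapest(slug, minScore, openWeights) {
  const res = await fetch(cheapestUrl(slug, minScore, openWeights));
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json(); // response shape is not specified in this README
}

// "swe-bench-verified" is a placeholder slug, not a confirmed one.
console.log(cheapestUrl("swe-bench-verified", 0.7, true));
```

Calling `fetchCheapest("swe-bench-verified", 0.7)` without the third argument omits `open_weights` from the query entirely.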

Local dev

cp .env.example .env
npm install
node server.js
# open http://localhost:4892/bench-spend/

better-sqlite3 is a native module. On macOS arm64 and on common Linux
arm64/x64 platforms, npm fetches a prebuilt binary. We don't rebuild it from
source, so the package still works in sandboxed environments where a source
compile can fail.

Stack

Layout

See CLAUDE.md for the file map and extension points.

License

This repository is part of the Holy AI / Cowork R&D gallery. Code is internal;
the running dashboard is publicly readable at the URL above.