bench-picker

Run 5 benchmarks instead of 57.

Picks a small subset of LLM benchmarks whose scores predict the rest at a
chosen held-out R² (e.g. ≥ 0.90), using submodular greedy mutual-information
selection on real, live leaderboard data.

The core insight, from Krause/Singh/Guestrin's 2008 sensor-placement paper
applied to benchmark suites: if benchmark scores are highly correlated across
models (and they are), most of an N-task suite is informationally redundant.
A small, carefully chosen subset will predict the rest.

bench-picker turns that idea into a one-click tool: pick a benchmark family
(HF Open LLM Leaderboard, MTEB, LiveBench), pick k, get the exact tasks to
run plus a shareable badge showing GPU-hours saved per checkpoint.

What it does

Three real leaderboards, fetched at startup + weekly cron, written to
SQLite — never any mock or seeded data.
Submodular MI greedy column selection on the model × task score matrix
(algo/greedy.js).
Held-out validation via leave-one-model-out ridge regression
(algo/regression.js). Returns mean R² + per-task R².
Target-R² mode: ask for the smallest k that hits a given R² threshold.
Shareable badge SVG for model card READMEs — the viral artifact.
CLI + Web UI.

Data sources

All scores pulled at runtime from public endpoints. Nothing hardcoded.

| Family | URL | Refresh |
|---|---|---|
| HF Open LLM Leaderboard v2 | https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard/contents (paginated) | weekly Mon 03:00 UTC, also on-startup if missing/>14d stale |
| MTEB results | https://huggingface.co/datasets/mteb/results (via paths.json + per-task result JSONs) | weekly |
| LiveBench | https://livebench.ai/table_<latest-date>.csv (latest dated CSV, auto-discovered by HEAD probe) | weekly |
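
The paginated Open LLM fetch could look roughly like the sketch below. It is an illustration, not the project's fetcher: `fetchAllRows` and `httpJson` are hypothetical names, and the `config` and `split` values (`default`, `train`) are assumptions that should be checked against the dataset. The fetcher is injected so the paging logic can be exercised offline.

```javascript
const ROWS_URL = "https://datasets-server.huggingface.co/rows";
const PAGE_SIZE = 100; // the rows endpoint serves at most 100 rows per request

// Pages through a dataset until num_rows_total rows have been requested.
// `fetchJson(url)` must resolve to the parsed JSON body of the response.
async function fetchAllRows(dataset, fetchJson, { config = "default", split = "train" } = {}) {
  const rows = [];
  let total = Infinity; // unknown until the first page arrives
  for (let offset = 0; offset < total; offset += PAGE_SIZE) {
    const params = new URLSearchParams({
      dataset,
      config, // assumed config name
      split, // assumed split name
      offset: String(offset),
      length: String(PAGE_SIZE),
    });
    const page = await fetchJson(`${ROWS_URL}?${params}`);
    total = page.num_rows_total;
    if (page.rows.length === 0) break; // defensive stop on an empty page
    rows.push(...page.rows.map((r) => r.row));
  }
  return rows;
}

// Default fetcher built on the global fetch available in Node 18+.
const httpJson = async (url) => {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.json();
};
```

Calling `fetchAllRows("open-llm-leaderboard/contents", httpJson)` would then resolve to one plain object per leaderboard row.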

Per-family GPU-cost references (from public benchmark docs, not your hardware —
the UI labels them "reference cost, swap your own"):

HF Open LLM v2: ~18 GPU-hours per full run (IFEval + BBH + MATH + GPQA + MUSR + MMLU-Pro on a 13B model)
MTEB English: ~24 GPU-hours per full run
LiveBench monthly: ~6 GPU-hours per full run

Dollar figures assume $2.50 per A100-hour (a typical on-demand price).
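
As a worked example of how a reference cost becomes a savings figure, the sketch below assumes every task in a family costs an equal share of the full run. That is a simplification and may not match how bench-picker apportions cost; `estimateSavings` is a hypothetical helper.

```javascript
// Reference price from this README; swap in your own.
const DOLLARS_PER_GPU_HOUR = 2.5;

// ASSUMPTION: each task costs fullRunHours / taskCount GPU-hours.
function estimateSavings(fullRunHours, taskCount, k) {
  if (k < 0 || k > taskCount) throw new RangeError("k must be in 0..taskCount");
  const hoursPerTask = fullRunHours / taskCount;
  const gpuHoursSaved = hoursPerTask * (taskCount - k);
  return { gpuHoursSaved, dollarsSaved: gpuHoursSaved * DOLLARS_PER_GPU_HOUR };
}

// HF Open LLM v2: ~18 GPU-hours across 6 tasks, running only 2 of them.
console.log(estimateSavings(18, 6, 2)); // { gpuHoursSaved: 12, dollarsSaved: 30 }
```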

API endpoints (all under `/bench-picker`)

| Method | Path | What |
|---|---|---|
| GET | /health | {ok:true} (no auth, no DB) |
| GET | /api/families | list of {slug, name, model_count, task_count, last_fetched_at, status, full_run_hours} |
| GET | /api/family/:slug | family detail + tasks + sample models |
| POST | /api/pick | body {family, k} or {family, target_r2} → {chosen[], mean_r2, per_task_r2{}, gpu_hours_saved, dollar_saved, scatter} |
| POST | /api/share | persist a pick, returns {id, url} |
| GET | /api/share/:id | retrieve a stored pick |
| GET | /share/:id | HTML share page (og-image points at badge SVG) |
| GET | /badge/:id.svg | inline SVG badge for model card READMEs |
| GET | /api/stats | header strip stats |
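
A minimal client for `POST /api/pick` might look like the sketch below. It assumes the server from "Run locally" on port 4875, a JSON request body, and that exactly one of `k` or `target_r2` is sent; `buildPickRequest` and `pick` are illustrative names, not part of the project.

```javascript
// Assumed base URL: local server plus the /bench-picker prefix.
const BASE_URL = "http://localhost:4875/bench-picker";

// Builds the request for POST /api/pick. Pass exactly one of k or targetR2.
function buildPickRequest(family, { k, targetR2 } = {}) {
  if ((k === undefined) === (targetR2 === undefined)) {
    throw new Error("provide exactly one of k or targetR2");
  }
  const body = k !== undefined ? { family, k } : { family, target_r2: targetR2 };
  return {
    url: `${BASE_URL}/api/pick`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" }, // assumed content type
      body: JSON.stringify(body),
    },
  };
}

// Sends the request with the global fetch available in Node 18+.
async function pick(family, opts) {
  const { url, options } = buildPickRequest(family, opts);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`pick failed: HTTP ${res.status}`);
  // Expected shape per the table above:
  // { chosen, mean_r2, per_task_r2, gpu_hours_saved, dollar_saved, scatter }
  return res.json();
}
```

With the server running, `pick("open-llm-v2", { k: 2 })` resolves to the response object, and `pick("mteb", { targetR2: 0.9 })` exercises target-R² mode.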

Algorithm

algo/greedy.js: at each step, pick the column c ∉ S that maximizes
0.5 · log(var(c | S) / var(c | V\(S∪{c}))), where S is the chosen set and V is
all columns. Conditional variances come from the column-centered covariance
matrix via a Cholesky-based Gaussian conditional. This is the greedy rule from
Krause/Singh/Guestrin (2008); its near-(1 − 1/e) approximation guarantee relies
on mutual information being submodular and approximately monotone for small k.
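
The selection rule above can be sketched in plain JavaScript as follows. This is an unoptimized illustration, not the contents of algo/greedy.js: the function names are hypothetical, every conditional variance is recomputed from scratch, and the diagonal jitter is an assumption added here to keep the Cholesky factorization stable on near-collinear columns.

```javascript
// Column-centered sample covariance of an M x N score matrix (rows = models).
function covariance(X) {
  const m = X.length, n = X[0].length;
  const mean = Array.from({ length: n }, (_, j) => X.reduce((s, r) => s + r[j], 0) / m);
  const C = Array.from({ length: n }, () => new Array(n).fill(0));
  for (const row of X)
    for (let i = 0; i < n; i++)
      for (let j = 0; j < n; j++)
        C[i][j] += ((row[i] - mean[i]) * (row[j] - mean[j])) / (m - 1);
  return C;
}

// var(c | A) = C[c][c] - C[c,A] * inv(C[A,A]) * C[A,c], via a Cholesky solve.
function conditionalVariance(C, c, A, jitter = 1e-9) {
  const k = A.length;
  if (k === 0) return C[c][c];
  // Lower-triangular Cholesky factor L of C[A,A] (plus jitter on the diagonal).
  const L = Array.from({ length: k }, () => new Array(k).fill(0));
  for (let i = 0; i < k; i++) {
    for (let j = 0; j <= i; j++) {
      let s = C[A[i]][A[j]] + (i === j ? jitter : 0);
      for (let p = 0; p < j; p++) s -= L[i][p] * L[j][p];
      L[i][j] = i === j ? Math.sqrt(Math.max(s, jitter)) : s / L[j][j];
    }
  }
  // Forward-substitute L y = C[A,c]; the quadratic form then equals y . y
  const y = new Array(k).fill(0);
  let quad = 0;
  for (let i = 0; i < k; i++) {
    let s = C[A[i]][c];
    for (let p = 0; p < i; p++) s -= L[i][p] * y[p];
    y[i] = s / L[i][i];
    quad += y[i] * y[i];
  }
  return Math.max(C[c][c] - quad, jitter);
}

// Greedy selection: at each step add the column with the largest MI gain.
function greedyMI(X, k) {
  const C = covariance(X);
  const all = C.map((_, j) => j);
  const S = [];
  while (S.length < k) {
    let best = -1, bestGain = -Infinity;
    for (const c of all) {
      if (S.includes(c)) continue;
      const rest = all.filter((j) => j !== c && !S.includes(j));
      const gain = 0.5 * Math.log(conditionalVariance(C, c, S) / conditionalVariance(C, c, rest));
      if (gain > bestGain) { bestGain = gain; best = c; }
    }
    S.push(best);
  }
  return S;
}
```

On a matrix with two groups of near-duplicate columns, `greedyMI(X, 2)` picks one column from each group: once a column is chosen, its near-duplicate has little conditional variance left, so its gain turns negative.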

algo/regression.js: for every held-out task t ∉ S, fits ridge regression
(λ=1e-3) on the selected columns and predicts t. Uses leave-one-model-out CV
when the number of models M is at most 80, and 5-fold CV otherwise for speed.
Per-task R² values are averaged into a single mean R².
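
The leave-one-model-out branch of that validation can be sketched as below. This is an illustration under assumptions rather than the code in algo/regression.js: the helper names are hypothetical, the intercept is handled by centering and left unpenalized, and the 5-fold branch is omitted.

```javascript
// Solves the square system A x = b by Gaussian elimination with partial pivoting.
function solve(A, b) {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]);
  for (let col = 0; col < n; col++) {
    let piv = col;
    for (let r = col + 1; r < n; r++) if (Math.abs(M[r][col]) > Math.abs(M[piv][col])) piv = r;
    [M[col], M[piv]] = [M[piv], M[col]];
    for (let r = col + 1; r < n; r++) {
      const f = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= f * M[col][c];
    }
  }
  const x = new Array(n).fill(0);
  for (let i = n - 1; i >= 0; i--) {
    let s = M[i][n];
    for (let j = i + 1; j < n; j++) s -= M[i][j] * x[j];
    x[i] = s / M[i][i];
  }
  return x;
}

// Fits ridge on centered features and target; returns a predict(features) function.
function fitRidge(F, y, lambda = 1e-3) {
  const m = F.length, k = F[0].length;
  const fMean = Array.from({ length: k }, (_, j) => F.reduce((s, r) => s + r[j], 0) / m);
  const yMean = y.reduce((s, v) => s + v, 0) / m;
  // Normal equations: (F'F + lambda I) beta = F'y on centered data.
  const G = Array.from({ length: k }, (_, i) => Array.from({ length: k }, (_, j) => (i === j ? lambda : 0)));
  const g = new Array(k).fill(0);
  F.forEach((row, r) => {
    for (let i = 0; i < k; i++) {
      g[i] += (row[i] - fMean[i]) * (y[r] - yMean);
      for (let j = 0; j < k; j++) G[i][j] += (row[i] - fMean[i]) * (row[j] - fMean[j]);
    }
  });
  const beta = solve(G, g);
  return (f) => yMean + f.reduce((s, v, j) => s + beta[j] * (v - fMean[j]), 0);
}

// Leave-one-model-out R^2 for predicting column t from the selected columns S.
function looR2(X, S, t) {
  const y = X.map((r) => r[t]);
  const yMean = y.reduce((s, v) => s + v, 0) / y.length;
  let sse = 0, sst = 0;
  X.forEach((row, i) => {
    const train = X.filter((_, r) => r !== i);
    const predict = fitRidge(train.map((r) => S.map((j) => r[j])), train.map((r) => r[t]));
    sse += (row[t] - predict(S.map((j) => row[j]))) ** 2;
    sst += (row[t] - yMean) ** 2;
  });
  return 1 - sse / sst;
}

// Mean R^2 over every task that is not in S.
function meanHeldOutR2(X, S) {
  const held = X[0].map((_, j) => j).filter((j) => !S.includes(j));
  return held.reduce((s, t) => s + looR2(X, S, t), 0) / held.length;
}
```

Held-out R² can be negative: a task that the selected columns do not explain is predicted worse than by its own mean, which pulls the average down.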

target_r2 mode linearly scans k = 1..20 and returns the smallest k whose
mean held-out R² meets the threshold; if no k ≤ 20 reaches it, it returns the
k with the highest mean R².
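
The scan can be sketched as a small driver. Here `evaluate(k)` is a hypothetical stand-in for "run greedy selection with k tasks and return the mean held-out R²"; it is not a bench-picker function, and the `reached` flag is an addition of this sketch.

```javascript
// Returns the smallest k in 1..maxK whose mean R^2 meets the target,
// or the best-scoring k if none does.
function smallestKForTarget(evaluate, targetR2, maxK = 20) {
  let best = { k: 0, meanR2: -Infinity, reached: false };
  for (let k = 1; k <= maxK; k++) {
    const meanR2 = evaluate(k);
    if (meanR2 >= targetR2) return { k, meanR2, reached: true };
    if (meanR2 > best.meanR2) best = { k, meanR2, reached: false };
  }
  return best;
}

// Toy curve with diminishing returns: R^2 = 1 - 0.5^k.
const toyCurve = (k) => 1 - Math.pow(0.5, k);
console.log(smallestKForTarget(toyCurve, 0.9)); // { k: 4, meanR2: 0.9375, reached: true }
```

A linear scan is enough here because the search space is at most 20 values of k, and it does not assume that mean R² rises monotonically with k.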

CLI

node cli.js pick --family=open-llm-v2 --k=2
node cli.js pick --family=mteb --target-r2=0.9
node cli.js refresh --family=livebench
node cli.js list

Run locally

npm install
node server.js
# open http://localhost:4875/bench-picker/

On first boot the server fetches all 3 leaderboards asynchronously. Open LLM
takes ~10 seconds, LiveBench ~45 seconds (it auto-discovers the latest snapshot
date by HEAD probing), and MTEB ~2 minutes (parallel JSON fetches across
40 models × 24 tasks). A yellow banner shows on the page while sources are
still loading or if any are stale (>14 days).
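
The staleness rule behind the banner can be expressed as a small predicate. This sketch assumes `last_fetched_at` arrives as an ISO-8601 string (or null when a family has never been fetched); `needsRefresh` is a hypothetical name.

```javascript
const STALE_AFTER_MS = 14 * 24 * 60 * 60 * 1000; // 14 days

// True when a family has never been fetched, has an unparseable timestamp,
// or was last fetched more than 14 days before `now`.
function needsRefresh(lastFetchedAt, now = new Date()) {
  if (!lastFetchedAt) return true;
  const fetched = new Date(lastFetchedAt);
  if (Number.isNaN(fetched.getTime())) return true;
  return now.getTime() - fetched.getTime() > STALE_AFTER_MS;
}
```

Passing `now` explicitly keeps the check deterministic in tests.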

References

Krause, Singh, Guestrin (2008). *Near-Optimal Sensor Placements in Gaussian
Processes: Theory, Efficient Algorithms and Empirical Studies.* JMLR 9.
Smola, Alex (2024). "You don't need all the LLM benchmarks" — sparked the
idea that benchmark suites are mostly informationally redundant.
HF Open LLM Leaderboard v2: <https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard>
MTEB: <https://github.com/embeddings-benchmark/mteb>
LiveBench: <https://livebench.ai/>

Out of scope

Running benchmarks for you (no inference, no GPU)
Uploading custom benchmark CSVs (v2 maybe)
Tracking benchmark drift / contamination (that's bench-rot)
Ranking models on a leaderboard (that's eval-leaderboard)
Single-task probes like NIAH (that's needle-board)
Authentication, payments, multi-tenancy