bench-picker

Run 5 benchmarks instead of 57 — minimum subset that predicts the rest at R²≥0.90

Tags: ai, llm, benchmarks, evaluation, mteb, livebench, submodular, open-llm-leaderboard

Picks the minimum subset of LLM benchmarks you need to run to predict the rest
within a chosen R² (e.g. ≥ 0.90), using submodular greedy mutual-information
selection on real, live leaderboard data.

The core insight, from Krause/Singh/Guestrin's 2008 sensor-placement paper
applied to benchmark suites: if benchmark scores are highly correlated across
models (and they are), most of an N-task suite is informationally redundant.
A small, carefully chosen subset will predict the rest.

bench-picker turns that idea into a one-click tool: pick a benchmark family
(HF Open LLM Leaderboard, MTEB, LiveBench), pick k, get the exact tasks to
run plus a shareable badge showing GPU-hours saved per checkpoint.

What it does

Data sources

All scores pulled at runtime from public endpoints. Nothing hardcoded.

| Family | URL | Refresh |
|---|---|---|
| HF Open LLM Leaderboard v2 | https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard/contents (paginated) | weekly Mon 03:00 UTC, also on-startup if missing/>14d stale |
| MTEB results | https://huggingface.co/datasets/mteb/results (via paths.json + per-task result JSONs) | weekly |
| LiveBench | https://livebench.ai/table_<latest-date>.csv (latest dated CSV, auto-discovered by HEAD probe) | weekly |

Per-family GPU-cost references (from public benchmark docs, not your hardware —
the UI labels them "reference cost, swap your own"):

Dollar figures assume $2.50 per A100-hour (industry-typical on-demand pricing).
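
As a rough illustration of how the savings figures could follow from per-task reference costs, here is a minimal sketch. The task names and hour values are made up, `estimateSavings` is a hypothetical helper rather than the tool's own code, and only the $2.50 rate comes from the text above.

```javascript
// Reference rate quoted above; swap in your own.
const A100_DOLLARS_PER_HOUR = 2.5;

// Hypothetical helper: savings = cost of the full suite minus cost of the chosen subset.
function estimateSavings(taskHours, chosen, rate = A100_DOLLARS_PER_HOUR) {
  const total = Object.values(taskHours).reduce((a, b) => a + b, 0);
  const run = chosen.reduce((sum, task) => {
    if (!(task in taskHours)) throw new Error(`unknown task: ${task}`);
    return sum + taskHours[task];
  }, 0);
  const gpuHoursSaved = total - run;
  return { gpuHoursSaved, dollarSaved: gpuHoursSaved * rate };
}

// Made-up reference hours, for illustration only.
const hours = { taskA: 4, taskB: 2, taskC: 6, taskD: 8 };
const savings = estimateSavings(hours, ["taskA", "taskB"]);
console.log(savings); // { gpuHoursSaved: 14, dollarSaved: 35 }
```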

API endpoints (all under /bench-picker)

| Method | Path | What |
|---|---|---|
| GET | /health | {ok:true} (no auth, no DB) |
| GET | /api/families | list of {slug, name, model_count, task_count, last_fetched_at, status, full_run_hours} |
| GET | /api/family/:slug | family detail + tasks + sample models |
| POST | /api/pick | body {family, k} or {family, target_r2}; returns {chosen[], mean_r2, per_task_r2{}, gpu_hours_saved, dollar_saved, scatter} |
| POST | /api/share | persist a pick, returns {id, url} |
| GET | /api/share/:id | retrieve a stored pick |
| GET | /share/:id | HTML share page (og-image points at badge SVG) |
| GET | /badge/:id.svg | inline SVG badge for model card READMEs |
| GET | /api/stats | header strip stats |
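
A client call to /api/pick might look like the sketch below. It assumes the local server from "Run locally", a JSON request body with a `Content-Type: application/json` header, that the API accepts the same family slugs as the CLI, and that exactly one of `k` or `target_r2` is sent. `buildPickRequest` and `pick` are illustrative names, not part of the project.

```javascript
// Assumed base URL: the local dev server.
const BASE = "http://localhost:4875/bench-picker";

// Build the POST /api/pick request for one of the two documented body shapes.
function buildPickRequest({ family, k, targetR2 }) {
  if ((k == null) === (targetR2 == null)) {
    throw new Error("pass exactly one of k or targetR2");
  }
  const body = k != null ? { family, k } : { family, target_r2: targetR2 };
  return {
    url: `${BASE}/api/pick`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Send the request. Needs a running server and Node 18+ (global fetch).
async function pick(options) {
  const { url, init } = buildPickRequest(options);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`pick failed: HTTP ${res.status}`);
  return res.json(); // documented fields: chosen, mean_r2, per_task_r2, ...
}

const req = buildPickRequest({ family: "mteb", targetR2: 0.9 });
console.log(req.url, req.init.body);
```

With the server running, `pick({ family: "mteb", targetR2: 0.9 }).then(console.log)` would print the response object.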

Algorithm

algo/greedy.js: at each step, pick the column c ∉ S maximizing
0.5 · log(var(c | S) / var(c | V\(S∪{c}))), where S is the chosen set and V is
all columns. Variances come from the column-centered covariance matrix via
Cholesky-based conditional Gaussian variance. This is the greedy algorithm from
Krause/Singh/Guestrin's 2008 sensor-placement paper, which carries the standard
(1 − 1/e) approximation bound for submodular maximization.
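
The greedy step can be sketched as follows. This is not the contents of algo/greedy.js: it assumes a dense M × N score matrix (rows are models, columns are tasks) and, for brevity, computes conditional variances with a Schur complement solved by Gaussian elimination instead of Cholesky. The function names and the jitter value are illustrative choices.

```javascript
// Column-centered sample covariance of an M x N score matrix.
function covariance(X) {
  const m = X.length, n = X[0].length;
  const mean = Array.from({ length: n }, (_, j) => X.reduce((s, r) => s + r[j], 0) / m);
  const C = Array.from({ length: n }, () => new Array(n).fill(0));
  for (const row of X)
    for (let i = 0; i < n; i++)
      for (let j = 0; j < n; j++)
        C[i][j] += ((row[i] - mean[i]) * (row[j] - mean[j])) / (m - 1);
  return C;
}

// Solve A x = b by Gaussian elimination with partial pivoting.
function solve(A, b) {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]);
  for (let col = 0; col < n; col++) {
    let piv = col;
    for (let r = col + 1; r < n; r++)
      if (Math.abs(M[r][col]) > Math.abs(M[piv][col])) piv = r;
    [M[col], M[piv]] = [M[piv], M[col]];
    for (let r = col + 1; r < n; r++) {
      const f = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= f * M[col][c];
    }
  }
  const x = new Array(n).fill(0);
  for (let i = n - 1; i >= 0; i--) {
    let s = M[i][n];
    for (let j = i + 1; j < n; j++) s -= M[i][j] * x[j];
    x[i] = s / M[i][i];
  }
  return x;
}

// var(c | A) = C[c][c] - C[c][A] inv(C[A][A]) C[A][c].
// A small diagonal jitter keeps the system solvable and the result positive.
function condVar(C, c, A, jitter = 1e-9) {
  if (A.length === 0) return C[c][c] + jitter;
  const CAA = A.map((i) => A.map((j) => C[i][j] + (i === j ? jitter : 0)));
  const cA = A.map((i) => C[i][c]);
  const w = solve(CAA, cA);
  const v = C[c][c] + jitter - cA.reduce((s, val, i) => s + val * w[i], 0);
  return Math.max(v, jitter);
}

// Greedily add the column with the largest mutual-information gain.
function greedyPick(X, k) {
  const C = covariance(X);
  const all = C.map((_, i) => i);
  const S = [];
  while (S.length < k) {
    let best = -1, bestGain = -Infinity;
    for (const c of all) {
      if (S.includes(c)) continue;
      const rest = all.filter((j) => j !== c && !S.includes(j));
      const gain = 0.5 * Math.log(condVar(C, c, S) / condVar(C, c, rest));
      if (gain > bestGain) { bestGain = gain; best = c; }
    }
    S.push(best);
  }
  return S;
}

// Synthetic scores: column 1 is a noisy copy of column 0; column 2 is unrelated.
const a = [1, 1, 1, 1, -1, -1, -1, -1];
const b = [1, 1, -1, -1, 1, 1, -1, -1];
const e = [1, -1, 1, -1, 1, -1, 1, -1];
const X = a.map((_, i) => [a[i], a[i] + 0.1 * e[i], b[i]]);
const picks = greedyPick(X, 2);
console.log(picks);
```

On this synthetic matrix the first pick is one of the two near-duplicate columns, because it is highly informative about its twin. The second pick is column 2, since adding the other twin would contribute almost nothing new.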

algo/regression.js: for every held-out task t ∉ S, fits a ridge regression
(λ=1e-3) on the selected columns and predicts t. It uses leave-one-model-out CV
when the model count M ≤ 80, and 5-fold CV otherwise for speed. The per-task R²
values are averaged into a single mean R².
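
A minimal sketch of the held-out evaluation, assuming the same M × N score matrix layout (rows are models). It is not the contents of algo/regression.js: it centers each training fold instead of fitting an explicit intercept, covers only the leave-one-model-out branch, and pools the held-out predictions into one R² per task. All function names are illustrative.

```javascript
// Solve the symmetric positive-definite system A w = b by Gauss-Jordan elimination.
function solve(A, b) {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]);
  for (let c = 0; c < n; c++) {
    const p = M[c][c];
    for (let j = c; j <= n; j++) M[c][j] /= p;
    for (let r = 0; r < n; r++) {
      if (r === c) continue;
      const f = M[r][c];
      for (let j = c; j <= n; j++) M[r][j] -= f * M[c][j];
    }
  }
  return M.map((row) => row[n]);
}

// Fit ridge on centered data; returns a function that predicts y for one row of X.
function fitRidge(X, y, lambda = 1e-3) {
  const m = X.length, p = X[0].length;
  const xMean = Array.from({ length: p }, (_, j) => X.reduce((s, r) => s + r[j], 0) / m);
  const yMean = y.reduce((s, v) => s + v, 0) / m;
  const A = Array.from({ length: p }, (_, i) =>
    Array.from({ length: p }, (_, j) =>
      X.reduce((s, r) => s + (r[i] - xMean[i]) * (r[j] - xMean[j]), 0) + (i === j ? lambda : 0)));
  const b = Array.from({ length: p }, (_, i) =>
    X.reduce((s, r, n) => s + (r[i] - xMean[i]) * (y[n] - yMean), 0));
  const w = solve(A, b);
  return (x) => yMean + x.reduce((s, v, j) => s + w[j] * (v - xMean[j]), 0);
}

// Leave-one-model-out R^2 for predicting column `target` from the columns in S.
function looR2(scores, S, target, lambda = 1e-3) {
  const X = scores.map((row) => S.map((j) => row[j]));
  const y = scores.map((row) => row[target]);
  const yMean = y.reduce((s, v) => s + v, 0) / y.length;
  let ssRes = 0, ssTot = 0;
  for (let i = 0; i < y.length; i++) {
    const keep = (_, n) => n !== i;
    const predict = fitRidge(X.filter(keep), y.filter(keep), lambda);
    ssRes += (y[i] - predict(X[i])) ** 2;
    ssTot += (y[i] - yMean) ** 2;
  }
  return 1 - ssRes / ssTot;
}

// Average the per-task R^2 over every task not in S.
function meanHeldOutR2(scores, S) {
  const held = scores[0].map((_, j) => j).filter((j) => !S.includes(j));
  const perTask = held.map((t) => looR2(scores, S, t));
  return perTask.reduce((s, v) => s + v, 0) / perTask.length;
}

// Synthetic scores: column 1 = 2 * column 0 + 1; column 2 is unrelated.
const scores = [
  [1, 3, 5], [2, 5, 1], [3, 7, 4], [4, 9, 2], [5, 11, 6], [6, 13, 3],
];
const r2Linear = looR2(scores, [0], 1);
const r2Noise = looR2(scores, [0], 2);
console.log(r2Linear.toFixed(4), r2Noise.toFixed(4), meanHeldOutR2(scores, [0]).toFixed(4));
```

The exactly linear task is predicted almost perfectly, while the unrelated task scores poorly and drags the mean down. That low mean is the signal that one selected column is not enough for this suite.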

target_r2 mode linearly scans k = 1..20 and returns the smallest k whose
mean held-out R² meets the threshold; if no k≤20 reaches it, returns the best.
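
The scan can be sketched as below. `smallestK` and the `evaluate` callback are hypothetical, the R² curve is made up, and "the best" is read here as the k with the highest mean R² (the smallest such k on ties), which the text above does not spell out.

```javascript
// evaluate(k) is assumed to return the mean held-out R^2 of a k-task pick.
function smallestK(evaluate, targetR2, maxK = 20) {
  let best = { k: 0, meanR2: -Infinity, met: false };
  for (let k = 1; k <= maxK; k++) {
    const meanR2 = evaluate(k);
    if (meanR2 >= targetR2) return { k, meanR2, met: true };
    if (meanR2 > best.meanR2) best = { k, meanR2, met: false };
  }
  return best; // threshold never reached: fall back to the best k seen
}

// Made-up R^2 curve that plateaus after k = 5.
const curve = [0.62, 0.81, 0.88, 0.91, 0.93];
const evalK = (k) => curve[Math.min(k, curve.length) - 1];

const hit = smallestK(evalK, 0.9);
const miss = smallestK(evalK, 0.99);
console.log(hit);  // { k: 4, meanR2: 0.91, met: true }
console.log(miss); // { k: 5, meanR2: 0.93, met: false }
```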

CLI

node cli.js pick --family=open-llm-v2 --k=2
node cli.js pick --family=mteb --target-r2=0.9
node cli.js refresh --family=livebench
node cli.js list

Run locally

npm install
node server.js
# open http://localhost:4875/bench-picker/

On first boot the server fetches all three leaderboards asynchronously. Open LLM
takes ~10 seconds, LiveBench ~45 seconds (it auto-discovers the latest snapshot
date by HEAD probing), and MTEB ~2 minutes (parallel JSON fetches across
40 models × 24 tasks). A yellow banner appears on the page while sources are
still loading or if any are stale (>14 days).
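
The staleness rule amounts to a timestamp comparison. The sketch below assumes `last_fetched_at` (the field returned by /api/families) is an ISO 8601 string or empty when a family has never been fetched. `isStale` is an illustrative name.

```javascript
const STALE_AFTER_MS = 14 * 24 * 60 * 60 * 1000; // 14 days

// A family needs a refresh if it was never fetched or was fetched more than 14 days ago.
function isStale(lastFetchedAt, now = Date.now()) {
  if (!lastFetchedAt) return true;
  return now - new Date(lastFetchedAt).getTime() > STALE_AFTER_MS;
}

// Fixed "now" so the example is reproducible.
const now = Date.parse("2024-06-30T00:00:00Z");
console.log(isStale("2024-06-20T00:00:00Z", now)); // false (10 days old)
console.log(isStale("2024-06-10T00:00:00Z", now)); // true (20 days old)
console.log(isStale(null, now));                   // true (never fetched)
```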

References

Out of scope