bench-rot
Live trust index for AI benchmarks. Are the leaderboards still measuring what they claim?
bench-rot scores how trustworthy each major AI benchmark still is, based on
real-time signals: contamination preprints, reward-hacking incidents, vendor
abandonment, and benchmark-repo health. It is a meta-tracker: it does not run
benchmarks; it watches the integrity of the benchmarks themselves.
Released May 2026, in the wake of OpenAI publicly retiring SWE-bench Verified
and a wave of papers documenting reward hacking on GAIA, WebArena, OSWorld,
Terminal-Bench, and more.
What it does
For each of 20 tracked benchmarks (SWE-bench, SWE-bench Verified, SWE-bench Pro,
SWE-bench Live, HumanEval, HumanEval+, LiveCodeBench, Aider Polyglot, MMLU,
MMLU-Pro, GPQA, FrontierMath, AIME, ARC-AGI, GAIA, WebArena, OSWorld,
Terminal-Bench, Berkeley Function Calling, LiveBench), bench-rot:
- Pulls the last 365 days of arXiv preprints that match the benchmark name
  together with contamination/leakage or reward-hacking keywords.
- Pulls the last 365 days of Hacker News stories that match the benchmark
  name together with exploit / concern keywords.
- Pulls GitHub repo health (stars, open issues, last commit on default
  branch) for the benchmark's source repo.
- Combines those signals plus deterministic registry facts (does a Pro/V2/Live
  successor exist? did a frontier vendor publicly stop reporting?) into a
  transparent Trust Score in the range 0–100, with a per-row breakdown that
  shows exactly which signals docked which points.
- Renders an interactive dashboard (sortable table, drawer with timeline,
  incidents feed, methodology page) and exposes shields-style **embeddable
  trust badges** at /bench-rot/badge/<slug>.svg.
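The two keyword searches above could be expressed as query URLs roughly like this. It is a minimal sketch, not the project's actual fetcher code: the function names and the keyword list are hypothetical. The endpoints are the arXiv and HN Algolia ones this README lists, and the query parameters (`search_query`, `sortBy`, `sortOrder`, `max_results` for arXiv; `query`, `tags`, `numericFilters` for HN Algolia) are parameters those public APIs document.

```javascript
// Hypothetical keyword list; the project's real list is not shown in this README.
const CONTAMINATION_KEYWORDS = ['contamination', 'leakage'];

const DAY_SECONDS = 24 * 60 * 60;

// Build an arXiv API query: benchmark name AND (any of the keywords).
// In this sketch the 365-day window would be applied afterwards, by
// filtering on each returned entry's published date.
function arxivQueryUrl(benchmark, keywords, maxResults = 50) {
  const anyKeyword = keywords.map((k) => `all:"${k}"`).join(' OR ');
  const params = new URLSearchParams({
    search_query: `all:"${benchmark}" AND (${anyKeyword})`,
    sortBy: 'submittedDate',
    sortOrder: 'descending',
    max_results: String(maxResults),
  });
  return `http://export.arxiv.org/api/query?${params}`;
}

// Build an HN Algolia query for stories created within the last `days` days.
function hnQueryUrl(benchmark, keyword, nowMs = Date.now(), days = 365) {
  const cutoff = Math.floor(nowMs / 1000) - days * DAY_SECONDS;
  const params = new URLSearchParams({
    query: `${benchmark} ${keyword}`,
    tags: 'story',
    numericFilters: `created_at_i>${cutoff}`,
  });
  return `https://hn.algolia.com/api/v1/search_by_date?${params}`;
}
```

Calling `arxivQueryUrl('GAIA', CONTAMINATION_KEYWORDS)` yields a URL whose results still need Atom XML parsing and keyword classification before they count toward a score.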
Real, public data sources (NO mock, NO seed)
| Source | Endpoint | What we pull | Refresh |
|---|---|---|---|
| arXiv | http://export.arxiv.org/api/query | Atom XML preprint metadata | every 6 h |
| HN Algolia | https://hn.algolia.com/api/v1/search_by_date | stories from last 365 days | every 30 min |
| GitHub REST | https://api.github.com/repos/{owner}/{repo} | stars, issues, default branch | every 6 h |
| GitHub REST | https://api.github.com/repos/{owner}/{repo}/commits?per_page=1 | most recent commit timestamp | every 6 h |
All endpoints are public and accept anonymous requests, so the MVP needs no
API keys. Setting the optional GITHUB_TOKEN environment variable raises
GitHub's anonymous rate limit, but it is not required.
If a fetcher fails, the dashboard falls back to deterministic registry signals
(successor, frontier abandonment) and shows a banner; it never substitutes
fabricated data.
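The optional token and the fallback rule might look like the following sketch. The helper names (`githubHeaders`, `fetchWithFallback`), the `degraded` flag, and the shape of the returned object are assumptions for illustration, not the project's actual code. The `Authorization: Bearer` and `Accept` headers are the standard ones for the GitHub REST API.

```javascript
// Build GitHub REST headers; the token is optional (hypothetical helper).
function githubHeaders(env = process.env) {
  const headers = {
    Accept: 'application/vnd.github+json',
    'User-Agent': 'bench-rot', // GitHub expects a User-Agent on API requests
  };
  if (env.GITHUB_TOKEN) {
    headers.Authorization = `Bearer ${env.GITHUB_TOKEN}`;
  }
  return headers;
}

// Run a fetcher; on failure return only the deterministic registry signals
// and mark the result as degraded so the UI can show a banner.
async function fetchWithFallback(fetcher, registrySignals) {
  try {
    const live = await fetcher();
    return { degraded: false, signals: { ...registrySignals, ...live } };
  } catch (err) {
    return {
      degraded: true,
      signals: { ...registrySignals },
      error: String((err && err.message) || err),
    };
  }
}
```

The point of the design is that a failed fetch contributes no live counts at all, so the score can only be computed from facts that are still known to be true.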
Trust Score formula (deterministic, explainable)
score = 100
score -= 20 * contamination_papers_365d
score -= 10 * reward_hack_papers_365d
score -= 10 * hn_exploit_stories_365d
score -= 5 * hn_concern_stories_365d
score -= 15 if has_successor
score -= 25 if frontier_status == 'true'
score -= 10 if frontier_status == 'partial'
score -= 5 if repo exists and last commit > 180 days ago
score += 5 if repo exists and last commit within 30 days
clamp [0, 100]
Tiers: 80–100 Trustworthy (green), 60–79 Caution (yellow), 40–59 Eroded
(orange), 0–39 Compromised (red).
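The formula and tiers above translate almost line for line into code. The sketch below is one possible reading, not the project's implementation: the input object shape and the names `trustScore`, `tierFor`, and `last_commit_age_days` are assumptions (`null` is assumed to mean the benchmark has no tracked repo), and `frontier_status` is compared against the strings `'true'` and `'partial'` exactly as the formula writes them.

```javascript
// Compute the Trust Score from a signals object (hypothetical shape).
function trustScore(s) {
  let score = 100;
  score -= 20 * s.contamination_papers_365d;
  score -= 10 * s.reward_hack_papers_365d;
  score -= 10 * s.hn_exploit_stories_365d;
  score -= 5 * s.hn_concern_stories_365d;
  if (s.has_successor) score -= 15;
  if (s.frontier_status === 'true') score -= 25;
  if (s.frontier_status === 'partial') score -= 10;
  const age = s.last_commit_age_days; // null when there is no repo
  if (age !== null && age > 180) score -= 5;
  if (age !== null && age <= 30) score += 5;
  return Math.max(0, Math.min(100, score)); // clamp to [0, 100]
}

// Map a score to its tier name.
function tierFor(score) {
  if (score >= 80) return 'Trustworthy';
  if (score >= 60) return 'Caution';
  if (score >= 40) return 'Eroded';
  return 'Compromised';
}
```

As a worked example, a benchmark with one contamination paper, two HN concern stories, a successor, partial frontier abandonment, and a last commit 200 days ago scores 100 - 20 - 10 - 15 - 10 - 5 = 40, which lands in the Eroded tier.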
Run locally
cp .env.example .env
npm install
node server.js
Open http://localhost:4735/bench-rot/.
On first boot the database is seeded with the registry; all three fetchers
kick off in the background. The dashboard is usable immediately and populates
within a minute or two as fetchers return.
Endpoints (all public, no auth)
| Path | Description |
|---|---|
| GET /bench-rot/ | SPA shell |
| GET /bench-rot/health | {ok, ts, db_ok, benchmark_count, last_fetcher_runs} |
| GET /bench-rot/api/benchmarks | Current Trust Scores for every benchmark |
| GET /bench-rot/api/benchmarks/:slug | Detail with breakdown, papers, HN, history |
| GET /bench-rot/api/incidents | Combined feed of recent papers + HN stories |
| GET /bench-rot/api/stats | Aggregate counts and tier breakdown |
| GET /bench-rot/api/refresh-status | Fetcher run summary |
| GET /bench-rot/api/methodology | Source list, formula, tier table |
| GET /bench-rot/api/export.csv | CSV of current scores |
| GET /bench-rot/badge/:slug.svg | Shields-style embeddable trust badge |
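As a usage sketch, a small client for these endpoints might look like the following. It assumes the local defaults from this README (port 4735, base path /bench-rot) and Node 20's built-in `fetch`. The helper names and the example slug `swe-bench` are hypothetical, and the JSON shape of the responses is not specified here, so the code only decodes it and returns it as-is.

```javascript
const BASE = 'http://localhost:4735/bench-rot';

// Build the detail URL for one benchmark slug.
function benchmarkUrl(slug, base = BASE) {
  return `${base}/api/benchmarks/${encodeURIComponent(slug)}`;
}

// Build a Markdown image embed for the trust badge.
function badgeMarkdown(slug, base = BASE) {
  return `![bench-rot trust](${base}/badge/${encodeURIComponent(slug)}.svg)`;
}

// Fetch and decode one benchmark's detail record.
async function getBenchmark(slug, base = BASE) {
  const res = await fetch(benchmarkUrl(slug, base));
  if (!res.ok) throw new Error(`bench-rot API returned HTTP ${res.status}`);
  return res.json();
}

// With the server running locally (slug is a guess at the naming scheme):
// getBenchmark('swe-bench').then(console.log);
```

For a deployed instance, pass its public origin plus /bench-rot as `base` so that badge embeds resolve outside localhost.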
Stack
- Node.js 20+ · Express 4 · better-sqlite3 (WAL) · node-cron · helmet ·
  compression
- Vanilla JS SPA, no build step. Dark theme, Inter + JetBrains Mono.
Deployment
DEPLOY_MANIFEST.json is consumed by the RNDLAB orchestrator (rsync → systemd
→ nginx → showcase POST). The Mac Mini watcher picks up the trigger from
cowork-deploy-bridge/queue.jsonl.
Runtime port: 4735. Base path: /bench-rot.
Why this exists
Existing leaderboards score models on benchmarks. bench-rot flips it: the
benchmarks themselves are the rows. PMs and researchers picking what to
test against in May 2026 deserve a single place to ask "is this still a fair
fight?" — and a transparent score, not vibes.
License
MIT.