bench-rot

Live trust index for AI benchmarks — are the leaderboards still measuring what they claim?

bench-rot scores how trustworthy each major AI benchmark still is, based on
public signals refreshed every 30 minutes to 6 hours: contamination preprints,
reward-hacking incidents, vendor abandonment, and benchmark-repo health. It is
a meta-tracker: it does not run benchmarks; it watches the integrity of the
benchmarks themselves.

Released May 2026, in the wake of OpenAI publicly retiring SWE-bench Verified
and a wave of papers documenting reward hacking on GAIA, WebArena, OSWorld,
Terminal-Bench, and more.

What it does

For each of 20 tracked benchmarks (SWE-bench, SWE-bench Verified, SWE-bench Pro,
SWE-bench Live, HumanEval, HumanEval+, LiveCodeBench, Aider Polyglot, MMLU,
MMLU-Pro, GPQA, FrontierMath, AIME, ARC-AGI, GAIA, WebArena, OSWorld,
Terminal-Bench, Berkeley Function Calling, LiveBench), bench-rot:

  1. Pulls the last 365 days of arXiv preprints that match the benchmark name
     together with contamination/leakage or reward-hacking keywords.
  2. Pulls the last 365 days of Hacker News stories that match the benchmark
     name together with exploit / concern keywords.
  3. Pulls GitHub repo health (stars, open issues, last commit on the default
     branch) for the benchmark's source repo.
  4. Combines those signals plus deterministic registry facts (does a Pro/V2/Live
     successor exist? did a frontier vendor publicly stop reporting?) into a
     transparent Trust Score in the range 0–100, with a per-row breakdown that
     shows exactly which signals docked which points.
  5. Renders an interactive dashboard (sortable table, drawer with timeline,
     incidents feed, methodology page) and exposes shields-style **embeddable
     trust badges** at /bench-rot/badge/<slug>.svg.

Real, public data sources (NO mock, NO seed)

| Source | Endpoint | What we pull | Refresh |
|---|---|---|---|
| arXiv | http://export.arxiv.org/api/query | Atom XML preprint metadata | every 6 h |
| HN Algolia | https://hn.algolia.com/api/v1/search_by_date | stories from last 365 days | every 30 min |
| GitHub REST | https://api.github.com/repos/{owner}/{repo} | stars, issues, default branch | every 6 h |
| GitHub REST | https://api.github.com/repos/{owner}/{repo}/commits?per_page=1 | most recent commit timestamp | every 6 h |

All endpoints are public and work anonymously. No API keys are required for the
MVP. Optionally, setting a GITHUB_TOKEN environment variable raises GitHub's
anonymous rate limit, but it is not required.
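
As a rough illustration of the requests in the table above, the sketch below builds the arXiv and HN Algolia query URLs and the GitHub request headers. The function names, the keyword lists, the 50-result cap, and the User-Agent string are assumptions made for this example, not the project's actual code. The arXiv query here only sorts by submission date, so the 365-day window would still need to be applied to the returned entries.

```javascript
// Sketch only: names, keywords, and limits below are illustrative assumptions.
const DAY_SECONDS = 24 * 60 * 60;

// arXiv API query: benchmark name AND any of the given keywords, newest first.
function buildArxivUrl(benchmarkName, keywords, maxResults = 50) {
  const keywordClause = keywords.map((k) => `all:"${k}"`).join(" OR ");
  const params = new URLSearchParams({
    search_query: `all:"${benchmarkName}" AND (${keywordClause})`,
    sortBy: "submittedDate",
    sortOrder: "descending",
    max_results: String(maxResults),
  });
  return `http://export.arxiv.org/api/query?${params}`;
}

// HN Algolia: stories created in the last 365 days matching name + keyword.
function buildHnUrl(benchmarkName, keyword, nowSeconds) {
  const since = nowSeconds - 365 * DAY_SECONDS;
  const params = new URLSearchParams({
    query: `${benchmarkName} ${keyword}`,
    tags: "story",
    numericFilters: `created_at_i>${since}`,
  });
  return `https://hn.algolia.com/api/v1/search_by_date?${params}`;
}

// GitHub REST headers; the token is optional and only raises the rate limit.
function githubHeaders(token) {
  const headers = {
    Accept: "application/vnd.github+json",
    "User-Agent": "bench-rot-sketch", // hypothetical client name
  };
  if (token) headers.Authorization = `Bearer ${token}`;
  return headers;
}
```

A caller would pass `process.env.GITHUB_TOKEN` to `githubHeaders`, which leaves the Authorization header off when the variable is unset.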

If a fetcher fails, the dashboard falls back to deterministic registry signals
(successor, frontier abandonment) and shows a banner. It never substitutes
fabricated data.
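
One way to picture that fallback is sketched below, under the assumption that each fetcher yields either an object of signal counts or `null` on failure. The `mergeSignals` function, the `degraded` and `banner` fields, and the banner wording are all hypothetical. The point is only that a failed fetcher contributes nothing, while the registry facts always pass through.

```javascript
// Sketch only: mergeSignals and its field names are illustrative assumptions.
// `fetched` maps fetcher name -> signal object, or null if that fetcher failed.
function mergeSignals(registryFacts, fetched) {
  const failed = Object.keys(fetched).filter((name) => fetched[name] === null);
  return {
    ...registryFacts, // deterministic facts, e.g. has_successor, frontier_status
    ...(fetched.arxiv ?? {}),
    ...(fetched.hn ?? {}),
    ...(fetched.github ?? {}),
    degraded: failed.length > 0,
    // No placeholder values are invented for failed sources; only a banner is set.
    banner:
      failed.length > 0
        ? `Live data unavailable for: ${failed.join(", ")}`
        : null,
  };
}
```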

Trust Score formula (deterministic, explainable)

score = 100
score -= 20 * contamination_papers_365d
score -= 10 * reward_hack_papers_365d
score -= 10 * hn_exploit_stories_365d
score -=  5 * hn_concern_stories_365d
score -= 15 if has_successor
score -= 25 if frontier_status == 'true'
score -= 10 if frontier_status == 'partial'
score -=  5 if repo exists and last commit > 180 days ago
score +=  5 if repo exists and last commit within 30 days
clamp [0, 100]

Tiers: 80–100 Trustworthy (green), 60–79 Caution (yellow), 40–59 Eroded
(orange), 0–39 Compromised (red).
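
The formula and tier table above translate fairly directly into code. The following is a minimal sketch rather than the project's implementation: the `trustScore` and `tier` names and the `days_since_last_commit` field (with `null` meaning the benchmark has no source repo) are assumptions, while the weights and thresholds are taken from the formula as written.

```javascript
// Sketch only: function and field names are assumptions; weights come from the formula.
function trustScore(s) {
  let score = 100;
  score -= 20 * s.contamination_papers_365d;
  score -= 10 * s.reward_hack_papers_365d;
  score -= 10 * s.hn_exploit_stories_365d;
  score -= 5 * s.hn_concern_stories_365d;
  if (s.has_successor) score -= 15;
  if (s.frontier_status === "true") score -= 25;
  if (s.frontier_status === "partial") score -= 10;
  // Repo adjustments apply only when a repo exists (null = no repo).
  if (s.days_since_last_commit !== null) {
    if (s.days_since_last_commit > 180) score -= 5;
    if (s.days_since_last_commit <= 30) score += 5;
  }
  return Math.min(100, Math.max(0, score)); // clamp to [0, 100]
}

function tier(score) {
  if (score >= 80) return "Trustworthy";
  if (score >= 60) return "Caution";
  if (score >= 40) return "Eroded";
  return "Compromised";
}

// Worked example: 100 - 20 - 10 - 0 - 10 - 15 - 10 + 5 = 40
const example = {
  contamination_papers_365d: 1,
  reward_hack_papers_365d: 1,
  hn_exploit_stories_365d: 0,
  hn_concern_stories_365d: 2,
  has_successor: true,
  frontier_status: "partial",
  days_since_last_commit: 10,
};
console.log(trustScore(example), tier(trustScore(example))); // 40 Eroded
```

Because the score is clamped, a benchmark with no negative signals and a commit in the last 30 days still reports 100, not 105.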

Run locally

cp .env.example .env
npm install
node server.js

Open http://localhost:4735/bench-rot/.

On first boot the database is seeded with the benchmark registry (the list of
tracked benchmarks and their deterministic facts, not signal data), and all
three fetchers kick off in the background. The dashboard is usable immediately
and populates within a minute or two as fetchers return.

Endpoints (all public, no auth)

| Path | Description |
|---|---|
| GET /bench-rot/ | SPA shell |
| GET /bench-rot/health | {ok, ts, db_ok, benchmark_count, last_fetcher_runs} |
| GET /bench-rot/api/benchmarks | Current Trust Scores for every benchmark |
| GET /bench-rot/api/benchmarks/:slug | Detail with breakdown, papers, HN, history |
| GET /bench-rot/api/incidents | Combined feed of recent papers + HN stories |
| GET /bench-rot/api/stats | Aggregate counts and tier breakdown |
| GET /bench-rot/api/refresh-status | Fetcher run summary |
| GET /bench-rot/api/methodology | Source list, formula, tier table |
| GET /bench-rot/api/export.csv | CSV of current scores |
| GET /bench-rot/badge/:slug.svg | Shields-style embeddable trust badge |
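
For the badge route, a small helper like the one below could generate an embeddable Markdown snippet. This is a sketch under assumptions: the helper names, the alt text, and the example slug `swe-bench-verified` are made up for illustration, and the origin shown is the local development address from the run instructions.

```javascript
// Sketch only: helper names, alt text, and the example slug are assumptions.
function badgeUrl(origin, slug) {
  const base = origin.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/bench-rot/badge/${encodeURIComponent(slug)}.svg`;
}

function badgeMarkdown(origin, slug) {
  return `![bench-rot trust: ${slug}](${badgeUrl(origin, slug)})`;
}

console.log(badgeMarkdown("http://localhost:4735", "swe-bench-verified"));
```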

Stack

Deployment

DEPLOY_MANIFEST.json is consumed by the RNDLAB orchestrator (rsync → systemd
→ nginx → showcase POST). The Mac Mini watcher picks up the trigger from
cowork-deploy-bridge/queue.jsonl.

Runtime port: 4735. Base path: /bench-rot.

Why this exists

Existing leaderboards score models on benchmarks. bench-rot flips it: the
benchmarks themselves are the rows. PMs and researchers picking what to
test against in May 2026 deserve a single place to ask "is this still a fair
fight?" — and a transparent score, not vibes.

License

MIT.