contam-gap
Cross-split contamination-gap leaderboard for AI coding models. If a model crushes SWE-Bench Verified but flops on Multilingual, it's not a coder — it's a memorizer.
Live: https://holyai.me/contam-gap/
Port: 4869
Auth: None — every endpoint is public.
What it does
For each AI coding model that has been submitted to multiple SWE-bench/experiments splits, contam-gap shows the gap (max_gap, in percentage points) between its highest score (typically on the contaminated Verified split) and its lowest score (typically on a newer, less-contaminated split). The bigger the gap, the more likely the model is overfit to public test data rather than genuinely capable.
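The gap described above is a max-minus-min over a model's per-split scores. A minimal sketch, assuming each score is a percent-resolved value between 0 and 100; the function name `computeMaxGap`, the `{ split, score }` input shape, and the example numbers are illustrative and not taken from the project's code:

```javascript
// Hypothetical input: one { split, score } entry per split the model was submitted to.
// score is assumed to be percent resolved (0..100).
function computeMaxGap(scores) {
  if (scores.length < 2) return null; // a gap needs at least two splits
  const values = scores.map((s) => s.score);
  return Math.max(...values) - Math.min(...values);
}

// Made-up scores, for illustration only.
const example = [
  { split: 'verified', score: 70 },
  { split: 'multilingual', score: 45 },
  { split: 'bash-only', score: 60 },
];

console.log(computeMaxGap(example)); // 25
```

Returning `null` for a single split keeps "no gap measurable" distinct from a gap of 0.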
| Verdict | Rule |
|---|---|
| Generalizes | splits ≥ 2 and max_gap ≤ 8 pts |
| Suspicious | splits ≥ 2 and 8 pts < max_gap ≤ 20 pts |
| Overfit | splits ≥ 2 and max_gap > 20 pts |
| Insufficient data | only one split available |
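The table maps directly onto a small classifier. The sketch below follows the thresholds as written; the function name `verdictFor` and its two-argument signature are assumptions for illustration, not the project's actual API:

```javascript
// splits: number of splits the model appears in
// maxGap: highest score minus lowest score, in percentage points
function verdictFor(splits, maxGap) {
  if (splits < 2) return 'Insufficient data';
  if (maxGap <= 8) return 'Generalizes';
  if (maxGap <= 20) return 'Suspicious';
  return 'Overfit';
}

console.log(verdictFor(3, 8));  // Generalizes (8 is inclusive)
console.log(verdictFor(3, 20)); // Suspicious (20 is inclusive)
console.log(verdictFor(3, 21)); // Overfit
console.log(verdictFor(1, 50)); // Insufficient data
```

Checking the split count first means a single-split model never receives a verdict based on a gap it cannot have.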
Why it exists (May 2026)
- SWE-Bench Pro (Scale AI, May 2026) showed top agents drop ~35 points from Verified to Pro — a clear contamination signal.
- The SWE-Bench team has been shipping fresh, harder splits (Multilingual, covering 9 programming languages; Bash-only; Multimodal), but nobody surfaces the per-model gap.
- Engineers picking a coding model for production want one number: "Will this still work on code that's not in its training set?" contam-gap answers it.
Data sources
| Source | URL | Frequency | On failure |
|---|---|---|---|
| GitHub Contents API | https://api.github.com/repos/SWE-bench/experiments/contents/evaluation/{split} | every 6 h | retain last good listing; mark last_fetch_ok=0; retry next cycle |
| Raw metadata.yaml per submission | https://raw.githubusercontent.com/SWE-bench/experiments/main/evaluation/{split}/{folder}/metadata.yaml | per new/changed folder | skip submission, log, retry |
| Raw README.md per submission | https://raw.githubusercontent.com/SWE-bench/experiments/main/evaluation/{split}/{folder}/README.md | per new/changed folder | skip submission, log, retry |
GITHUB_TOKEN is optional; without it we use the unauthenticated 60 req/h budget and rely on SHA caching so only changed folders are re-fetched.
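The SHA caching idea can be sketched as a diff between the latest Contents API listing and the SHAs seen on the previous cycle. The GitHub Contents API returns one object per directory entry with `name`, `type`, and `sha` fields. How contam-gap stores its cache is not stated here, so the `Map` below is a hypothetical in-memory stand-in, and the folder names and SHA values are made up:

```javascript
// listing: array returned by the Contents API for evaluation/{split}
// cache:   Map of folder name -> sha seen on the previous cycle (assumed structure)
function changedFolders(listing, cache) {
  return listing
    .filter((entry) => entry.type === 'dir')                // submissions are folders
    .filter((entry) => cache.get(entry.name) !== entry.sha) // new or modified
    .map((entry) => entry.name);
}

// Illustrative data only.
const cache = new Map([
  ['submission-a', 'aaa111'],
  ['submission-b', 'bbb222'],
]);
const listing = [
  { name: 'submission-a', type: 'dir', sha: 'aaa111' },  // unchanged
  { name: 'submission-b', type: 'dir', sha: 'bbb999' },  // changed
  { name: 'submission-c', type: 'dir', sha: 'ccc333' },  // new
  { name: 'README.md', type: 'file', sha: 'ddd444' },    // not a submission
];

console.log(changedFolders(listing, cache)); // [ 'submission-b', 'submission-c' ]
```

Only the folders returned here would need their metadata.yaml and README.md re-fetched, which is what keeps a cycle inside the unauthenticated request budget.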
Endpoints
All endpoints sit under /contam-gap. None require auth.
GET /contam-gap/health
GET /contam-gap/api/leaderboard?sort=gap|verified|splits|model&min_splits=1..4&verdict=Overfit
GET /contam-gap/api/model/:canon_model
GET /contam-gap/api/feed
GET /contam-gap/api/stats
POST /contam-gap/api/refresh # fire-and-forget manual refresh
GET /contam-gap/card/:canon_model.svg
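As a usage illustration, the leaderboard query string can be assembled with the standard `URL` API. The helper name `leaderboardUrl` and its option names are hypothetical; only the path and the `sort`, `min_splits`, and `verdict` parameters come from the endpoint list above. The response format is not documented here, so this sketch stops at building the URL:

```javascript
// Builds a leaderboard URL; parameters left undefined are omitted from the query.
function leaderboardUrl(base, { sort, minSplits, verdict } = {}) {
  const url = new URL('/contam-gap/api/leaderboard', base);
  if (sort !== undefined) url.searchParams.set('sort', sort);
  if (minSplits !== undefined) url.searchParams.set('min_splits', String(minSplits));
  if (verdict !== undefined) url.searchParams.set('verdict', verdict);
  return url.toString();
}

console.log(
  leaderboardUrl('http://localhost:4869', { sort: 'gap', minSplits: 2, verdict: 'Overfit' })
);
// http://localhost:4869/contam-gap/api/leaderboard?sort=gap&min_splits=2&verdict=Overfit
```

On Node.js 18+ the resulting string can be passed to the built-in `fetch`.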
Running locally
cp .env.example .env
npm install
node server.js
# http://localhost:4869/contam-gap/
On startup, the server runs an ingest immediately so the leaderboard isn't empty.
Stack
Node.js 18+ · Express · helmet · compression · better-sqlite3 (WAL) · node-cron · js-yaml. Frontend is vanilla JS, dark theme, no build step.
Rules of the house
- No mock data. No Math.random jitter. If GitHub is down, the leaderboard shows yesterday's data with a note.
- No auth, ever. The admin needs to inspect things at speed.
- English everywhere. International audience.