agent-uplift

The harness uplift leaderboard. For every frontier LLM, the score boost from picking the best open-source agent harness on SWE-bench Verified, Lite, and Multi-SWE-bench. Real public submissions, refreshed hourly, zero auth.

Live: https://holyai.me/agent-uplift/

Why this exists

In May 2026 the AI coding agent community had a quiet realization: the harness matters more than the weights. Take the same model, wrap it in swe-agent versus aider versus cline versus codex-cli, and the SWE-bench Verified score can move by 5 to 30 points. The official SWE-bench leaderboard lists each submission once, in submission order. It does not group submissions by model, and it does not surface the score delta between harnesses for the same model.

agent-uplift answers a single question per model:

If I run this model, which harness gives me the biggest free score boost, and by how many points?

What you see

A per-model leaderboard sorted by uplift, descending.
Each row: model · bare baseline % · best ceiling % · best harness · uplift (signed) · # submissions.
Click a row → drawer with every (harness, score) pair for that model, plus raw GitHub links to each metadata.yaml and results.json.
A second table ranking harnesses by mean uplift delivered across the models they hold ceilings for.
A "movers" feed of the last 14 days of submissions.
A one-click 1200×630 OG share card per model — Twitter-ready.
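
The harness table's ranking ("mean uplift delivered across the models they hold ceilings for") can be sketched as follows. This is only an illustration, not the project's actual code: the function name `rankHarnesses` and the row shape `{ model, bestHarness, uplift }` are assumptions, and the numbers in the usage note are made up.

```javascript
// Sketch: rank harnesses by the mean uplift of the models whose ceiling they hold.
// rows: [{ model, bestHarness, uplift }], one row per model (hypothetical shape).
function rankHarnesses(rows) {
  const totals = new Map(); // harness -> { sum, models }
  for (const { bestHarness, uplift } of rows) {
    const t = totals.get(bestHarness) ?? { sum: 0, models: 0 };
    t.sum += uplift;
    t.models += 1;
    totals.set(bestHarness, t);
  }
  // Convert the running totals to means and sort descending by mean uplift.
  return [...totals.entries()]
    .map(([harness, { sum, models }]) => ({
      harness,
      models,
      meanUplift: sum / models,
    }))
    .sort((a, b) => b.meanUplift - a.meanUplift);
}
```

For example, if a hypothetical harness-a holds the ceiling for two models with uplifts of 10 and 20 points, and harness-b holds one ceiling with 12 points, harness-a ranks first with a mean uplift of 15.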

Data sources (NO MOCKS)

Every datapoint is pulled at runtime from a real, public, no-auth HTTP endpoint:

| Source | URL pattern | Refresh |
| --- | --- | --- |
| swe-bench/experiments — Verified | https://api.github.com/repos/swe-bench/experiments/contents/evaluation/verified | every 60 min |
| swe-bench/experiments — Lite | https://api.github.com/repos/swe-bench/experiments/contents/evaluation/lite | every 60 min |
| per-submission metadata | https://raw.githubusercontent.com/swe-bench/experiments/main/evaluation/<bench>/<dir>/metadata.yaml | on directory diff |
| per-submission results | https://raw.githubusercontent.com/swe-bench/experiments/main/evaluation/<bench>/<dir>/results/results.json | on directory diff |
| multi-swe-bench results | https://api.github.com/repos/multi-swe-bench/multi-swe-bench/contents/results | every 60 min |
| harness repo metadata | https://api.github.com/repos/<owner>/<repo> | every 6 hours |

Every response is ETag-cached in SQLite (the http_cache table), so GitHub's unauthenticated rate limit of 60 requests/hour is sufficient. If a GITHUB_TOKEN environment variable is set, the limit rises to 5,000 requests/hour.
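
A minimal sketch of ETag-based conditional fetching, under stated assumptions: an in-memory `Map` stands in for the SQLite http_cache table, and the function name `cachedFetch`, its parameters, and its return shape are hypothetical rather than taken from the project. It relies on Node.js 20's built-in `fetch` and on standard HTTP semantics (`If-None-Match` request header, `304 Not Modified` response).

```javascript
// Sketch: conditional GET with an ETag cache.
// cache: Map<url, { etag, body }> (stand-in for the http_cache table).
// fetchImpl: injectable for testing; defaults to Node 20's global fetch.
async function cachedFetch(url, cache, fetchImpl = fetch) {
  const cached = cache.get(url);
  const headers = {};
  if (cached) headers["If-None-Match"] = cached.etag;
  // Only send the token to the GitHub API host, never to other origins.
  if (process.env.GITHUB_TOKEN && url.startsWith("https://api.github.com/")) {
    headers.Authorization = `Bearer ${process.env.GITHUB_TOKEN}`;
  }

  const res = await fetchImpl(url, { headers });

  // 304 Not Modified: the server confirms the cached body is still current.
  if (res.status === 304 && cached) {
    return { body: cached.body, fromCache: true };
  }
  if (!res.ok) {
    throw new Error(`GET ${url} failed with status ${res.status}`);
  }

  // Fresh response: store the body with its ETag for the next request.
  const body = await res.text();
  const etag = res.headers.get("etag");
  if (etag) cache.set(url, { etag, body });
  return { body, fromCache: false };
}
```

Throwing on a non-OK status leaves the decision to the caller, which fits the rule that a failing source leads to a dropped row rather than an invented one.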

Methodology

Baseline per model = lowest-scoring submission whose harness is tagged minimal (agentless, minimal, bare, vanilla, oneshot).
If no minimal-scaffold submission exists for that model, baseline = lowest-scoring submission across all harnesses for that model (the UI labels this case explicitly).
Ceiling per model = highest-scoring submission across all harnesses for that model.
Uplift = ceiling − baseline.
No mocks. No seeded fallback. If a source 404s, the row is dropped — never invented.
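
The rules above can be sketched in a few lines. This is an illustration under assumptions, not the project's actual code: the names `computeUplift` and `isMinimal`, the submission shape `{ harness, score }`, and the choice to detect a minimal scaffold by substring match on the harness name are all hypothetical.

```javascript
// Harness-name fragments treated as "minimal" scaffolds (list from the Methodology).
const MINIMAL_TAGS = ["agentless", "minimal", "bare", "vanilla", "oneshot"];

// Assumption: a harness counts as minimal if its name contains one of the tags.
function isMinimal(harness) {
  const name = harness.toLowerCase();
  return MINIMAL_TAGS.some((tag) => name.includes(tag));
}

// submissions: [{ harness: string, score: number }] for ONE model.
function computeUplift(submissions) {
  if (submissions.length === 0) return null; // no data: drop the row, never invent one

  const byScore = [...submissions].sort((a, b) => a.score - b.score);
  const minimal = byScore.filter((s) => isMinimal(s.harness));

  // Baseline: lowest minimal-scaffold score, else lowest score overall.
  const baseline = minimal.length > 0 ? minimal[0] : byScore[0];
  // Ceiling: highest score across all harnesses.
  const ceiling = byScore[byScore.length - 1];

  return {
    baseline: baseline.score,
    ceiling: ceiling.score,
    bestHarness: ceiling.harness,
    uplift: ceiling.score - baseline.score,
    baselineIsFallback: minimal.length === 0, // the UI labels this case
    submissions: submissions.length,
  };
}
```

With made-up scores of 30 for an agentless run, 40 for harness-a, and 45.5 for harness-b, the baseline is 30, the ceiling is 45.5, and the uplift is 15.5 points. Without the agentless run, the baseline falls back to 40 and `baselineIsFallback` is true.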

Tech

Node.js 20 + Express
better-sqlite3 in WAL mode
node-cron
helmet + compression
Vanilla JS SPA (no framework, no build step), dark theme, English UI
Mounted under BASE_PATH = /agent-uplift
Zero auth. Every endpoint — read and write — is public.

Running locally

npm install
PORT=4896 node server.js
# open http://localhost:4896/agent-uplift/

Endpoints

GET  /agent-uplift/                      SPA shell
GET  /agent-uplift/health                liveness + cron health JSON
GET  /agent-uplift/api/uplift            per-model uplift leaderboard
GET  /agent-uplift/api/model/:model      per-model detail + raw submission links
GET  /agent-uplift/api/harness/:harness  per-harness view
GET  /agent-uplift/api/harnesses         harness impact table
GET  /agent-uplift/api/movers            new submissions in window
GET  /agent-uplift/api/stats             aggregate stats
GET  /agent-uplift/api/methodology       methodology JSON
GET  /agent-uplift/methodology           methodology HTML
GET  /agent-uplift/share/:model.svg      1200×630 OG share card
POST /agent-uplift/api/refresh           trigger an out-of-cycle refresh (rate-limited 1/min/IP)
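
The 1/min/IP limit on the refresh endpoint could be enforced with a small Express-style middleware. The sketch below is an assumption about one possible approach, not the project's implementation: the name `oncePerMinutePerIp`, the 429 response body, and the in-memory `Map` are hypothetical. It uses only the standard Express `req.ip`, `res.set`, `res.status`, and `res.json` members.

```javascript
// Sketch: allow one accepted request per IP per window (default 60 s).
// `now` is injectable so the limiter can be tested without real time passing.
function oncePerMinutePerIp(windowMs = 60_000, now = Date.now) {
  // ip -> timestamp of the last accepted request.
  // Note: entries are never evicted in this sketch; a real service would prune them.
  const lastSeen = new Map();

  return function limiter(req, res, next) {
    const t = now();
    const prev = lastSeen.get(req.ip);

    if (prev !== undefined && t - prev < windowMs) {
      // Too soon: tell the client how long to wait, in whole seconds.
      const retryAfter = Math.ceil((windowMs - (t - prev)) / 1000);
      res.set("Retry-After", String(retryAfter));
      return res
        .status(429)
        .json({ error: "rate_limited", retryAfterSeconds: retryAfter });
    }

    lastSeen.set(req.ip, t);
    next();
  };
}
```

In an Express app this would be mounted as `app.post("/agent-uplift/api/refresh", oncePerMinutePerIp(), refreshHandler)`, where `refreshHandler` is a placeholder for whatever triggers the refresh.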

License

MIT. Part of the Holy AI / Cowork R&D fleet.