agent-uplift
The harness uplift leaderboard. For every frontier LLM, the score boost from picking the best open-source agent harness on SWE-bench Verified, Lite, and Multi-SWE-bench. Real public submissions, refreshed hourly, zero auth.
Live: https://holyai.me/agent-uplift/
Why this exists
In May 2026 the AI coding agent community had a quiet realization: the harness matters more than the weights. Take the same model, wrap it in swe-agent versus aider versus cline versus codex-cli, and the SWE-bench Verified score can move by 5 to 30 points. The official SWE-bench leaderboard lists each submission once, in submission order; it neither groups submissions by model nor surfaces the delta between harnesses.
agent-uplift answers a single question per model:
If I run this model, which harness gives me the biggest free score boost, and by how many points?
What you see
- A per-model leaderboard sorted by uplift, descending.
- Each row: model · bare baseline % · best ceiling % · best harness · uplift (signed) · # submissions.
- Click a row → drawer with every (harness, score) pair for that model, plus raw GitHub links to each metadata.yaml and results.json.
- A second table ranking harnesses by mean uplift delivered across the models they hold ceilings for.
- A "movers" feed of the last 14 days of submissions.
- A one-click 1200×630 OG share card per model — Twitter-ready.
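
As a rough illustration of the harness table described above, the sketch below groups per-model rows by the harness that holds the ceiling and averages the uplift. The row shape (`model`, `bestHarness`, `uplift`), the function name `rankHarnesses`, and the sample values are all hypothetical; the project's actual schema and query may differ.

```javascript
// Hypothetical row shape: { model, bestHarness, uplift }.
// The sample rows further down are made up for illustration, not real leaderboard data.
function rankHarnesses(rows) {
  const byHarness = new Map();
  for (const { bestHarness, uplift } of rows) {
    const entry = byHarness.get(bestHarness) ?? { sum: 0, models: 0 };
    entry.sum += uplift;
    entry.models += 1;
    byHarness.set(bestHarness, entry);
  }
  // Mean uplift per harness, highest first.
  return [...byHarness.entries()]
    .map(([harness, { sum, models }]) => ({ harness, models, meanUplift: sum / models }))
    .sort((a, b) => b.meanUplift - a.meanUplift);
}

const sampleRows = [
  { model: 'model-a', bestHarness: 'harness-x', uplift: 12 },
  { model: 'model-b', bestHarness: 'harness-x', uplift: 8 },
  { model: 'model-c', bestHarness: 'harness-y', uplift: 15 },
];
console.log(rankHarnesses(sampleRows));
```

A harness only appears in this ranking for models where it holds the ceiling, so a harness with a single strong result can outrank one that wins on many models with smaller gains.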
Data sources (NO MOCKS)
Every datapoint is pulled at runtime from a real, public, no-auth HTTP endpoint:
| Source | URL pattern | Refresh |
| --- | --- | --- |
| swe-bench/experiments — Verified | https://api.github.com/repos/swe-bench/experiments/contents/evaluation/verified | every 60 min |
| swe-bench/experiments — Lite | https://api.github.com/repos/swe-bench/experiments/contents/evaluation/lite | every 60 min |
| per-submission metadata | https://raw.githubusercontent.com/swe-bench/experiments/main/evaluation/<bench>/<dir>/metadata.yaml | on directory diff |
| per-submission results | https://raw.githubusercontent.com/swe-bench/experiments/main/evaluation/<bench>/<dir>/results/results.json | on directory diff |
| multi-swe-bench results | https://api.github.com/repos/multi-swe-bench/multi-swe-bench/contents/results | every 60 min |
| harness repo metadata | https://api.github.com/repos/<owner>/<repo> | every 6 hours |
Every response is ETag-cached in SQLite (http_cache table), so GitHub's unauthenticated rate limit of 60 requests/hour is sufficient. If a GITHUB_TOKEN env var is set, the limit rises to 5000 requests/hour.
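
The ETag caching described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it swaps the SQLite http_cache table for an in-memory `Map`, and the function name `cachedGet` and the injectable `fetchImpl` parameter are assumptions made so the example stays self-contained.

```javascript
// In-memory stand-in for the http_cache table (assumption: the real project uses SQLite).
const httpCache = new Map(); // url -> { etag, body }

async function cachedGet(url, fetchImpl = fetch) {
  const cached = httpCache.get(url);
  const headers = {};
  // Ask the server to reply 304 Not Modified if our cached copy is still current.
  if (cached?.etag) headers['If-None-Match'] = cached.etag;
  // Optional token, read from the GITHUB_TOKEN env var mentioned above.
  if (process.env.GITHUB_TOKEN) headers.Authorization = `Bearer ${process.env.GITHUB_TOKEN}`;

  const res = await fetchImpl(url, { headers });
  if (res.status === 304 && cached) return cached.body; // unchanged: reuse cached body
  if (!res.ok) throw new Error(`GET ${url} failed with status ${res.status}`);

  const body = await res.text();
  const etag = res.headers.get('etag');
  if (etag) httpCache.set(url, { etag, body });
  return body;
}
```

Throwing on a non-OK status lets the caller drop the affected row instead of substituting invented data.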
Methodology
- Baseline per model = lowest-scoring submission whose harness is tagged minimal (agentless, minimal, bare, vanilla, oneshot).
- If no minimal-scaffold submission exists for that model, baseline = lowest-scoring submission across all harnesses for that model (the UI labels this case explicitly).
- Ceiling per model = highest-scoring submission across all harnesses for that model.
- Uplift = ceiling − baseline.
- No mocks. No seeded fallback. If a source 404s, the row is dropped — never invented.
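
The rules above can be expressed as a short function. This is a sketch under stated assumptions: each submission is assumed to carry a `tag` field naming its scaffold type, and the names `computeUplift`, `MINIMAL_TAGS`, and `baselineIsFallback` are illustrative rather than taken from the project.

```javascript
// Tags treated as "minimal" scaffolds, as listed in the methodology.
const MINIMAL_TAGS = new Set(['agentless', 'minimal', 'bare', 'vanilla', 'oneshot']);

// submissions: [{ harness, tag, score }] for one model (hypothetical shape).
function computeUplift(submissions) {
  if (submissions.length === 0) return null; // no data: report nothing, never invent a row
  const byScore = [...submissions].sort((a, b) => a.score - b.score);
  const minimal = byScore.filter((s) => MINIMAL_TAGS.has(s.tag));
  // Baseline: lowest minimal-scaffold score, else lowest score overall.
  const baseline = minimal.length > 0 ? minimal[0] : byScore[0];
  const ceiling = byScore[byScore.length - 1];
  return {
    baseline: baseline.score,
    ceiling: ceiling.score,
    bestHarness: ceiling.harness,
    uplift: ceiling.score - baseline.score,
    baselineIsFallback: minimal.length === 0, // lets a UI label the fallback case
    submissions: submissions.length,
  };
}
```

For example, with made-up scores of 30 (tagged agentless), 41.5, and 52, the baseline is 30, the ceiling is 52, and the uplift is 22 points. Without the agentless entry, the baseline falls back to 41.5 and the uplift shrinks to 10.5.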
Tech
- Node.js 20 + Express
- better-sqlite3 in WAL mode
- node-cron
- helmet + compression
- Vanilla JS SPA (no framework, no build step), dark theme, English UI
- Mounted under BASE_PATH = /agent-uplift
- Zero auth. Every endpoint, read and write, is public.
Running locally
npm install
PORT=4896 node server.js
# open http://localhost:4896/agent-uplift/
Endpoints
GET /agent-uplift/ SPA shell
GET /agent-uplift/health liveness + cron health JSON
GET /agent-uplift/api/uplift per-model uplift leaderboard
GET /agent-uplift/api/model/:model per-model detail + raw submission links
GET /agent-uplift/api/harness/:harness per-harness view
GET /agent-uplift/api/harnesses harness impact table
GET /agent-uplift/api/movers new submissions in window
GET /agent-uplift/api/stats aggregate stats
GET /agent-uplift/api/methodology methodology JSON
GET /agent-uplift/methodology methodology HTML
GET /agent-uplift/share/:model.svg 1200×630 OG share card
POST /agent-uplift/api/refresh trigger an out-of-cycle refresh (rate-limited 1/min/IP)
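
A minimal client sketch for the read endpoints above, assuming the local server from "Running locally" is listening on port 4896. The response schemas are not documented here, so the example only fetches and returns the parsed JSON; the helper names `modelUrl` and `getJson` are hypothetical.

```javascript
// Assumption: the server is running locally as described under "Running locally".
const BASE_URL = 'http://localhost:4896/agent-uplift';

// Build the per-model detail URL; model names may contain characters
// such as "/" or spaces, so they are percent-encoded.
function modelUrl(model, base = BASE_URL) {
  return `${base}/api/model/${encodeURIComponent(model)}`;
}

async function getJson(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) throw new Error(`GET ${url} failed with status ${res.status}`);
  return res.json();
}

// Usage (requires a running server):
//   getJson(`${BASE_URL}/api/uplift`).then(console.log);
//   getJson(modelUrl('some-model')).then(console.log);
```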
License
MIT. Part of the Holy AI / Cowork R&D fleet.