agent-uplift: the harness uplift leaderboard
Submissions indexed
Models tracked
Harnesses tracked
Mean uplift (pp)
Max uplift (pp)

What is harness uplift?

Two submissions that use the same LLM can score 5 to 30 percentage points apart on SWE-bench Verified depending only on the harness around the model: the tool loop, the context-management policy, the planner. agent-uplift surfaces that delta for every frontier model. Pick a model to see its bare-baseline score, its best-known ceiling, and the harness that reaches that ceiling.
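As a rough illustration of the per-model delta described above, here is a minimal sketch in Python. It assumes the bare baseline is the lowest-scoring submission for a model and the ceiling is the highest-scoring one; the site's actual baseline definition is set out in its methodology and may differ. The `Submission` type, the `per_model_uplift` function, and the sample model and harness names and scores are all hypothetical, not real leaderboard data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    model: str           # LLM behind the submission
    harness: str         # agent harness wrapped around the model
    resolved_pct: float  # percent of SWE-bench Verified instances resolved


def per_model_uplift(submissions):
    """Group submissions by model and compute ceiling minus baseline (pp)."""
    by_model = defaultdict(list)
    for sub in submissions:
        by_model[sub.model].append(sub)

    rows = {}
    for model, subs in by_model.items():
        # ASSUMPTION: baseline = lowest score, ceiling = highest score.
        baseline = min(subs, key=lambda s: s.resolved_pct)
        ceiling = max(subs, key=lambda s: s.resolved_pct)
        rows[model] = {
            "bare_baseline": baseline.resolved_pct,
            "best_ceiling": ceiling.resolved_pct,
            "best_harness": ceiling.harness,
            "uplift_pp": round(ceiling.resolved_pct - baseline.resolved_pct, 1),
            "submissions": len(subs),
        }
    return rows


# Hypothetical sample data, for illustration only.
SAMPLE = [
    Submission("model-a", "harness-x", 40.0),
    Submission("model-a", "harness-y", 55.5),
    Submission("model-a", "harness-z", 62.0),
    Submission("model-b", "harness-x", 30.0),
]

ROWS = per_model_uplift(SAMPLE)
```

A model with a single submission gets an uplift of 0 under this assumption, since its baseline and ceiling coincide.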

All data is pulled hourly from swe-bench/experiments, with no mock data and no seeded fallback. See Methodology.

Per-model uplift leaderboard

Model | Bare baseline | Best ceiling | Best harness | Uplift | Submissions | Share

Harness impact

Mean uplift each harness has delivered
Harness | Mean uplift delivered | Ceilings held | Models touched | Submissions | GitHub

Movers — last 14 days