How agent-uplift computes harness uplift

  1. Baseline per model = lowest-scoring submission whose harness is tagged as minimal scaffold (agentless, minimal, bare, vanilla, oneshot).
  2. If no minimal-scaffold submission exists for that model, baseline falls back to the lowest-scoring submission across all harnesses for that model. The UI labels this case explicitly.
  3. Ceiling per model = highest-scoring submission across all harnesses for that model.
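  4. Uplift = ceiling − baseline. Because the ceiling is taken over all of a model's submissions, uplift is never negative, and a model with a single submission has an uplift of 0.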
  5. Every cron pass refreshes the directory listing of swe-bench/experiments and pulls each submission's metadata.yaml and results.json via the raw GitHub content endpoint.
  6. No mock data. No seeded fallback. If a source returns a 404, the row is marked stale; its values are never invented.
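
Steps 1 through 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code agent-uplift actually runs: the Submission class, its field names, the harness_uplift function, and the sample scores are all hypothetical, and matching a harness by exact tag is assumed.

```python
from dataclasses import dataclass

# Tags listed in step 1. Exact-match tagging is an assumption of this sketch.
MINIMAL_TAGS = {"agentless", "minimal", "bare", "vanilla", "oneshot"}


@dataclass
class Submission:
    # Hypothetical record shape; the real metadata fields may differ.
    model: str
    harness_tag: str
    score: float


def harness_uplift(submissions):
    """Return per-model baseline, ceiling, uplift, and a fallback flag."""
    by_model = {}
    for sub in submissions:
        by_model.setdefault(sub.model, []).append(sub)

    rows = {}
    for model, group in by_model.items():
        minimal = [s for s in group if s.harness_tag in MINIMAL_TAGS]
        # Step 2: with no minimal-scaffold submission, fall back to all harnesses.
        pool = minimal or group
        baseline = min(s.score for s in pool)   # steps 1 and 2
        ceiling = max(s.score for s in group)   # step 3
        rows[model] = {
            "baseline": baseline,
            "ceiling": ceiling,
            "uplift": ceiling - baseline,       # step 4
            "fallback": not minimal,            # the case the UI labels
        }
    return rows


# Made-up scores, for illustration only.
sample = [
    Submission("model-a", "agentless", 20.0),
    Submission("model-a", "bare", 25.0),
    Submission("model-a", "custom-agent", 45.0),
    Submission("model-b", "custom-agent", 30.0),
    Submission("model-b", "other-agent", 50.0),
]
result = harness_uplift(sample)
print(result)
```

In this sample, model-a has minimal-scaffold submissions, so its baseline is the lower of those two; model-b has none, so its baseline falls back to its lowest score overall and the fallback flag is set.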
