solve-tax

live $-per-SWE-bench-solve leaderboard

# | Submission | Org | Model tag → matched | Resolve % | $ / attempt | $ / solve | Submitted

$/solve = ($/attempt × 500) / resolved. $/attempt = in_tok × $/in + out_tok × $/out. Submissions with zero resolved instances or no pricing match are listed at the bottom.

# | Submission | Org | Model | Solved in repo | Overall % | $/Mtok in | $/Mtok out

Model | $/Mtok in | $/Mtok out | Context | Refreshed

How $/solve is computed

For every submission on the public SWE-bench Verified leaderboard, we read the resolved-instance count from SWE-bench/experiments/evaluation/verified/<dir>/results/results.json. SWE-bench Verified contains exactly 500 human-validated instances, so the resolve rate is resolved / 500.
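A minimal sketch of this step, assuming each results.json carries a "resolved" key whose value is a list of resolved instance IDs. That key name and shape are assumptions, not confirmed here, and the helper names `resolved_count` and `resolve_rate` are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path

VERIFIED_SIZE = 500  # SWE-bench Verified has exactly 500 instances


def resolved_count(results: dict) -> int:
    """Count resolved instances.

    Assumption: results["resolved"] is a list of instance IDs.
    A missing key is treated as zero resolved.
    """
    return len(results.get("resolved", []))


def resolve_rate(results_path: Path) -> float:
    """Load one submission's results.json and return resolved / 500."""
    with results_path.open(encoding="utf-8") as f:
        results = json.load(f)
    return resolved_count(results) / VERIFIED_SIZE
```

Called on evaluation/verified/<dir>/results/results.json, a submission with 250 resolved instances would come out at 0.5.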

The submission's tags.model field is fuzzy-matched against an OpenRouter model id (cache-aware, with manual overrides in config/aliases.json). When a match exists, we look up the model's current $/token from OpenRouter's public models endpoint.
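The matching step might look roughly like the sketch below. It assumes config/aliases.json is a flat JSON object mapping a submission's model tag to an OpenRouter id, and it swaps in Python's standard `difflib.get_close_matches` as the fuzzy matcher; the site's actual matcher and its caching are not described, so caching is omitted. All function names are hypothetical.

```python
from __future__ import annotations

import difflib
import json
from pathlib import Path


def load_aliases(path: Path) -> dict[str, str]:
    """Load manual overrides (assumed format: {"model tag": "openrouter/id"})."""
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def normalize(tag: str) -> str:
    """Lowercase and unify separators so 'Claude 3.5 Sonnet' ~ 'claude-3.5-sonnet'."""
    return tag.strip().lower().replace(" ", "-").replace("_", "-")


def match_model(tag: str, openrouter_ids: list[str],
                aliases: dict[str, str], cutoff: float = 0.6) -> str | None:
    """Return the best-matching OpenRouter id, or None if nothing is close enough."""
    if tag in aliases:  # a manual override always wins
        return aliases[tag]
    # Compare against the name after the provider prefix ("openai/gpt-4o" -> "gpt-4o").
    # If two providers share a name, the later id in the list wins.
    by_name = {oid.split("/", 1)[-1].lower(): oid for oid in openrouter_ids}
    hits = difflib.get_close_matches(normalize(tag), list(by_name), n=1, cutoff=cutoff)
    return by_name[hits[0]] if hits else None
```

Checking the alias table first keeps known-bad fuzzy matches fixable without touching the matching threshold.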

The cost per attempt assumes a fixed per-attempt token budget: by default 250,000 input tokens and 30,000 output tokens, roughly the average reported in published agent traces from SWE-bench Verified runs. Both knobs can be adjusted at the top of the leaderboard tab, and the table re-renders with your numbers.

Then, with $/in and $/out as USD per token (the tables display them per million tokens):

cost_per_attempt = in_tok × $/in + out_tok × $/out
cost_per_solve   = cost_per_attempt × 500 / resolved
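A worked example of the two formulas, using hypothetical prices of $3 / Mtok in and $15 / Mtok out (not any specific model's rate) and a hypothetical 300 of 500 instances resolved. Since 500 / resolved is the reciprocal of the resolve rate, $/solve is the per-attempt cost divided by the chance that an attempt succeeds.

```python
from __future__ import annotations


def cost_per_attempt(in_tok: int, out_tok: int,
                     usd_per_in_tok: float, usd_per_out_tok: float) -> float:
    """in_tok x $/in + out_tok x $/out, with prices in USD per token."""
    return in_tok * usd_per_in_tok + out_tok * usd_per_out_tok


def cost_per_solve(per_attempt: float, resolved: int, total: int = 500) -> float | None:
    """per_attempt x 500 / resolved; None when nothing was resolved."""
    if resolved <= 0:
        return None
    return per_attempt * total / resolved


# Hypothetical prices: $3 / Mtok in, $15 / Mtok out, converted to $/token.
attempt = cost_per_attempt(250_000, 30_000, 3 / 1e6, 15 / 1e6)
solve = cost_per_solve(attempt, resolved=300)
print(f"{attempt:.2f} {solve:.2f}")  # 1.20 2.00
```

So 0.75 + 0.45 = $1.20 per attempt, and at a 60% resolve rate that becomes $1.20 / 0.6 = $2.00 per solve.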

Lower is better. Submissions that resolved zero instances or whose model couldn't be matched to a current OpenRouter price are listed at the bottom of the table.
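One way to get that ordering, sketched with a hypothetical `Row` type whose `cost_per_solve` is None when the submission resolved nothing or had no price match:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Row:
    submission: str
    cost_per_solve: float | None  # None: zero resolved or no price match


def rank(rows: list[Row]) -> list[Row]:
    """Sort ascending by $/solve (lower is better); rows without a value go last."""
    # Tuple key: False (has a value) sorts before True (missing), then by cost.
    return sorted(rows, key=lambda r: (r.cost_per_solve is None,
                                       r.cost_per_solve or 0.0))
```

Because Python's sort is stable, the unranked rows at the bottom keep their incoming order.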

Data sources

  • SWE-bench scores: github.com/SWE-bench/experiments (refreshed every 6h)
  • API pricing: openrouter.ai/api/v1/models (refreshed every 2h)
  • Daily $/solve snapshot for trend lines
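A rough sketch of the pricing pull. It assumes the endpoint returns a JSON object whose "data" list holds models with an "id" and a "pricing" object carrying "prompt" and "completion" prices as USD-per-token strings; verify those field names against OpenRouter's API docs. The helper names are hypothetical, and the refresh scheduling is not shown.

```python
from __future__ import annotations

import json
import urllib.request

MODELS_URL = "https://openrouter.ai/api/v1/models"


def fetch_models(url: str = MODELS_URL) -> dict:
    """Download the public models list (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def price_table(payload: dict) -> dict[str, tuple[float, float]]:
    """Map model id -> ($/token in, $/token out), skipping unusable entries."""
    table: dict[str, tuple[float, float]] = {}
    for model in payload.get("data", []):
        pricing = model.get("pricing") or {}
        try:
            p_in = float(pricing["prompt"])
            p_out = float(pricing["completion"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric price
        if p_in < 0 or p_out < 0:
            continue  # treat negative placeholder prices as unpriced
        table[model["id"]] = (p_in, p_out)
    return table


def per_mtok(usd_per_token: float) -> float:
    """Convert $/token to the $/Mtok shown in the tables."""
    return usd_per_token * 1_000_000
```

Parsing is kept separate from fetching so the price logic can be checked against a saved payload without a network call.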

What this is not

It's an estimate, not a per-call meter. Real costs depend on prompt design, scaffold, cache-hit rate, retry policy, and the prevailing OpenRouter route. The formula is exposed so anyone can plug their own numbers in.