solve-tax
The live $-per-SWE-bench-solve leaderboard. Real benchmark scores joined with current API pricing — sorted by the dollar cost to ship one resolved ticket.
What this is
When buyers shop for a coding agent they usually see two separate numbers:
- A resolve rate on a benchmark like SWE-bench Verified — e.g. "Claude Opus 4.5 + live-SWE-agent resolves 79.2% of 500 instances".
- An API price — e.g. "$3 / Mtok input, $15 / Mtok output".
Neither answers the buyer's real question: "how much do I burn per ticket if I throw 1,000 of them at this agent?"
solve-tax joins the two halves into a single, sortable, live leaderboard. Lower $/solve = better.
Data sources (all live, all public)
| Source | URL | Refresh |
|---|---|---|
| SWE-bench Verified submissions | github.com/SWE-bench/experiments/evaluation/verified (metadata.yaml + results/results.json per submission) | every 6 hours |
| API pricing | openrouter.ai/api/v1/models (~300 models, raw $/token) | every 2 hours |
| Daily $/solve snapshot | local DB rollup | once a day |
No hardcoded scores. No mock pricing. No Math.random() jitter. If a submission has no results.json yet, it shows up with 0 resolved and is parked at the bottom of the table until SWE-bench publishes its score.
Formula
cost_per_attempt = in_tok × $/in + out_tok × $/out # $ / single try
cost_per_solve = cost_per_attempt × 500 / resolved # 500 = SWE-bench Verified instance count
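The factor 500 / resolved spreads the cost of all 500 attempts, failed ones included, over the instances that were actually resolved. Token budget defaults to 250,000 input + 30,000 output tokens per attempt, the rough average observed across published agent traces on SWE-bench Verified. Both knobs are user-adjustable at the top of the page and via ?in_tok= / ?out_tok= on the API.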
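As a worked example, the formula can be sketched in a few lines of JavaScript. The function names are hypothetical and not taken from the solve-tax codebase. The inputs pair the default token budget with the $3 / $15 per Mtok price and the 79.2% resolve rate (396 of 500) quoted earlier; pairing that price with that score is an assumption made purely for illustration.

```javascript
// Hypothetical helper names; only the formula itself comes from this README.
const VERIFIED_INSTANCES = 500; // SWE-bench Verified instance count

// Prices are passed in $ per million tokens (Mtok) and converted to $ per token.
function costPerAttempt({
  inTok = 250_000, // default input budget per attempt
  outTok = 30_000, // default output budget per attempt
  inPricePerMtok,
  outPricePerMtok,
}) {
  return (inTok * inPricePerMtok + outTok * outPricePerMtok) / 1_000_000;
}

// Total spend across all instances, divided by the number resolved.
function costPerSolve(attemptCost, resolved) {
  return (attemptCost * VERIFIED_INSTANCES) / resolved;
}

// Illustrative inputs: $3 in / $15 out per Mtok, 396 of 500 resolved (79.2%).
const attempt = costPerAttempt({ inPricePerMtok: 3, outPricePerMtok: 15 });
const solve = costPerSolve(attempt, 396);

console.log(attempt.toFixed(2)); // 1.20
console.log(solve.toFixed(2)); // 1.52
```

Under these assumptions a single attempt costs about $1.20, and each resolved ticket costs about $1.52 once the failed attempts are paid for.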
Endpoints
All endpoints are public (no auth) and mount under BASE_PATH=/solve-tax.
GET /solve-tax/health
GET /solve-tax/api/leaderboard?in_tok=&out_tok=&bash_only=&os_model=&min_resolve=&org=&limit=
GET /solve-tax/api/submissions
GET /solve-tax/api/submissions/:dir
GET /solve-tax/api/models
GET /solve-tax/api/models/:id
GET /solve-tax/api/repos
GET /solve-tax/api/repos/:repo
GET /solve-tax/api/biggest-drops
GET /solve-tax/api/recompute?in_tok=&out_tok=
GET /solve-tax/api/refresh-log?limit=
GET /solve-tax/api/stats
GET /solve-tax/api/refresh/:source # source: swebench | openrouter | rematch | snapshot
GET /solve-tax/ # SPA
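A minimal client sketch for the leaderboard endpoint, assuming the server is reachable at http://localhost:4841 with BASE_PATH=/solve-tax. The helper names are hypothetical. The response shape is not documented in this README, so the sketch returns the parsed JSON untouched. Only the numeric in_tok, out_tok and limit parameters are used, because the accepted value formats for the other filters are not specified here.

```javascript
// Assumed base URL: the port and BASE_PATH of a local run.
const BASE_URL = "http://localhost:4841/solve-tax";

// Build a leaderboard URL, skipping parameters that were not supplied.
function leaderboardUrl(params = {}, baseUrl = BASE_URL) {
  const url = new URL(`${baseUrl}/api/leaderboard`);
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null && value !== "") {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

// Fetch the leaderboard with the global fetch available in Node 18+.
// The JSON is returned as-is because its shape is not documented here.
async function fetchLeaderboard(params) {
  const res = await fetch(leaderboardUrl(params));
  if (!res.ok) {
    throw new Error(`leaderboard request failed: HTTP ${res.status}`);
  }
  return res.json();
}

console.log(leaderboardUrl({ in_tok: 200000, out_tok: 20000, limit: 10 }));
// http://localhost:4841/solve-tax/api/leaderboard?in_tok=200000&out_tok=20000&limit=10
```

With the server running, `fetchLeaderboard({ limit: 10 }).then(console.log)` would print whatever the endpoint returns.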
Running locally
npm install
PORT=4841 BASE_PATH=/solve-tax node server.js
# then open http://localhost:4841/solve-tax/
First boot fetches pricing immediately and kicks off a SWE-bench sweep in the background; the leaderboard fills in within a few minutes. Set SKIP_BOOT_SWEEP=1 to disable that behaviour while developing offline.
Stack
- Node 20 + Express 4
- better-sqlite3 with WAL mode (single solve-tax.db file)
- node-cron for the periodic refreshes
- helmet, compression, cors
- js-yaml for parsing SWE-bench metadata.yaml
- Vanilla JS SPA, dark theme, English labels
Caveats
- The token budget is a modelled estimate, not a per-call meter. Real costs swing with prompt design, scaffold, cache-hit rate, and retries.
- Model matching is fuzzy. When confidence is low, the row still appears but the matched-model column shows the rationale so you can sanity-check it. Manual overrides live in config/aliases.json.
- Submissions whose results.json isn't published yet are kept around with 0 resolved so the dashboard still shows them, but they sort to the bottom.
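One way to get the "0 resolved sorts to the bottom" behaviour is to give such rows an infinite $/solve before sorting. The sketch below illustrates that idea and is not necessarily how solve-tax implements it. The row shape, the function name and the sample numbers are all made up for the example.

```javascript
// Hypothetical row shape: { name, resolved, costPerAttempt }.
function rankBySolveCost(rows, instances = 500) {
  return rows
    .map((row) => ({
      ...row,
      // No published score yet means no finite $/solve, so use Infinity.
      costPerSolve:
        row.resolved > 0
          ? (row.costPerAttempt * instances) / row.resolved
          : Infinity,
    }))
    .sort((a, b) => {
      // Compare explicitly: Infinity - Infinity would be NaN.
      if (a.costPerSolve === b.costPerSolve) return 0;
      return a.costPerSolve < b.costPerSolve ? -1 : 1;
    });
}

// Made-up sample rows, for illustration only.
const ranked = rankBySolveCost([
  { name: "pending", resolved: 0, costPerAttempt: 0.5 },
  { name: "a", resolved: 400, costPerAttempt: 1.2 },
  { name: "b", resolved: 250, costPerAttempt: 0.4 },
]);

console.log(ranked.map((r) => r.name)); // [ 'b', 'a', 'pending' ]
```

Here "b" costs about $0.80 per solve and "a" about $1.50, while "pending" stays visible but lands last regardless of its per-attempt price.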
License
MIT