solve-tax

The live $-per-SWE-bench-solve leaderboard. Real benchmark scores joined with current API pricing, sorted by the dollar cost to ship one resolved ticket.

What this is

When buyers shop for a coding agent, they usually see two separate numbers:

A resolve rate on a benchmark like SWE-bench Verified — e.g. "Claude Opus 4.5 + live-SWE-agent resolves 79.2% of 500 instances".
An API price — e.g. "$3 / Mtok input, $15 / Mtok output".

Neither answers the buyer's real question: "how much do I burn per ticket if I throw 1,000 of them at this agent?"

solve-tax joins the two halves into a single, sortable, live leaderboard. Lower $/solve = better.

Data sources (all live, all public)

| Source | URL | Refresh |
|---|---|---|
| SWE-bench Verified submissions | github.com/SWE-bench/experiments/evaluation/verified (metadata.yaml + results/results.json per submission) | every 6 hours |
| API pricing | openrouter.ai/api/v1/models (~300 models, raw $/token) | every 2 hours |
| Daily $/solve snapshot | local DB rollup | once a day |

No hardcoded scores. No mock pricing. No Math.random() jitter. If a submission has no results.json yet, it shows up with 0 resolved and is parked at the bottom of the table until SWE-bench publishes its score.
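
As a rough illustration of the pricing half of the join, the sketch below turns a models payload into a per-token price map. It assumes each entry carries `id` plus `pricing.prompt` and `pricing.completion` as decimal strings in dollars per token; the function name `parsePricing`, the `inPrice` / `outPrice` fields, and the sample payload are hypothetical and not taken from solve-tax itself.

```javascript
// Sketch only: assumes each entry looks like
// { id, pricing: { prompt, completion } } with prices as decimal
// strings in dollars per token.
function parsePricing(payload) {
  const prices = new Map();
  for (const model of payload.data ?? []) {
    const inPrice = Number(model.pricing?.prompt);
    const outPrice = Number(model.pricing?.completion);
    // Skip entries with missing or non-numeric prices rather than guessing.
    if (!Number.isFinite(inPrice) || !Number.isFinite(outPrice)) continue;
    prices.set(model.id, { inPrice, outPrice });
  }
  return prices;
}

// Hypothetical sample payload, not real pricing data.
const samplePayload = {
  data: [
    { id: "example/model-a", pricing: { prompt: "0.000003", completion: "0.000015" } },
    { id: "example/model-b", pricing: {} }, // no prices published
  ],
};

const prices = parsePricing(samplePayload);
console.log(prices.size);                   // 1
console.log(prices.has("example/model-b")); // false
```

Dropping entries without usable prices, instead of defaulting them to zero, keeps a missing price from looking like a free model.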

Formula

cost_per_attempt = in_tok × $/in + out_tok × $/out          # $ / single try
cost_per_solve   = cost_per_attempt × 500 / resolved        # 500 = SWE-bench Verified instance count

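The token budget defaults to 250,000 input + 30,000 output tokens per attempt, the rough average observed across published agent traces on SWE-bench Verified. Both knobs are adjustable at the top of the page and via the ?in_tok= and ?out_tok= query parameters on the API. $/in and $/out are the raw per-token prices from OpenRouter, so a price quoted per Mtok must be divided by 1,000,000 first.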
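
A worked example of the formula, as a minimal sketch. The inputs are illustrative only: the default token budget, the $3 / $15 per Mtok example price, and 396 of 500 resolved (79.2%). They are not a real leaderboard row. The function names are hypothetical, and returning Infinity for zero resolved instances is an assumption of this sketch, not confirmed behaviour.

```javascript
const VERIFIED_INSTANCES = 500; // SWE-bench Verified instance count

// Prices are in dollars per token.
function costPerAttempt(inTok, outTok, inPrice, outPrice) {
  return inTok * inPrice + outTok * outPrice;
}

// Assumption: an unscored submission costs "infinitely" much per solve,
// which avoids a division by zero.
function costPerSolve(attemptCost, resolved) {
  if (resolved <= 0) return Infinity;
  return (attemptCost * VERIFIED_INSTANCES) / resolved;
}

// $3 / Mtok input and $15 / Mtok output, converted to $ / token.
const attempt = costPerAttempt(250_000, 30_000, 3 / 1e6, 15 / 1e6);
const solve = costPerSolve(attempt, 396);

console.log(attempt.toFixed(2)); // 1.20
console.log(solve.toFixed(2));   // 1.52
```

Each attempt costs $0.75 for input plus $0.45 for output. Running all 500 instances costs $600, and spreading that over 396 resolved instances gives about $1.52 per solve.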

Endpoints

All endpoints are public (no auth) and are mounted under BASE_PATH, shown here as /solve-tax.

GET /solve-tax/health
GET /solve-tax/api/leaderboard?in_tok=&out_tok=&bash_only=&os_model=&min_resolve=&org=&limit=
GET /solve-tax/api/submissions
GET /solve-tax/api/submissions/:dir
GET /solve-tax/api/models
GET /solve-tax/api/models/:id
GET /solve-tax/api/repos
GET /solve-tax/api/repos/:repo
GET /solve-tax/api/biggest-drops
GET /solve-tax/api/recompute?in_tok=&out_tok=
GET /solve-tax/api/refresh-log?limit=
GET /solve-tax/api/stats
GET /solve-tax/api/refresh/:source         # source: swebench | openrouter | rematch | snapshot
GET /solve-tax/                            # SPA
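
To show how the leaderboard query parameters compose, here is a small sketch of a URL builder. The helper `leaderboardUrl` is hypothetical, the origin assumes a local instance on port 4841, and the parameter values are arbitrary examples. The response shape is not documented in this README, so the sketch stops at the URL.

```javascript
// Builds a leaderboard URL; unset filters are left out of the query string.
function leaderboardUrl(origin, params = {}) {
  const url = new URL("/solve-tax/api/leaderboard", origin);
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null && value !== "") {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

const exampleUrl = leaderboardUrl("http://localhost:4841", {
  in_tok: 200000,
  out_tok: 20000,
  limit: 10,
});
console.log(exampleUrl);
// http://localhost:4841/solve-tax/api/leaderboard?in_tok=200000&out_tok=20000&limit=10

// Fetching is left to the caller, for example:
// const body = await (await fetch(exampleUrl)).json();
```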

Running locally

npm install
PORT=4841 BASE_PATH=/solve-tax node server.js
# then open http://localhost:4841/solve-tax/

First boot fetches pricing immediately and kicks off a SWE-bench sweep in the background; the leaderboard fills in within a few minutes. Set SKIP_BOOT_SWEEP=1 to disable that behaviour while developing offline.
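
The three environment variables mentioned in this section could be read along these lines. This is a sketch, not the contents of server.js: `readConfig` is a hypothetical helper, and the fallback values are placeholders, not documented defaults.

```javascript
// Sketch: reads PORT, BASE_PATH and SKIP_BOOT_SWEEP from an env object.
// Fallback values are placeholders chosen for this example.
function readConfig(env = process.env) {
  return {
    port: Number(env.PORT) || 4841,
    // Strip trailing slashes so routes can be joined as basePath + "/api/...".
    basePath: (env.BASE_PATH || "/solve-tax").replace(/\/+$/, ""),
    // Only the exact string "1" disables the boot sweep in this sketch.
    skipBootSweep: env.SKIP_BOOT_SWEEP === "1",
  };
}

const config = readConfig({
  PORT: "4841",
  BASE_PATH: "/solve-tax/",
  SKIP_BOOT_SWEEP: "1",
});
console.log(config);
// { port: 4841, basePath: '/solve-tax', skipBootSweep: true }
```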

Stack

Node 20 + Express 4
better-sqlite3 with WAL mode (single solve-tax.db file)
node-cron for the periodic refreshes
helmet, compression, cors
js-yaml for parsing SWE-bench metadata.yaml
Vanilla JS SPA, dark theme, English labels

Caveats

The token budget is a modelled estimate, not a per-call meter. Real costs swing with prompt design, scaffold, cache-hit rate, and retries.
Model matching is fuzzy: each SWE-bench submission has to be paired with an OpenRouter model before it can be priced. When confidence is low, the row still appears, but the matched-model column shows the rationale so you can sanity-check it. Manual overrides live in config/aliases.json.
Submissions whose results.json isn't published yet stay on the dashboard with 0 resolved and sort to the bottom.
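
The ordering described in the last caveat can be sketched as a comparator that ranks scored rows by $/solve and parks unscored rows at the end. The row shape (`name`, `resolved`, `costPerSolve`) and the function `rankRows` are hypothetical, and the sample rows are made up for illustration.

```javascript
// Hypothetical row shape: { name, resolved, costPerSolve }.
function rankRows(rows) {
  // Copy first so the caller's array is not mutated.
  return [...rows].sort((a, b) => {
    const aScored = a.resolved > 0;
    const bScored = b.resolved > 0;
    if (aScored !== bScored) return aScored ? -1 : 1; // unscored rows go last
    if (!aScored) return 0;                            // keep unscored rows in input order
    return a.costPerSolve - b.costPerSolve;            // cheaper solves first
  });
}

// Made-up rows, not real leaderboard data.
const ranked = rankRows([
  { name: "pending", resolved: 0, costPerSolve: null },
  { name: "agent-b", resolved: 300, costPerSolve: 2.0 },
  { name: "agent-a", resolved: 396, costPerSolve: 1.5 },
]);
console.log(ranked.map((row) => row.name)); // [ 'agent-a', 'agent-b', 'pending' ]
```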

License

MIT