judge-floor

Live cost + adoption leaderboard for LLM-as-judge models used in automated eval pipelines.

Refreshed on a fixed schedule from four public sources:

| Source | Endpoint | Frequency |
|---|---|---|
| OpenRouter | https://openrouter.ai/api/v1/models | every 15 min |
| HuggingFace | https://huggingface.co/api/models/{id} | hourly |
| GitHub Releases | https://api.github.com/repos/{owner}/{repo}/releases | daily 04:17 UTC |
| arXiv | https://export.arxiv.org/api/query?... | every 6 hours |

No mock data. No seeds. No Math.random(). Every live metric on the leaderboard is the latest value pulled from one of the four endpoints above. The one exception is hosted-judge-API per-call pricing, which is hand-versioned in lib/catalog.js; the UI shows a pricing_last_verified_at timestamp so users know it is a curated number, not a live one.
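As a rough illustration of the "latest value pulled" rule, the sketch below turns an OpenRouter models payload into snapshot rows. It assumes the response has the shape `{ data: [{ id, pricing: { prompt, completion } }] }` with prices given as USD-per-token strings. The function name and row field names are hypothetical and not taken from this repo.

```javascript
// Sketch only: normalize an OpenRouter /api/v1/models payload into snapshot rows.
// Assumed response shape: { data: [{ id, pricing: { prompt, completion } }] },
// with prices as USD-per-token strings. Row field names are hypothetical.
function toSnapshotRows(payload, fetchedAt = new Date().toISOString()) {
  const models = Array.isArray(payload?.data) ? payload.data : [];
  return models
    .map((m) => ({
      model_id: m.id,
      prompt_usd_per_token: Number.parseFloat(m.pricing?.prompt),
      completion_usd_per_token: Number.parseFloat(m.pricing?.completion),
      fetched_at: fetchedAt,
    }))
    // Drop rows with missing or unparseable prices instead of inventing values.
    .filter(
      (r) =>
        typeof r.model_id === "string" &&
        Number.isFinite(r.prompt_usd_per_token) &&
        Number.isFinite(r.completion_usd_per_token)
    );
}
```

Dropping unparseable rows, rather than defaulting them to zero, keeps a missing price from ranking as the cheapest judge.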

Live URL (after deploy): https://holyai.me/judge-floor/

Why this exists (May 2026)

Three releases in May 2026 made choosing an LLM-as-judge model a real, unsolved problem:

Coding Agent Index (Artificial Analysis, May 2026) — first public benchmark scoring full agent stacks (model + harness); judge-driven grading pipeline.
Claude Code /goal command (May 2026) — autonomous completion with explicit validator-model role.
SkillsBench (May 2026) — judge-scored skill-augmentation eval, ~6 judge calls per skill graded.

Pick the wrong judge and your eval pipeline either costs 100× more than it should (GPT-5 for 50-token verdicts) or disagrees with humans 30% of the time (cheap generalists on rubric tasks). judge-floor surfaces the cost / adoption / freshness trade-offs side by side.

Stack

Node.js 20 (ESM)
Express 4
better-sqlite3 (WAL mode)
node-cron
helmet + compression
Vanilla JS SPA, dark theme, single file per asset, no build step

Run locally

npm install
cp .env.example .env   # optional — anonymous works for all four sources
npm start
# open http://localhost:4889/judge-floor/

The server boots immediately and binds the port; the first cron pass for all four sources runs in parallel 5 seconds after boot, so the leaderboard populates within ~30 seconds.
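The boot behaviour described above could be sketched roughly as follows. This is not the repo's actual scheduler: `runFirstPass`, `scheduleFirstPass`, and the `fetchers` map are hypothetical names, and plain `setTimeout` stands in for whatever node-cron wiring the project uses.

```javascript
// Sketch only: run every source's first fetch in parallel shortly after boot.
// `fetchers` maps a source name to an async function; all names are hypothetical.
async function runFirstPass(fetchers) {
  const names = Object.keys(fetchers);
  // allSettled lets one failing source finish without cancelling the others.
  const results = await Promise.allSettled(names.map((n) => fetchers[n]()));
  return Object.fromEntries(
    names.map((n, i) => [n, results[i].status === "fulfilled" ? "ok" : "failed"])
  );
}

function scheduleFirstPass(fetchers, delayMs = 5000) {
  // The HTTP server is assumed to be listening already; this only defers the first pull.
  return new Promise((resolve) => {
    setTimeout(() => resolve(runFirstPass(fetchers)), delayMs);
  });
}
```

Using `Promise.allSettled` rather than `Promise.all` means a slow or failing source does not stop the other three from populating the leaderboard.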

Endpoints

All routes are mounted under /judge-floor. Every endpoint is public — no Basic Auth, no API keys, no admin pages.

| Method | Path | Purpose |
|---|---|---|
| GET | /judge-floor/ | SPA index |
| GET | /judge-floor/health | {ok:true, time, product, port} — orchestrator smoke check |
| GET | /judge-floor/model/:slug | Per-model detail page (same SPA shell) |
| GET | /judge-floor/digest/:week_iso | Shareable weekly digest card (OG meta tags) |
| GET | /judge-floor/sources | Data-source transparency page |
| GET | /judge-floor/api/health | Source-by-source freshness JSON |
| GET | /judge-floor/api/judges?judge_prompt_tokens=1500&judgement_tokens=50&judgements_per_run=20&runs_per_day=100 | Leaderboard JSON, sorted by $ per judgement ascending |
| GET | /judge-floor/api/simulate?... | Same cost simulator as /api/judges, limited to rows that have prices, sorted |
| GET | /judge-floor/api/judge/:slug | Per-model detail with price/adoption history + releases |
| GET | /judge-floor/api/movers?window=7d&type=price\|adoption | Top 7-day gainers/losers |
| GET | /judge-floor/api/arxiv?days=30 | Recent judge-model papers |
| GET | /judge-floor/api/releases | Recent GitHub releases across all tracked repos |
| GET | /judge-floor/api/digest/:week_iso | Digest JSON payload |
| GET | /judge-floor/api/sources | Source-level fetch stats |

All responses set cache-control: max-age=60.
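The cost-simulator parameters on /api/judges suggest arithmetic along the following lines. This is a minimal sketch that assumes per-token USD prices for prompt and completion. The function names and the row fields (`slug`, `prompt_usd_per_token`, `completion_usd_per_token`) are hypothetical, and the real implementation may differ.

```javascript
// Sketch only: cost arithmetic implied by the /api/judges query parameters.
// Price fields are assumed to be USD per token; field names are hypothetical.
function simulateCost(price, params) {
  const { judge_prompt_tokens, judgement_tokens, judgements_per_run, runs_per_day } = params;
  const perJudgement =
    judge_prompt_tokens * price.prompt_usd_per_token +
    judgement_tokens * price.completion_usd_per_token;
  const perRun = perJudgement * judgements_per_run;
  const perDay = perRun * runs_per_day;
  return { perJudgement, perRun, perDay };
}

// Rank priced rows by $ per judgement, cheapest first.
function rankByCost(rows, params) {
  return rows
    .filter(
      (r) =>
        Number.isFinite(r.prompt_usd_per_token) &&
        Number.isFinite(r.completion_usd_per_token)
    )
    .map((r) => ({ slug: r.slug, ...simulateCost(r, params) }))
    .sort((a, b) => a.perJudgement - b.perJudgement);
}
```

With the parameters from the example URL (1500 prompt tokens, 50 judgement tokens, 20 judgements per run, 100 runs per day) and a hypothetical price of $3 per million prompt tokens and $15 per million completion tokens, this works out to about $0.00525 per judgement and roughly $10.50 per day.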

How we pick judges

The catalog (35 entries, in lib/catalog.js) covers three classes:

Frontier generalist (15) — appears on OpenRouter, supports text→text, current major flagship. We snapshot OpenRouter prices for these every 15 minutes.
Specialist OSS (15) — weight artifact published on HuggingFace AND positioned by its authors as an automated evaluator / reward model / critic / judge. We snapshot HuggingFace download/like counts hourly. If there's a GitHub repo, we poll releases daily.
Hosted judge API (5) — commercial product exposing a "judge" endpoint with public pricing. Pricing is hand-versioned and the UI shows the pricing_last_verified_at timestamp.

Inclusion criteria are documented in the source code.

What we don't do

We do not re-run JudgeBench / Judge Arena evaluations ourselves. We surface the public catalog and let the user click through to the original leaderboard for the latest quality scores.
No alerting or email digests. The Monday digest is a static page; users RSS-subscribe by hitting /judge-floor/digest/<week_iso>.
No accounts or saved profiles. The cost simulator is pure URL state.

Data integrity

Every live data point on the public site can be traced back to a real fetch:

Frontier price → openrouter_snapshots row from the last 15-minute cron pull.
Specialist adoption → hf_snapshots row from the last hourly cron pull.
Recent releases → gh_releases row from the last daily poll.
Recent papers → arxiv_papers row from the last 6-hourly query.

Fetch transparency is exposed at /judge-floor/sources — last OK timestamp, 24h success/fail counts, and average response time per source.
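The per-source stats listed above could be derived from fetch_log rows roughly as follows. The row shape (`source`, `ok`, `duration_ms`, `fetched_at`) is an assumption, as is averaging response time over the same 24-hour window. Neither is confirmed by this README.

```javascript
// Sketch only: per-source transparency stats from fetch_log rows.
// Hypothetical row shape: { source, ok, duration_ms, fetched_at (ISO string) }.
function summarizeFetchLog(rows, now = Date.now()) {
  const dayAgo = now - 24 * 60 * 60 * 1000;
  const stats = {};
  for (const row of rows) {
    const s = (stats[row.source] ??= {
      last_ok_at: null, ok_24h: 0, fail_24h: 0, total_ms: 0, n: 0,
    });
    const t = Date.parse(row.fetched_at);
    // Track the most recent successful fetch regardless of the 24h window.
    if (row.ok && (s.last_ok_at === null || t > Date.parse(s.last_ok_at))) {
      s.last_ok_at = row.fetched_at;
    }
    if (t >= dayAgo) {
      if (row.ok) s.ok_24h += 1;
      else s.fail_24h += 1;
      s.total_ms += row.duration_ms;
      s.n += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(stats).map(([source, s]) => [
      source,
      {
        last_ok_at: s.last_ok_at,
        ok_24h: s.ok_24h,
        fail_24h: s.fail_24h,
        avg_ms: s.n > 0 ? s.total_ms / s.n : null,
      },
    ])
  );
}
```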

Retention

openrouter_snapshots — 60 days
hf_snapshots — 90 days
fetch_log — 14 days
gh_releases, arxiv_papers, digests — kept forever

Pruned by a daily cron at 03:41 UTC.
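One way to derive the prune cutoffs from the retention windows above is sketched here. `RETENTION_DAYS` and `retentionCutoffs` are hypothetical names, and the actual prune job may compute its cutoffs differently, for example inside SQL.

```javascript
// Sketch only: retention windows from the list above, in days.
// Tables kept forever (gh_releases, arxiv_papers, digests) are deliberately absent.
const RETENTION_DAYS = {
  openrouter_snapshots: 60,
  hf_snapshots: 90,
  fetch_log: 14,
};

// Returns, per table, the ISO timestamp before which rows are eligible for pruning.
function retentionCutoffs(now = new Date()) {
  const cutoffs = {};
  for (const [table, days] of Object.entries(RETENTION_DAYS)) {
    cutoffs[table] = new Date(now.getTime() - days * 86_400_000).toISOString();
  }
  return cutoffs;
}
```

Each cutoff would then feed a parameterized DELETE against that table's timestamp column. The column name is not given in this README.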

License

MIT.

Author

Cowork (Claude Opus 4.7) · Holy AI · May 28 2026