judge-floor
Live cost + adoption leaderboard for LLM-as-judge models used in automated eval pipelines.
Refreshed continuously from four public sources:
| Source | Endpoint | Frequency |
|---|---|---|
| OpenRouter | https://openrouter.ai/api/v1/models | every 15 min |
| HuggingFace | https://huggingface.co/api/models/{id} | hourly |
| GitHub Releases | https://api.github.com/repos/{owner}/{repo}/releases | daily 04:17 UTC |
| arxiv | https://export.arxiv.org/api/query?... | every 6 hours |
No mock data. No seeds. No Math.random(). Every live metric on the leaderboard is the latest value pulled from one of the four endpoints above. The one exception is hosted-judge-API per-call pricing: those prices are hand-versioned in lib/catalog.js, and the UI shows a pricing_last_verified_at timestamp so users know each one is a curated number, not a live one.
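The refresh cadence in the table above maps naturally onto cron expressions. The following is a minimal sketch, not the project's actual configuration: the `SCHEDULES` name and the minute offsets for the hourly and 6-hourly jobs are assumptions, and only the 04:17 UTC daily slot is stated in this README.

```javascript
// Hypothetical schedule table: source name -> five-field cron expression (UTC assumed).
// Only the 04:17 daily slot comes from the README; other minute offsets are guesses.
const SCHEDULES = {
  openrouter: "*/15 * * * *", // every 15 minutes
  huggingface: "0 * * * *",   // hourly, on the hour (assumed offset)
  github: "17 4 * * *",       // daily at 04:17
  arxiv: "0 */6 * * *",       // every 6 hours (assumed offset)
};

// Minimal sanity check: a standard cron expression has exactly five fields.
function isFiveFieldCron(expression) {
  return expression.trim().split(/\s+/).length === 5;
}

const allValid = Object.values(SCHEDULES).every(isFiveFieldCron);
console.log(allValid);
```

With node-cron, each expression would typically be registered via `cron.schedule(expression, task, { timezone: "UTC" })`, so the daily job fires at 04:17 UTC regardless of the host's local time zone.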
Live URL (after deploy): https://holyai.me/judge-floor/
Why this exists (May 2026)
Three things in May 2026 made choosing an LLM judge a real, unsolved problem:
- Coding Agent Index (Artificial Analysis, May 2026): first public benchmark scoring full agent stacks (model + harness); judge-driven grading pipeline.
- Claude Code /goal command (May 2026): autonomous completion with an explicit validator-model role.
- SkillsBench (May 2026): judge-scored skill-augmentation eval, ~6 judge calls per skill graded.
Pick the wrong judge and your eval pipeline either costs 100× more than it should (GPT-5 for 50-token verdicts) or disagrees with humans 30% of the time (cheap generalists on rubric tasks). judge-floor surfaces the cost / adoption / freshness trade-offs side by side.
Stack
- Node.js 20 (ESM)
- Express 4
- better-sqlite3 (WAL mode)
- node-cron
- helmet + compression
- Vanilla JS SPA, dark theme, single file per asset, no build step
Run locally
npm install
cp .env.example .env # optional — anonymous works for all four sources
npm start
# open http://localhost:4889/judge-floor/
The server boots immediately and binds the port; the first cron pass for all four sources runs in parallel 5 seconds after boot, so the leaderboard populates within ~30 seconds.
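The delayed, parallel first pass described above could be sketched as follows. This is an illustration under stated assumptions, not the project's boot code: `scheduleFirstPass` and the `fetchers` map are hypothetical names, and the use of `Promise.allSettled` (so that one failing source does not block the other three) is a design choice of this sketch.

```javascript
// Run every source fetcher in parallel after a delay.
// `fetchers` is a hypothetical map of source name -> function returning a promise.
function scheduleFirstPass(fetchers, delayMs = 5000) {
  return new Promise((resolve) => {
    setTimeout(async () => {
      const names = Object.keys(fetchers);
      // Wrapping in Promise.resolve().then(...) also captures synchronous throws.
      const settled = await Promise.allSettled(
        names.map((name) => Promise.resolve().then(() => fetchers[name]()))
      );
      resolve(
        names.map((name, i) => ({
          source: name,
          ok: settled[i].status === "fulfilled",
        }))
      );
    }, delayMs);
  });
}

// Usage with stub fetchers and no delay; a real boot would use the 5000 ms default.
const demo = scheduleFirstPass(
  {
    openrouter: async () => "ok",
    arxiv: async () => {
      throw new Error("simulated timeout");
    },
  },
  0
);
demo.then((report) => console.log(report));
```

Because the timer is set after the port is bound, the HTTP server can answer health checks while the first fetches are still in flight.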
Endpoints
All routes are mounted under /judge-floor. Every endpoint is public — no Basic Auth, no API keys, no admin pages.
| Method | Path | Purpose |
|---|---|---|
| GET | /judge-floor/ | SPA index |
| GET | /judge-floor/health | {ok:true, time, product, port} — orchestrator smoke check |
| GET | /judge-floor/model/:slug | Per-model detail page (same SPA shell) |
| GET | /judge-floor/digest/:week_iso | Shareable weekly digest card (OG meta tags) |
| GET | /judge-floor/sources | Data-source transparency page |
| GET | /judge-floor/api/health | Source-by-source freshness JSON |
| GET | /judge-floor/api/judges?judge_prompt_tokens=1500&judgement_tokens=50&judgements_per_run=20&runs_per_day=100 | Leaderboard JSON, sorted by $ per judgement ascending |
| GET | /judge-floor/api/simulate?... | Same cost simulator as /api/judges, only rows with prices, sorted |
| GET | /judge-floor/api/judge/:slug | Per-model detail with price/adoption history + releases |
| GET | /judge-floor/api/movers?window=7d&type=price\|adoption | Top 7-day gainers/losers |
| GET | /judge-floor/api/arxiv?days=30 | Recent judge-model papers |
| GET | /judge-floor/api/releases | Recent GitHub releases across all tracked repos |
| GET | /judge-floor/api/digest/:week_iso | Digest JSON payload |
| GET | /judge-floor/api/sources | Source-level fetch stats |
All responses set cache-control: max-age=60.
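The cost simulation behind the four query parameters is simple arithmetic. A minimal sketch follows, assuming prices arrive as USD-per-token strings under `pricing.prompt` and `pricing.completion` (the shape the OpenRouter models endpoint is understood to return). The function names, model slugs, and prices below are illustrative, not taken from the project.

```javascript
// Cost of one judgement = prompt tokens * prompt price + output tokens * completion price.
// Returns null when a model has no usable price.
function simulateCost(pricing, params) {
  const promptPrice = parseFloat(pricing?.prompt);
  const completionPrice = parseFloat(pricing?.completion);
  if (!Number.isFinite(promptPrice) || !Number.isFinite(completionPrice)) return null;

  const perJudgement =
    params.judge_prompt_tokens * promptPrice +
    params.judgement_tokens * completionPrice;
  const perRun = perJudgement * params.judgements_per_run;
  const perDay = perRun * params.runs_per_day;
  return { perJudgement, perRun, perDay };
}

// Keep only priced rows and sort by $ per judgement, cheapest first.
function rankByCost(models, params) {
  return models
    .map((model) => ({ slug: model.slug, cost: simulateCost(model.pricing, params) }))
    .filter((row) => row.cost !== null)
    .sort((a, b) => a.cost.perJudgement - b.cost.perJudgement);
}

// Same parameters as the example query string in the table above.
const params = {
  judge_prompt_tokens: 1500,
  judgement_tokens: 50,
  judgements_per_run: 20,
  runs_per_day: 100,
};

// Hypothetical models and prices, for illustration only.
const ranked = rankByCost(
  [
    { slug: "model-a", pricing: { prompt: "0.000003", completion: "0.000015" } },
    { slug: "model-b", pricing: { prompt: "0.0000005", completion: "0.0000015" } },
    { slug: "model-c" }, // no price: dropped from the ranking
  ],
  params
);
console.log(ranked.map((row) => row.slug));
```

For model-a this works out to 1500 × $0.000003 + 50 × $0.000015 = $0.00525 per judgement, or about $10.50 per day at 20 judgements per run and 100 runs per day.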
How we pick judges
The catalog (35 entries, in lib/catalog.js) covers three classes:
- Frontier generalist (15): appears on OpenRouter, supports text→text, and is a current major flagship. We snapshot OpenRouter prices for these every 15 minutes.
- Specialist OSS (15): weight artifact published on HuggingFace AND positioned by its authors as an automated evaluator / reward model / critic / judge. We snapshot HuggingFace download/like counts hourly. If there's a GitHub repo, we poll releases daily.
- Hosted judge API (5): commercial product exposing a "judge" endpoint with public pricing. Pricing is hand-versioned and the UI shows the pricing_last_verified_at timestamp.
Inclusion criteria are documented in the source.
What we don't do
- We do not re-run JudgeBench / Judge Arena evaluations ourselves. We surface the public catalogue and let the user click through to the original leaderboard for the latest quality scores.
- No alerting or email digests. The Monday digest is a static page; users RSS-subscribe by hitting /digest/<week>.
- No accounts or saved profiles. The cost simulator is pure URL state.
Data integrity
Every data point on the public site can be traced back to a real fetch:
- Frontier price → openrouter_snapshots row from the last 15-minute cron pull.
- Specialist adoption → hf_snapshots row from the last hourly cron pull.
- Recent releases → gh_releases row from the last daily poll.
- Recent papers → arxiv_papers row from the last 6-hourly query.
Fetch transparency is exposed at /judge-floor/sources — last OK timestamp, 24h success/fail counts, and average response time per source.
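The per-source stats listed above can be derived from fetch-log rows with a single pass. The sketch below assumes a row shape of `{ source, ok, fetched_at, duration_ms }`, which is hypothetical; the real fetch_log schema is not documented here. It also assumes the average response time is computed over the same 24-hour window as the success/fail counts.

```javascript
// Hypothetical fetch_log row shape (not taken from the project):
//   { source: string, ok: boolean, fetched_at: number (ms since epoch), duration_ms: number }
function summarizeFetchLog(rows, nowMs) {
  const windowStart = nowMs - 24 * 60 * 60 * 1000;
  const bySource = new Map();

  for (const row of rows) {
    if (!bySource.has(row.source)) {
      bySource.set(row.source, {
        source: row.source,
        lastOkAt: null,
        ok24h: 0,
        fail24h: 0,
        totalMs: 0,
        samples: 0,
      });
    }
    const stats = bySource.get(row.source);

    // Last OK timestamp considers all rows, not only the 24h window.
    if (row.ok && (stats.lastOkAt === null || row.fetched_at > stats.lastOkAt)) {
      stats.lastOkAt = row.fetched_at;
    }
    // Counts and average latency cover the trailing 24 hours (an assumption).
    if (row.fetched_at >= windowStart) {
      if (row.ok) stats.ok24h += 1;
      else stats.fail24h += 1;
      stats.totalMs += row.duration_ms;
      stats.samples += 1;
    }
  }

  return [...bySource.values()].map(({ totalMs, samples, ...rest }) => ({
    ...rest,
    avgMs: samples > 0 ? totalMs / samples : null,
  }));
}

// Usage with made-up rows.
const now = Date.UTC(2026, 4, 28, 12, 0, 0);
const MINUTE = 60 * 1000;
const summary = summarizeFetchLog(
  [
    { source: "openrouter", ok: true, fetched_at: now - 10 * MINUTE, duration_ms: 200 },
    { source: "openrouter", ok: false, fetched_at: now - 20 * MINUTE, duration_ms: 400 },
    { source: "openrouter", ok: true, fetched_at: now - 2 * 24 * 60 * MINUTE, duration_ms: 100 },
    { source: "arxiv", ok: false, fetched_at: now - 60 * MINUTE, duration_ms: 900 },
  ],
  now
);
console.log(summary);
```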
Retention
- openrouter_snapshots: 60 days
- hf_snapshots: 90 days
- fetch_log: 14 days
- gh_releases, arxiv_papers, digests: kept forever
Pruned by a daily cron at 03:41 UTC.
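The retention windows above translate into one cutoff timestamp per table. The sketch below computes those cutoffs for a given run time. `RETENTION_DAYS` and `pruneCutoffs` are hypothetical names, and the actual delete statements and timestamp column names are not shown because they are not documented here.

```javascript
// Retention windows from the list above; tables kept forever are simply omitted.
const RETENTION_DAYS = {
  openrouter_snapshots: 60,
  hf_snapshots: 90,
  fetch_log: 14,
};

const DAY_MS = 24 * 60 * 60 * 1000;

// For each table, return the ISO timestamp before which rows are eligible for pruning.
function pruneCutoffs(now) {
  return Object.fromEntries(
    Object.entries(RETENTION_DAYS).map(([table, days]) => [
      table,
      new Date(now.getTime() - days * DAY_MS).toISOString(),
    ])
  );
}

// Example: the 03:41 UTC prune run on 28 May 2026.
const cutoffs = pruneCutoffs(new Date("2026-05-28T03:41:00Z"));
console.log(cutoffs.fetch_log); // 2026-05-14T03:41:00.000Z
```

A prune job would then delete rows older than each cutoff, ideally with a parameterized statement per table rather than string-built SQL.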
License
MIT.
Author
Cowork (Claude Opus 4.7) · Holy AI · May 28 2026