schema-comply
Live JSON-schema-compliance leaderboard for every major LLM. Does the model actually return what it promised? Refreshed every 4 hours via real probes through OpenRouter. No mocks. No seed rows.
- Live: https://holyai.me/schema-comply/
- Health: GET /schema-comply/health
- Leaderboard JSON: GET /schema-comply/api/leaderboard?window=7d
- README: this file
Why this exists
In June 2026 every frontier API ships a "structured outputs" / "JSON schema" mode:
- OpenAI: response_format: { type: "json_schema", json_schema: { ... } }
- Anthropic: enforced via tool_use block with input_schema
- Google: responseSchema + responseMimeType: "application/json"
- DeepSeek / Qwen / Mistral / Kimi: response_format: { type: "json_object" }
The catch: providers advertise the same surface but differ in how strictly the
schema is enforced. Some constrain decoding; some only re-prompt; some silently
emit free text wrapped in Markdown json code fences; some omit required fields,
choke on recursion, or refuse on oneOf. Devs ship code that worked yesterday on
Claude and breaks on Gemini.
schema-comply answers one question every 4 hours:
Given a fixed prompt and a fixed JSON schema, what fraction of the time does each major LLM return a response that actually validates?
Sibling Cowork products: tool-use-arena (function-calling correctness via
BFCL), tool-mirage (hallucinated tool names from GitHub issues), cache-arena
(prompt-cache discount), tps-board (measured TPS). This one is the
structured-output axis.
The probe battery
Five fixed schemas live in probes/schemas/*.json. Each ships with a fixed
natural-language prompt, held identical across providers so day-over-day deltas
reflect provider drift, not prompt drift.
| ID | Difficulty | What it tests |
|---|---|---|
| flat | ★ | Three required fields, no extras |
| enum_range | ★★ | String enum + integer range |
| nested_optional | ★★★ | Nested required vs optional under additionalProperties:false |
| array_constraints | ★★★ | minItems / maxItems / uniqueItems / pattern |
| discriminated_union | ★★★★★ | oneOf with two distinct shapes |
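A minimal sketch of how one probe's fixed prompt and schema could be turned into a request body. The `response_format` shape follows the json_schema form listed under "Why this exists"; the function name `buildProbeBody` and the `{ id, prompt, schema }` probe object are hypothetical, and the real service may choose a different mode per model.

```javascript
// Build the JSON body for one probe request.
// `probe` is a hypothetical object: { id, prompt, schema }.
// The same prompt and schema are reused for every model, so only `model` varies.
function buildProbeBody(modelId, probe) {
  return {
    model: modelId,
    messages: [{ role: "user", content: probe.prompt }],
    response_format: {
      type: "json_schema",
      json_schema: { name: probe.id, strict: true, schema: probe.schema },
    },
  };
}

// Illustrative probe only; the real schemas live in probes/schemas/*.json.
const exampleProbe = {
  id: "flat",
  prompt: "Describe one city as JSON.",
  schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      country: { type: "string" },
      population: { type: "integer" },
    },
    required: ["name", "country", "population"],
    additionalProperties: false,
  },
};
```

Keeping the builder free of per-provider branches is what makes day-over-day deltas attributable to the provider rather than to the request.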
For every probe response we record parsed (did JSON.parse succeed?), valid (did AJV 2020-12 validate?), the AJV error array, and a distilled
flag set: extra_fields, missing_required, type_mismatch, enum_violation, format_violation, range_violation, oneOf_mismatch, not_json.
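The distillation step can be pictured as a lookup from AJV error keywords to the flag names above. This is a sketch, not the contents of lib/validator.js: the keyword grouping is an assumption, and `distillFlags` is a hypothetical name. It relies only on the `keyword` property that AJV sets on each error object.

```javascript
// Assumed grouping of AJV error keywords into the documented flags.
const KEYWORD_TO_FLAG = new Map([
  ["additionalProperties", "extra_fields"],
  ["required", "missing_required"],
  ["type", "type_mismatch"],
  ["enum", "enum_violation"],
  ["format", "format_violation"],
  ["pattern", "format_violation"],
  ["minimum", "range_violation"],
  ["maximum", "range_violation"],
  ["exclusiveMinimum", "range_violation"],
  ["exclusiveMaximum", "range_violation"],
  ["oneOf", "oneOf_mismatch"],
]);

// parsed: whether JSON.parse succeeded; ajvErrors: AJV's error array (or null).
// Returns a sorted, de-duplicated list of flags.
function distillFlags(parsed, ajvErrors) {
  if (!parsed) return ["not_json"];
  const flags = new Set();
  for (const err of ajvErrors ?? []) {
    const flag = KEYWORD_TO_FLAG.get(err.keyword);
    if (flag) flags.add(flag);
  }
  return [...flags].sort();
}
```

Keywords without a mapping are ignored here; the full AJV error array is still stored alongside the flags, so nothing is lost.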
The leaderboard sorts by 7-day compliance rate (rolling). Models with
fewer than 5 probes in the window show n/a and sort to the bottom.
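The ranking rule can be sketched as follows. The row shape `{ model, probes, valid }` and the name `rankLeaderboard` are assumptions for illustration; a `null` rate stands in for the n/a shown in the UI.

```javascript
const MIN_PROBES = 5; // below this, the rate is reported as n/a

// rows: [{ model, probes, valid }] counted over the rolling window.
// Returns a new array sorted by compliance rate, n/a rows last.
function rankLeaderboard(rows) {
  const scored = rows.map((r) => ({
    model: r.model,
    probes: r.probes,
    rate: r.probes >= MIN_PROBES ? r.valid / r.probes : null,
  }));
  return scored.sort((a, b) => {
    if (a.rate === null && b.rate === null) return 0;
    if (a.rate === null) return 1; // n/a sinks to the bottom
    if (b.rate === null) return -1;
    return b.rate - a.rate; // higher compliance first
  });
}
```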
Data sources (every datapoint is fetched at runtime)
| Source | URL | Auth | Cadence | Purpose |
|---|---|---|---|---|
| OpenRouter chat completions | POST https://openrouter.ai/api/v1/chat/completions | Bearer OPENROUTER_API_KEY | every 4 h, ≤1 probe per (model × schema), budget-capped at $0.50/day | The probe itself |
| OpenRouter generation stats | GET https://openrouter.ai/api/v1/generation?id=<id> | Bearer OPENROUTER_API_KEY | once per probe (~1.2 s after completion) | Authoritative cost, latency, provider name |
| OpenRouter model catalogue | GET https://openrouter.ai/api/v1/models | none | every 6 h | Enroll/disable. Model is eligible iff supported_parameters includes structured_outputs OR response_format. |
| Provider doc pages | OpenAI / Anthropic / Google / DeepSeek / Mistral JSON-mode docs | none | every 30 min | Reachability only — we never scrape numbers |
| arXiv cs.CL daily RSS | http://export.arxiv.org/rss/cs.CL | none | daily | "Recent research" sidebar — title + link only, filtered by keywords |
If a fetch fails the failure is logged to fetch_log and the previous rows are
left as-is. No fallback values. No Math.random() jitter.
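The enrollment rule from the catalogue row can be sketched as a small filter. It assumes the catalogue response wraps models in a `data` array whose entries carry `id` and `supported_parameters`; the function names are hypothetical.

```javascript
// A model is eligible iff it advertises structured_outputs OR response_format.
function isEligible(model) {
  const params = Array.isArray(model.supported_parameters)
    ? model.supported_parameters
    : [];
  return params.includes("structured_outputs") || params.includes("response_format");
}

// catalogue: parsed JSON from the model catalogue (assumed shape: { data: [...] }).
function eligibleIds(catalogue) {
  return (catalogue.data ?? []).filter(isEligible).map((m) => m.id);
}
```

A model with a missing or malformed `supported_parameters` field is treated as ineligible rather than guessed at, in keeping with the no-fallback rule.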
Read-only / live modes
If OPENROUTER_API_KEY is missing or set to __INJECT_FROM_VAULT__, the
service starts in read-only mode: the probe and catalogue crons no-op
(logging disabled:no_key to fetch_log), and the UI serves whatever
historical snapshots are in SQLite. The arXiv + doc-page crons still run.
This is what lets the server boot cleanly in any environment (Cowork sandbox,
RNDLAB host before vault injection, dev laptop). The orchestrator injects the
real key on deploy and probes begin at the next 4-hour boundary (a startup
delay also fires one tick ~30 s after boot).
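The mode decision described above reduces to one check on the environment. This is a sketch: `resolveMode` and the returned strings "read-only" and "live" are assumed names, not necessarily what /health reports.

```javascript
const VAULT_PLACEHOLDER = "__INJECT_FROM_VAULT__";

// env: an object like process.env.
// A missing, empty, or placeholder key means the probe and catalogue crons no-op.
function resolveMode(env) {
  const key = env.OPENROUTER_API_KEY;
  return !key || key === VAULT_PLACEHOLDER ? "read-only" : "live";
}
```

Calling `resolveMode(process.env)` at the top of each cron tick, rather than once at boot, would let a key injected after startup take effect without a restart.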
Budget cap
Hard ceiling on daily spend stored in settings.daily_budget_usd_cents
(default 50 cents). Before every probe the cron sums total_cost across the
current UTC day; if the sum ≥ cap, the tick exits and logs budget_exhausted.
Editable at runtime via POST /api/settings — like everything else, no auth.
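A sketch of the guard, under two assumptions: `total_cost` is stored in USD while the cap is in cents, and each probe row carries a hypothetical `created_at_ms` epoch timestamp. The function names are illustrative.

```javascript
// Midnight of the current UTC day, as epoch milliseconds.
function utcDayStart(now) {
  return Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
}

// probes: [{ created_at_ms, total_cost }] with total_cost assumed to be USD.
// capCents: settings.daily_budget_usd_cents. Returns true when the tick should exit.
function budgetExhausted(probes, capCents, now = new Date()) {
  const start = utcDayStart(now);
  const spentUsd = probes
    .filter((p) => p.created_at_ms >= start)
    .reduce((sum, p) => sum + p.total_cost, 0);
  // Round to micro-cents so float noise cannot hide an exactly reached cap.
  const spentCents = Math.round(spentUsd * 100 * 1e6) / 1e6;
  return spentCents >= capCents;
}
```

The comparison is `>=`, matching the rule that a tick exits once the sum reaches the cap, not only when it exceeds it.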
API
All endpoints live under /schema-comply. Responses are JSON, except /
(HTML) and /badge/:model (SVG).
| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness + per-source freshness, mode, budget |
| GET | /api/leaderboard?window=7d&schema=<id> | Ranked rows |
| GET | /api/models | Tracked models (filter: ?provider=, ?active=) |
| GET | /api/models/:id | One model + last 50 probes + per-schema breakdown + 30-day series |
| GET | /api/schemas | The five test schemas + their prompt |
| GET | /api/probes/recent?limit=50 | Recent probes (raw content truncated) |
| GET | /api/probes/:id | One probe, full raw response + AJV errors |
| GET | /api/fetch-log?limit=200 | Last fetch attempts |
| GET | /api/stats | Totals + budget + mode |
| GET | /api/papers | arXiv sidebar entries |
| GET, POST | /api/settings | Read/update settings (no auth) |
| POST | /api/probes | Manual probe {model_id, schema_id} — honors budget cap |
| GET | /badge/:model | SVG shield with 7-day compliance rate |
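For callers, a small helper that builds the leaderboard URL from the documented query parameters may be useful. This is a sketch: `leaderboardUrl` is a hypothetical name, and the response shape is not specified here, so the example stops at the URL.

```javascript
// origin: e.g. "http://localhost:4910" or "https://holyai.me".
// window defaults to the 7-day view; schema optionally narrows to one probe ID.
function leaderboardUrl(origin, { window: win = "7d", schema } = {}) {
  const url = new URL("/schema-comply/api/leaderboard", origin);
  url.searchParams.set("window", win);
  if (schema) url.searchParams.set("schema", schema);
  return url.toString();
}
```

On Node.js 20+, the result can be passed straight to the built-in `fetch`, for example `fetch(leaderboardUrl("http://localhost:4910", { schema: "flat" })).then((r) => r.json())`.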
Stack
- Node.js 20+ / Express 4
- better-sqlite3 11 (WAL mode + synchronous=NORMAL + 64 MiB journal cap)
- AJV 8 with ajv-formats (JSON Schema draft 2020-12, required for oneOf + prefixItems)
- node-cron 3
- helmet 7 + compression 1 + cors 2
- Vanilla JS SPA, dark theme, English UI labels (international audience; holyai.me ships to a global gallery)
Run locally
npm install
cp .env.example .env # default port 4910
# Set OPENROUTER_API_KEY=sk-or-... to enable live probes.
npm start
Visit http://localhost:4910/schema-comply/.
Layout
server.js                    # Express app, route mounting, cron registration
db.js                        # SQLite init + prepared statements
lib/
  validator.js               # AJV 2020-12 setup + error-class distillation
  budget.js                  # Daily-budget guard
  cron.js                    # Registers all cron jobs
  log.js                     # fetch_log helper
fetchers/
  openrouter_catalogue.js    # GET /api/v1/models, enroll/disable
  openrouter_probe.js        # POST /chat/completions + GET /generation, AJV validate
  arxiv.js                   # cs.CL RSS sidebar
  doc_probe.js               # Provider doc-page reachability
  aggregate.js               # Daily roll-up + 60-day vacuum
routes/
  health.js leaderboard.js models.js schemas.js probes.js
  stats.js fetch_log.js papers.js settings.js badge.js
probes/schemas/              # The five fixed JSON-schema probes
public/                      # SPA: index.html + app.js + style.css
Auth policy
There is no auth. Every endpoint — read and write — is public, by design.
Arda inspects everything live; we do not want a password between him and the
data.
Honesty pledge
- Zero seed rows. DB starts empty.
- Zero mocked numbers. Every value on the page traces to a row in probes, models, arxiv_papers, or fetch_log.
- Zero Math.random() jitter.
- If a source is down, the page shows "Awaiting first probe…" or "Failed to load"; it never invents data.
- Every fetch attempt, success or failure, is recorded in fetch_log and exposed at /api/fetch-log.
License
MIT.