schema-comply

Live JSON-schema-compliance leaderboard for every major LLM. Does the model actually return what it promised? Refreshed every 4 hours via real probes through OpenRouter. No mocks. No seed rows.

Live: https://holyai.me/schema-comply/

Health: GET /schema-comply/health
Leaderboard JSON: GET /schema-comply/api/leaderboard?window=7d
README: this file

Why this exists

In June 2026 every frontier API ships a "structured outputs" / "JSON schema" mode:

OpenAI: response_format: { type: "json_schema", json_schema: { ... } }
Anthropic: enforced via tool_use block with input_schema
Google: responseSchema + responseMimeType: "application/json"
DeepSeek / Qwen / Mistral / Kimi: response_format: { type: "json_object" }
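
As a rough illustration of the first shape in the list, the sketch below builds an OpenAI-style request body with a json_schema response format. The helper name buildProbeRequest, the placeholder model ID, and the sample schema are hypothetical and not taken from this repository.

```javascript
// Hypothetical helper: builds an OpenAI-style chat-completions body that
// requests schema-constrained output. Nothing here is sent over the network.
function buildProbeRequest(model, prompt, schemaName, schema) {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    response_format: {
      type: "json_schema",
      json_schema: { name: schemaName, strict: true, schema },
    },
  };
}

// Illustrative schema only; the real probes live in probes/schemas/*.json.
const sampleSchema = {
  type: "object",
  properties: { city: { type: "string" } },
  required: ["city"],
  additionalProperties: false,
};

const body = buildProbeRequest(
  "vendor/model-name", // placeholder model ID
  "Name one city. Reply as JSON.",
  "sample",
  sampleSchema
);
console.log(JSON.stringify(body, null, 2));
```

Providers that only offer json_object mode would receive a different response_format value, which is part of what makes cross-provider comparison interesting.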

The catch: providers advertise the same surface but differ in how strictly the
schema is enforced. Some constrain decoding; some only re-prompt; some silently
emit free text wrapped in Markdown json code fences; some omit required fields
under recursion or refuse on oneOf. Devs ship code that worked yesterday on
Claude and breaks on Gemini.

schema-comply answers one question every 4 hours:

Given a fixed prompt and a fixed JSON schema, what fraction of the time does each major LLM return a response that actually validates?

Sibling Cowork products: tool-use-arena (function-calling correctness via
BFCL), tool-mirage (hallucinated tool names from GitHub issues), cache-arena
(prompt-cache discount), tps-board (measured TPS). This one is the
structured-output axis.

The probe battery

Five fixed schemas live in probes/schemas/*.json. Each ships with a fixed
natural-language prompt, held identical across providers so day-over-day deltas
reflect provider drift, not prompt drift.

| ID | Difficulty | What it tests |
|---|---|---|
| flat | ★ | Three required fields, no extras |
| enum_range | ★★ | String enum + integer range |
| nested_optional | ★★★ | Nested required vs optional under additionalProperties:false |
| array_constraints | ★★★ | minItems / maxItems / uniqueItems / pattern |
| discriminated_union | ★★★★★ | oneOf with two distinct shapes |

For every probe response we record parsed (did JSON.parse succeed?),
valid (did AJV 2020-12 validate?), the AJV error array, and a distilled
flag set: extra_fields, missing_required, type_mismatch, enum_violation,
format_violation, range_violation, oneOf_mismatch, not_json.
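
The parsed check and the flag distillation described above could look roughly like the sketch below. The keyword-to-flag mapping is an assumption about how lib/validator.js groups AJV error keywords, and the function names tryParse and distillFlags are hypothetical. The error objects only mimic the keyword, instancePath, and params fields that AJV reports, so the snippet runs without AJV installed.

```javascript
// Assumed mapping from AJV error keywords to the flag names listed above.
const KEYWORD_TO_FLAG = new Map([
  ["additionalProperties", "extra_fields"],
  ["required", "missing_required"],
  ["type", "type_mismatch"],
  ["enum", "enum_violation"],
  ["format", "format_violation"],
  ["pattern", "format_violation"],
  ["minimum", "range_violation"],
  ["maximum", "range_violation"],
  ["exclusiveMinimum", "range_violation"],
  ["exclusiveMaximum", "range_violation"],
  ["oneOf", "oneOf_mismatch"],
]);

// Mirrors the "parsed" column: did JSON.parse succeed on the raw content?
function tryParse(raw) {
  try {
    return { parsed: true, value: JSON.parse(raw) };
  } catch {
    return { parsed: false, value: undefined };
  }
}

// Collapses an AJV-style error array into a sorted, de-duplicated flag list.
function distillFlags(parsed, ajvErrors) {
  if (!parsed) return ["not_json"];
  const flags = new Set();
  for (const err of ajvErrors ?? []) {
    const flag = KEYWORD_TO_FLAG.get(err.keyword);
    if (flag) flags.add(flag);
  }
  return [...flags].sort();
}

// A fenced reply is not valid JSON, so it is flagged not_json.
const fenced = tryParse('```json\n{"id": 1}\n```');
const fencedFlags = distillFlags(fenced.parsed, null);

// Hand-written errors in AJV's shape, for illustration only.
const sampleErrors = [
  { keyword: "required", instancePath: "", params: { missingProperty: "id" } },
  { keyword: "enum", instancePath: "/status", params: { allowedValues: ["open", "closed"] } },
  { keyword: "required", instancePath: "/owner", params: { missingProperty: "name" } },
];
const errorFlags = distillFlags(true, sampleErrors);
console.log(fencedFlags, errorFlags);
```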

The leaderboard sorts by rolling 7-day compliance rate. Models with
fewer than 5 probes in the window show n/a and sort to the bottom.
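
A minimal sketch of that ranking rule, assuming each row carries hypothetical valid and total counts for the window (the real column names are not shown in this README):

```javascript
const MIN_PROBES = 5; // threshold stated above

// Returns null (rendered as n/a) when the sample is too small to rank.
function complianceRate(row) {
  return row.total >= MIN_PROBES ? row.valid / row.total : null;
}

// Sorts by rate descending; rows without a rate sink to the bottom.
function rankModels(rows) {
  return rows
    .map((row) => ({ ...row, rate: complianceRate(row) }))
    .sort((a, b) => {
      if (a.rate === null && b.rate === null) return 0;
      if (a.rate === null) return 1;
      if (b.rate === null) return -1;
      return b.rate - a.rate;
    });
}

const ranked = rankModels([
  { model: "model-a", valid: 3, total: 4 },   // under 5 probes: n/a
  { model: "model-b", valid: 18, total: 20 }, // 0.9
  { model: "model-c", valid: 10, total: 10 }, // 1.0
]);
console.log(ranked.map((r) => `${r.model}: ${r.rate ?? "n/a"}`));
```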

Data sources (every datapoint is fetched at runtime)

| Source | URL | Auth | Cadence | Purpose |
|---|---|---|---|---|
| OpenRouter chat completions | POST https://openrouter.ai/api/v1/chat/completions | Bearer OPENROUTER_API_KEY | every 4 h, ≤1 probe per (model × schema), budget-capped at $0.50/day | The probe itself |
| OpenRouter generation stats | GET https://openrouter.ai/api/v1/generation?id=<id> | Bearer OPENROUTER_API_KEY | once per probe (~1.2 s after completion) | Authoritative cost, latency, provider name |
| OpenRouter model catalogue | GET https://openrouter.ai/api/v1/models | none | every 6 h | Enroll/disable. Model is eligible iff supported_parameters includes structured_outputs OR response_format. |
| Provider doc pages | OpenAI / Anthropic / Google / DeepSeek / Mistral JSON-mode docs | none | every 30 min | Reachability only — we never scrape numbers |
| arXiv cs.CL daily RSS | http://export.arxiv.org/rss/cs.CL | none | daily | "Recent research" sidebar — title + link only, filtered by keywords |
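
The eligibility rule in the catalogue row can be expressed as a small predicate. This is a sketch; the function name isEligible and the sample catalogue entries are made up for illustration.

```javascript
// A model is eligible iff supported_parameters includes
// structured_outputs OR response_format.
function isEligible(model) {
  const params = model.supported_parameters ?? [];
  return params.includes("structured_outputs") || params.includes("response_format");
}

// Hypothetical catalogue entries shaped like items from GET /api/v1/models.
const catalogue = [
  { id: "vendor/alpha", supported_parameters: ["temperature", "structured_outputs"] },
  { id: "vendor/beta", supported_parameters: ["response_format"] },
  { id: "vendor/gamma", supported_parameters: ["temperature", "top_p"] },
  { id: "vendor/delta" }, // no supported_parameters field at all
];

const eligibleIds = catalogue.filter(isEligible).map((m) => m.id);
console.log(eligibleIds);
```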

If a fetch fails, the failure is logged to fetch_log and the previous rows are
left as-is. No fallback values. No Math.random() jitter.

Read-only / live modes

If OPENROUTER_API_KEY is missing or set to __INJECT_FROM_VAULT__, the
service starts in read-only mode: the probe and catalogue crons no-op
(logging disabled:no_key to fetch_log), and the UI serves whatever
historical snapshots are in SQLite. The arXiv + doc-page crons still run.
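
The mode decision above amounts to a single check on the key. The sketch below treats an empty or whitespace-only value as missing, which is an assumption; the function name resolveMode and the mode strings are hypothetical.

```javascript
const VAULT_PLACEHOLDER = "__INJECT_FROM_VAULT__";

// Returns "read-only" when no usable key is configured, otherwise "live".
function resolveMode(env) {
  const key = (env.OPENROUTER_API_KEY ?? "").trim();
  return key === "" || key === VAULT_PLACEHOLDER ? "read-only" : "live";
}

console.log(resolveMode({}));                                        // no key set
console.log(resolveMode({ OPENROUTER_API_KEY: VAULT_PLACEHOLDER })); // placeholder
console.log(resolveMode({ OPENROUTER_API_KEY: "sk-or-example" }));   // dummy key
```

In the real service the argument would presumably be process.env.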

This is what lets the server boot cleanly in any environment (Cowork sandbox,
RNDLAB host before vault injection, dev laptop). The orchestrator injects the
real key on deploy and probes begin at the next 4-hour boundary (a startup
delay also fires one tick ~30 s after boot).
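
To make "the next 4-hour boundary" concrete, the sketch below computes it under the assumption that boundaries fall at 00:00, 04:00, 08:00, and so on in UTC. The actual cron expression and timezone are not stated in this README, and nextProbeBoundary is a hypothetical name.

```javascript
const FOUR_HOURS_MS = 4 * 60 * 60 * 1000;

// Unix time starts at 00:00 UTC and 4 h divides 24 h evenly, so multiples of
// FOUR_HOURS_MS line up with 00:00, 04:00, 08:00, ... UTC.
function nextProbeBoundary(now) {
  const slot = Math.floor(now.getTime() / FOUR_HOURS_MS) + 1;
  return new Date(slot * FOUR_HOURS_MS);
}

const boot = new Date("2026-06-01T09:15:00Z");
console.log(nextProbeBoundary(boot).toISOString());
```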

Budget cap

A hard ceiling on daily spend is stored in settings.daily_budget_usd_cents
(default 50 cents). Before every probe the cron sums total_cost across the
current UTC day; if the sum ≥ cap, the tick exits and logs budget_exhausted.
The cap is editable at runtime via POST /api/settings; like everything else, no auth.
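
A sketch of the guard logic, assuming total_cost is recorded in US dollars and each probe row carries a hypothetical created_at timestamp in epoch milliseconds. The real check presumably runs as a SQL sum in lib/budget.js; this in-memory version only illustrates the rule.

```javascript
// Midnight of the current UTC day, as epoch milliseconds.
function utcDayStart(now) {
  return Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
}

// Sums today's spend in cents. total_cost is assumed to be in USD.
function spentTodayCents(probes, now) {
  const start = utcDayStart(now);
  return probes
    .filter((p) => p.created_at >= start)
    .reduce((sum, p) => sum + p.total_cost * 100, 0);
}

// True when the tick should exit and log budget_exhausted.
function budgetExhausted(probes, capCents, now) {
  return spentTodayCents(probes, now) >= capCents;
}

const now = new Date("2026-06-01T12:00:00Z");
const probes = [
  { created_at: Date.parse("2026-05-31T23:00:00Z"), total_cost: 0.4 },  // yesterday, ignored
  { created_at: Date.parse("2026-06-01T01:00:00Z"), total_cost: 0.3 },
  { created_at: Date.parse("2026-06-01T05:00:00Z"), total_cost: 0.25 },
];
console.log(budgetExhausted(probes, 50, now));
console.log(budgetExhausted(probes.slice(0, 2), 50, now));
```

Summing floating-point dollars can drift by tiny amounts near the cap, so a production guard might store integer micro-dollars instead.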

API

All endpoints are under /schema-comply. Responses are JSON, except / (the HTML
SPA) and /badge/:model (SVG).

| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness + per-source freshness, mode, budget |
| GET | /api/leaderboard?window=7d&schema=<id> | Ranked rows |
| GET | /api/models | Tracked models (filter: ?provider=, ?active=) |
| GET | /api/models/:id | One model + last 50 probes + per-schema breakdown + 30-day series |
| GET | /api/schemas | The five test schemas + their prompt |
| GET | /api/probes/recent?limit=50 | Recent probes (raw content truncated) |
| GET | /api/probes/:id | One probe, full raw response + AJV errors |
| GET | /api/fetch-log?limit=200 | Last fetch attempts |
| GET | /api/stats | Totals + budget + mode |
| GET | /api/papers | arXiv sidebar entries |
| GET, POST | /api/settings | Read/update settings (no auth) |
| POST | /api/probes | Manual probe {model_id, schema_id} — honors budget cap |
| GET | /badge/:model | SVG shield with 7-day compliance rate |
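
A client might build the leaderboard URL and fetch it as sketched below. The helper names are hypothetical, and the response body shape is not documented here, so the sketch returns the parsed JSON without interpreting it. It relies on the global fetch available in Node.js 18 and later.

```javascript
// Builds the leaderboard URL. baseUrl must end with a slash so the relative
// path resolves under /schema-comply/.
function leaderboardUrl(baseUrl, { window = "7d", schema } = {}) {
  const url = new URL("api/leaderboard", baseUrl);
  url.searchParams.set("window", window);
  if (schema) url.searchParams.set("schema", schema);
  return url.toString();
}

// Fetches the ranked rows and fails loudly on a non-2xx status.
async function fetchLeaderboard(baseUrl, options) {
  const res = await fetch(leaderboardUrl(baseUrl, options));
  if (!res.ok) throw new Error(`leaderboard request failed: HTTP ${res.status}`);
  return res.json();
}

const localBase = "http://localhost:4910/schema-comply/";
console.log(leaderboardUrl(localBase));
console.log(leaderboardUrl(localBase, { schema: "flat" }));
// fetchLeaderboard(localBase).then(console.log); // needs a running server
```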

Stack

Node.js 20+ / Express 4
better-sqlite3 11 (WAL mode + synchronous=NORMAL + 64 MiB journal cap)
AJV 8 with ajv-formats (JSON Schema draft 2020-12, the draft that introduces prefixItems; oneOf itself predates it)
node-cron 3
helmet 7 + compression 1 + cors 2
Vanilla JS SPA, dark theme, English UI labels (international audience — holyai.me ships to a global gallery)
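
The SQLite settings in the list above might be applied as in the sketch below. It assumes the "64 MiB journal cap" corresponds to SQLite's journal_size_limit pragma, which this README does not confirm, and applyPragmas is a hypothetical name. The function accepts any object exposing a pragma(string) method, as better-sqlite3 Database instances do; a stub stands in here so the snippet runs without the native module.

```javascript
const JOURNAL_CAP_BYTES = 64 * 1024 * 1024; // 64 MiB

// Applies the three settings named in the stack list.
function applyPragmas(db) {
  db.pragma("journal_mode = WAL");
  db.pragma("synchronous = NORMAL");
  db.pragma(`journal_size_limit = ${JOURNAL_CAP_BYTES}`);
}

// Stub that records statements instead of touching a database.
const recorded = [];
applyPragmas({ pragma: (statement) => recorded.push(statement) });
console.log(recorded);

// With the real library this would be roughly:
//   const Database = require("better-sqlite3");
//   applyPragmas(new Database("data.db")); // file name is a placeholder
```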

Run locally

npm install
cp .env.example .env       # default port 4910
# Set OPENROUTER_API_KEY=sk-or-... to enable live probes.
npm start

Visit http://localhost:4910/schema-comply/.

Layout

server.js                     # Express app, route mounting, cron registration
db.js                         # SQLite init + prepared statements
lib/
  validator.js                # AJV 2020-12 setup + error-class distillation
  budget.js                   # Daily-budget guard
  cron.js                     # Registers all cron jobs
  log.js                      # fetch_log helper
fetchers/
  openrouter_catalogue.js     # GET /api/v1/models, enroll/disable
  openrouter_probe.js         # POST /chat/completions + GET /generation, AJV validate
  arxiv.js                    # cs.CL RSS sidebar
  doc_probe.js                # Provider doc-page reachability
  aggregate.js                # Daily roll-up + 60-day vacuum
routes/
  health.js  leaderboard.js  models.js  schemas.js  probes.js
  stats.js   fetch_log.js    papers.js   settings.js  badge.js
probes/schemas/               # The five fixed JSON-schema probes
public/                       # SPA: index.html + app.js + style.css

Auth policy

There is no auth. Every endpoint — read and write — is public, by design.
Arda inspects everything live; we do not want a password between him and the
data.

Honesty pledge

Zero seed rows. DB starts empty.
Zero mocked numbers. Every value on the page traces to a row in probes,
models, arxiv_papers, or fetch_log.
Zero Math.random() jitter.
If a source is down, the page shows "Awaiting first probe…" or "Failed to
load" — it never invents data.
Every fetch attempt — success or failure — is recorded in fetch_log and
exposed at /api/fetch-log.

License

MIT.