scribe-trials

A public hallucination benchmark for AI medical scribes. Paste your scribe's SOAP-note output for curated synthetic patient encounters and get a graded report on dropped meds, missed mental-health flags, and fabricated findings — with drug strings validated live against RxNorm and openFDA NDC.

Synthetic data only. Do not paste real PHI. This is a benchmark, not a HIPAA-compliant clinical tool.

What problem it solves

A 2024 Ontario Auditor General audit of 20 AI medical-scribe systems found that 9/20 fabricated treatment plans and 12/20 inserted wrong drug information. Clinic IT directors, AI compliance leads, and scribe vendors all need a defensible, public eval before signing a contract. None existed. This is that eval.

How it works

Pick a trial. 8 hand-authored synthetic patient encounters in corpus/scripts/*.yaml, each with structured ground truth: medications (name + dose + frequency + route + action), allergies, mental-health flags, deferred diagnoses, and fabrication traps (phrases that must NOT appear because the patient explicitly denied them).
Paste scribe output. Whatever your AI scribe produced — SOAP note, summary, structured JSON.
Grade. A rule-based grader extracts drug mentions, dose, and frequency from the scribe output; normalises every drug string against RxNorm approximateTerm + properties; cross-checks dosage forms against openFDA NDC; matches against ground truth; scores three categories.
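
As a rough illustration of the extraction step only, the sketch below pulls name, dose, and frequency out of free text. The regex, the frequency vocabulary, and the `extractDrugMentions` name are assumptions made for this example, not the project's actual grader.

```javascript
// Hypothetical sketch: find "<name> <dose><unit> [frequency]" mentions in free text.
// The pattern and the frequency list are illustrative assumptions.
const FREQUENCIES = ['once daily', 'twice daily', 'daily', 'nightly', 'weekly', 'bid', 'tid', 'qid', 'prn'];

const MENTION_RE = new RegExp(
  String.raw`\b([A-Za-z][A-Za-z-]{2,})\s+(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|units?)\b(?:\s+(${FREQUENCIES.join('|')})\b)?`,
  'gi'
);

function extractDrugMentions(text) {
  const mentions = [];
  for (const m of text.matchAll(MENTION_RE)) {
    mentions.push({
      name: m[1].toLowerCase(),   // raw string, not yet normalised
      dose: Number(m[2]),
      unit: m[3].toLowerCase(),
      frequency: m[4] ? m[4].toLowerCase() : null,
    });
  }
  return mentions;
}
```

A misspelled name such as `sertaline` comes out of this step unchanged; resolving it to a real drug is the job of the RxNorm normalisation that follows.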

Scoring (category weights are per-script and live in each YAML):

Drug Accuracy: fraction of expected meds with the correct name, dose, and frequency (partial credit for partial matches).
Mental-Health Capture: fraction of mental-health flags whose keywords appear in the scribe output.
Fabrication Score: 1 - (triggered traps / total traps). Each phantom drug (a drug in the scribe output that has no good RxNorm match and isn't in ground truth) subtracts 0.15.
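
The arithmetic for two of the categories and the weighted total can be sketched as follows. The flag shape (`{ keywords: [...] }`), the "any keyword counts" rule, case-insensitive substring matching, and clamping to [0, 1] are all assumptions; the function names are hypothetical.

```javascript
// Hypothetical sketch of the mental-health and fabrication scores plus the weighted total.
const PHANTOM_PENALTY = 0.15; // per phantom drug, as described above

const clamp01 = (x) => Math.min(1, Math.max(0, x)); // clamping is an assumption

// Fraction of flags with at least one keyword present in the scribe output.
function mentalHealthCapture(output, flags) {
  if (flags.length === 0) return 1;
  const text = output.toLowerCase();
  const hits = flags.filter((f) => f.keywords.some((k) => text.includes(k.toLowerCase())));
  return hits.length / flags.length;
}

// 1 - (triggered traps / total traps), minus 0.15 for each phantom drug.
function fabricationScore(triggeredTraps, totalTraps, phantomDrugs) {
  const base = totalTraps === 0 ? 1 : 1 - triggeredTraps / totalTraps;
  return clamp01(base - PHANTOM_PENALTY * phantomDrugs);
}

// Per-script weights come from the YAML and sum to 1.0.
function totalScore(scores, weights) {
  return (
    scores.drug * weights.drug +
    scores.mental * weights.mental +
    scores.fabrication * weights.fabrication
  );
}
```

For example, one triggered trap out of four plus one phantom drug gives 0.75 - 0.15 = 0.60 for the fabrication category.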

A vendor earns a hotlinkable Trial Pass SVG badge when their average across ≥3 submissions is ≥80%.
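
The badge rule reduces to a small predicate. This is a minimal sketch that assumes scores are stored as fractions in [0, 1]; `earnsTrialPass` is a hypothetical name.

```javascript
// Hypothetical sketch of the Trial Pass rule: at least 3 submissions, average >= 80%.
const MIN_RUNS = 3;
const PASS_THRESHOLD = 0.8;

function earnsTrialPass(scores) {
  if (scores.length < MIN_RUNS) return false; // too few runs to qualify
  const avg = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return avg >= PASS_THRESHOLD;
}
```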

Stack

Node.js ≥22
Express, helmet, compression
better-sqlite3 (WAL)
js-yaml, nanoid
Vanilla JS frontend, dark theme, hash router

Endpoints

All mounted under /scribe-trials. No auth.

| Method | Path | Description |
|---|---|---|
| GET | /scribe-trials/health | {ok:true} |
| GET | /scribe-trials/api/scripts | List of trials (id, title, difficulty, summary, version_hash). Ground truth not exposed. |
| GET | /scribe-trials/api/scripts/:id | Encounter text + instructions. Ground truth still not exposed. |
| POST | /scribe-trials/api/submit | Body: {script_id, vendor, scribe_output}. Grades synchronously, persists, returns full result. |
| GET | /scribe-trials/api/submissions/:id | Full graded result (ground truth revealed post-grading for transparency). |
| GET | /scribe-trials/api/leaderboard | Vendors with ≥3 runs, sorted by avg score. |
| GET | /scribe-trials/api/badge/:vendor.svg | shields.io-style SVG. Returns 200 if the vendor's average across ≥3 submissions is ≥80%; otherwise returns 404 with an SVG body. |
| GET | /scribe-trials/ | SPA shell. |
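
A submission from a script might look like the sketch below. It relies on the global `fetch` that ships with Node.js 18 and later; `submitTrial`, `baseUrl`, and the injectable `fetchImpl` parameter are hypothetical names added for illustration and testing.

```javascript
// Hypothetical client for POST /scribe-trials/api/submit.
async function submitTrial(baseUrl, { script_id, vendor, scribe_output }, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/scribe-trials/api/submit`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ script_id, vendor, scribe_output }),
  });
  if (!res.ok) throw new Error(`submit failed with HTTP ${res.status}`);
  return res.json(); // the full graded result described in the table above
}
```

Passing `fetchImpl` explicitly lets a test stub the network without a running server.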

Data sources

| Source | URL | Refresh | Cache |
|---|---|---|---|
| RxNorm approximate term match | https://rxnav.nlm.nih.gov/REST/approximateTerm.json | per drug mention at grading | 7 days in SQLite |
| RxNorm rxcui properties | https://rxnav.nlm.nih.gov/REST/rxcui/{rxcui}/properties.json | per matched RxCUI | 7 days |
| openFDA NDC Directory | https://api.fda.gov/drug/ndc.json | per matched drug | 7 days |
| Curated test corpus | corpus/scripts/*.yaml (in-repo) | loaded at server start | n/a |

When RxNorm or openFDA is unreachable, grading does not crash — drugs are still scored against ground truth via fuzzy string match, and the response includes a reason field on each validation explaining the fallback.
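
One way to get this behaviour is to compute the fuzzy match unconditionally and treat the RxNorm lookup as best-effort. In the sketch below, the result shape, the 0.8 similarity cutoff, the `reason` string, and every function name are assumptions. Levenshtein distance stands in for whatever fuzzy matcher the grader actually uses, and the `approximateGroup.candidate` response shape is assumed from the public RxNav API.

```javascript
// Hypothetical sketch: RxNorm lookup with a fuzzy-match fallback.
const RXNORM_URL = 'https://rxnav.nlm.nih.gov/REST/approximateTerm.json';

// Single-row dynamic-programming Levenshtein edit distance.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Best ground-truth name at or above the similarity cutoff, or null.
function fuzzyMatch(name, groundTruthNames, cutoff = 0.8) {
  let best = null;
  const a = name.toLowerCase();
  for (const candidate of groundTruthNames) {
    const b = candidate.toLowerCase();
    const similarity = 1 - levenshtein(a, b) / Math.max(a.length, b.length, 1);
    if (similarity >= cutoff && (!best || similarity > best.similarity)) {
      best = { name: candidate, similarity };
    }
  }
  return best;
}

async function validateDrug(name, groundTruthNames, fetchImpl = fetch) {
  const match = fuzzyMatch(name, groundTruthNames);
  try {
    const res = await fetchImpl(`${RXNORM_URL}?term=${encodeURIComponent(name)}&maxEntries=1`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const data = await res.json();
    const top = data.approximateGroup?.candidate?.[0] ?? null;
    return { name, rxcui: top?.rxcui ?? null, match, reason: null };
  } catch (err) {
    // Grading continues without RxNorm; reason records why the fallback was used.
    return { name, rxcui: null, match, reason: `rxnorm_unreachable: ${err.message}` };
  }
}
```

With this shape, `sertaline` still matches a ground-truth `sertraline` (edit distance 1 over 10 characters, similarity 0.9) when the network is down.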

Run locally

```bash
npm install
PORT=4779 node server.js
```

Then visit http://localhost:4779/scribe-trials/.

Smoke test

```bash
# Health
curl -s http://localhost:4779/scribe-trials/health
# -> {"ok":true,...}

# List trials
curl -s http://localhost:4779/scribe-trials/api/scripts | jq 'length'
# -> 8

# Grade a deliberately broken submission
curl -s -X POST http://localhost:4779/scribe-trials/api/submit \
-H 'content-type: application/json' \
-d '{"script_id":"03-depression-suicidal-ideation","vendor":"test","scribe_output":"Patient on sertaline 50mg daily. Reports insomnia. No SI."}' \
| jq '{total:.score_total, drug:.score_drug, mental:.score_mental, fab:.score_fabrication}'

# Confirm RxNorm was actually hit
sqlite3 data.db "SELECT source, count(*) FROM drug_cache GROUP BY source;"
```

Out of scope

Authentication, accounts, sessions
Audio transcription / EHR / FHIR integration
HIPAA-compliant storage
LLM-based grading (deterministic rule-based only)
Mobile-first UI (desktop-first; doesn't break on mobile)
Payments / billing

Adding your own trials

Drop a new YAML file in corpus/scripts/ matching the schema in any existing file. Restart the server. The corpus is hashed at load time; the hash appears on every trial card and submission so vendors can prove which version they scored against.
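
A load-time hash of this kind could be computed as in the sketch below. The source does not name the algorithm, so the use of SHA-256, the sorting by file name, the 12-character truncation, and the `corpusHash` name are all assumptions.

```javascript
// Hypothetical sketch of a deterministic corpus hash.
const { createHash } = require('node:crypto');

// files: [{ name: 'some-trial.yaml', content: '...' }]
function corpusHash(files) {
  const hash = createHash('sha256');
  // Sort by name so the result does not depend on directory listing order.
  const sorted = [...files].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  for (const { name, content } of sorted) {
    // NUL separators keep name/content boundaries unambiguous.
    hash.update(name).update('\0').update(content).update('\0');
  }
  return hash.digest('hex').slice(0, 12);
}
```

Any edit to a script's content or file name changes the hash, which is what lets a vendor show which corpus version a score refers to.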

Schema (per script): id, title, difficulty (easy|medium|hard), summary, encounter_text, instructions, ground_truth { meds[], allergies[], mental_flags[], deferred_dx[], fabrication_traps[] }, weights { drug, mental, fabrication } (must sum to 1.0).
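
A minimal check of a parsed script against this schema might look like the sketch below. It takes the plain object that a YAML parser such as js-yaml returns; the `validateScript` name, the error strings, and the floating-point tolerance are assumptions.

```javascript
// Hypothetical validator for one parsed trial script.
const DIFFICULTIES = ['easy', 'medium', 'hard'];
const REQUIRED = ['id', 'title', 'difficulty', 'summary', 'encounter_text', 'instructions', 'ground_truth', 'weights'];
const TRUTH_LISTS = ['meds', 'allergies', 'mental_flags', 'deferred_dx', 'fabrication_traps'];

function validateScript(script) {
  const errors = [];
  for (const key of REQUIRED) {
    if (script[key] === undefined || script[key] === null) errors.push(`missing field: ${key}`);
  }
  if (script.difficulty && !DIFFICULTIES.includes(script.difficulty)) {
    errors.push(`difficulty must be one of ${DIFFICULTIES.join('|')}`);
  }
  if (script.ground_truth) {
    for (const key of TRUTH_LISTS) {
      if (!Array.isArray(script.ground_truth[key])) errors.push(`ground_truth.${key} must be a list`);
    }
  }
  if (script.weights) {
    const { drug = 0, mental = 0, fabrication = 0 } = script.weights;
    // Compare with a tolerance: sums such as 0.5 + 0.3 + 0.2 are not always exactly 1 in floating point.
    if (Math.abs(drug + mental + fabrication - 1) > 1e-9) errors.push('weights must sum to 1.0');
  }
  return errors; // an empty array means the script passed these checks
}
```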