scribe-trials
A public hallucination benchmark for AI medical scribes. Paste your scribe's SOAP-note output for curated synthetic patient encounters and get a graded report on dropped meds, missed mental-health flags, and fabricated findings — with drug strings validated live against RxNorm and openFDA NDC.
Synthetic data only. Do not paste real PHI. This is a benchmark, not a HIPAA-compliant clinical tool.
What problem it solves
A 2024 Ontario Auditor General audit of 20 AI medical-scribe systems found that 9/20 fabricated treatment plans and 12/20 inserted wrong drug information. Clinic IT directors, AI compliance leads, and scribe vendors all need a defensible, public eval before signing a contract. None existed. This is that eval.
How it works
- Pick a trial. 8 hand-authored synthetic patient encounters in corpus/scripts/*.yaml, each with structured ground truth: medications (name + dose + frequency + route + action), allergies, mental-health flags, deferred diagnoses, and fabrication traps (phrases that must NOT appear because the patient explicitly denied them).
- Paste scribe output. Whatever your AI scribe produced: SOAP note, summary, structured JSON.
- Grade. A rule-based grader extracts drug mentions, dose, and frequency from the scribe output; normalises every drug string against RxNorm approximateTerm+properties; cross-checks dosage forms against openFDA NDC; matches against ground truth; scores three categories.
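The extraction step can be pictured with a minimal sketch. The regular expressions, the unit and frequency vocabularies, and the function name `extractDrugMentions` are assumptions made for illustration; the grader's actual rules are not described in this README.

```javascript
// Hypothetical sketch: pull "<drug> <dose><unit> [frequency]" mentions out of free text.
// The unit and frequency vocabularies below are illustrative assumptions.
const DOSE = String.raw`(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|units?)\b`;
const FREQ = String.raw`(once daily|twice daily|daily|bid|tid|qid|qhs|prn|weekly)\b`;
const MENTION = new RegExp(
  String.raw`\b([a-z][a-z-]{2,})\s+${DOSE}(?:\s+${FREQ})?`,
  'gi'
);

function extractDrugMentions(text) {
  const mentions = [];
  for (const m of text.matchAll(MENTION)) {
    mentions.push({
      name: m[1].toLowerCase(),      // word immediately before the dose
      dose: Number(m[2]),
      unit: m[3].toLowerCase(),
      frequency: m[4] ? m[4].toLowerCase() : null, // null when no frequency follows
    });
  }
  return mentions;
}

console.log(extractDrugMentions('Patient on sertaline 50mg daily. Reports insomnia. No SI.'));
// -> [ { name: 'sertaline', dose: 50, unit: 'mg', frequency: 'daily' } ]
```

The misspelled name is passed through untouched here; in the pipeline described above, normalising it is the job of the RxNorm lookup that follows extraction.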
Scoring (category weights are per-script and live in each YAML):
- Drug Accuracy: fraction of expected meds with the correct name, dose, and frequency (partial credit for partial matches).
- Mental-Health Capture: fraction of mental-health flags whose keywords appear in the scribe output.
- Fabrication Score: 1 - (triggered traps / total traps). Each phantom drug (a drug in the scribe output that has no good RxNorm match and isn't in ground truth) subtracts 0.15.
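As a rough sketch, the three category scores and the weighted total could be computed as below. Several details are assumptions not stated in this README: scores are clamped to [0, 1], partial drug credit is the mean of name/dose/frequency matches, each mental-health flag carries a `keywords` list, and the total is a plain weighted sum. All function names are hypothetical.

```javascript
const PHANTOM_PENALTY = 0.15; // per phantom drug, as stated in the scoring rules

const clamp01 = (x) => Math.min(1, Math.max(0, x));

// found: Map of drug name -> { dose, frequency } extracted from the scribe output
function drugAccuracy(expectedMeds, found) {
  if (expectedMeds.length === 0) return 1;
  let credit = 0;
  for (const med of expectedMeds) {
    const hit = found.get(med.name);
    if (!hit) continue; // drug not mentioned at all: no credit
    // Assumed partial-credit rule: one third each for name, dose, frequency.
    const parts = [true, hit.dose === med.dose, hit.frequency === med.frequency];
    credit += parts.filter(Boolean).length / parts.length;
  }
  return credit / expectedMeds.length;
}

// flags: [{ keywords: [...] }]; a flag counts if any of its keywords appears
function mentalHealthCapture(flags, text) {
  if (flags.length === 0) return 1;
  const lower = text.toLowerCase();
  const captured = flags.filter((f) =>
    f.keywords.some((k) => lower.includes(k.toLowerCase()))
  );
  return captured.length / flags.length;
}

function fabricationScore(triggeredTraps, totalTraps, phantomDrugs) {
  const base = totalTraps === 0 ? 1 : 1 - triggeredTraps / totalTraps;
  return clamp01(base - PHANTOM_PENALTY * phantomDrugs); // clamping is an assumption
}

// weights come from the script's YAML and are expected to sum to 1.0
function totalScore(scores, weights) {
  return (
    scores.drug * weights.drug +
    scores.mental * weights.mental +
    scores.fabrication * weights.fabrication
  );
}
```

For example, one triggered trap out of four plus one phantom drug gives a fabrication score of about 0.75 - 0.15 = 0.60 under these assumptions.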
A vendor earns a hotlinkable Trial Pass SVG badge when their average across ≥3 submissions is ≥80%.
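The badge rule can be sketched as a small aggregation over stored submissions. This assumes each submission row carries `vendor` and a `score_total` on a 0 to 1 scale; the function name and the shape of the returned rows are hypothetical.

```javascript
const MIN_RUNS = 3;       // a vendor needs at least 3 submissions
const PASS_AVERAGE = 0.8; // and an average of at least 80%

function leaderboard(submissions) {
  const byVendor = new Map();
  for (const { vendor, score_total } of submissions) {
    const entry = byVendor.get(vendor) ?? { vendor, runs: 0, sum: 0 };
    entry.runs += 1;
    entry.sum += score_total;
    byVendor.set(vendor, entry);
  }
  return [...byVendor.values()]
    .filter((e) => e.runs >= MIN_RUNS) // vendors with too few runs are not ranked
    .map((e) => {
      const avg = e.sum / e.runs;
      return { vendor: e.vendor, runs: e.runs, avg, badge: avg >= PASS_AVERAGE };
    })
    .sort((a, b) => b.avg - a.avg); // best average first
}
```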
Stack
- Node.js ≥22
- Express, helmet, compression
- better-sqlite3 (WAL)
- js-yaml, nanoid
- Vanilla JS frontend, dark theme, hash router
Endpoints
All mounted under /scribe-trials. No auth.
| Method | Path | Description |
|---|---|---|
| GET | /scribe-trials/health | {ok:true} |
| GET | /scribe-trials/api/scripts | List of trials (id, title, difficulty, summary, version_hash). Ground truth not exposed. |
| GET | /scribe-trials/api/scripts/:id | Encounter text + instructions. Ground truth still not exposed. |
| POST | /scribe-trials/api/submit | Body: {script_id, vendor, scribe_output}. Grades synchronously, persists, returns full result. |
| GET | /scribe-trials/api/submissions/:id | Full graded result (ground truth revealed post-grading for transparency). |
| GET | /scribe-trials/api/leaderboard | Vendors with ≥3 runs, sorted by avg score. |
| GET | /scribe-trials/api/badge/:vendor.svg | shields.io-style SVG. 200 if vendor avg ≥80%; 404 SVG otherwise. |
| GET | /scribe-trials/ | SPA shell. |
Data sources
| Source | URL | Refresh | Cache |
|---|---|---|---|
| RxNorm approximate term match | https://rxnav.nlm.nih.gov/REST/approximateTerm.json | per drug mention at grading | 7 days in SQLite |
| RxNorm rxcui properties | https://rxnav.nlm.nih.gov/REST/rxcui/{rxcui}/properties.json | per matched RxCUI | 7 days |
| openFDA NDC Directory | https://api.fda.gov/drug/ndc.json | per matched drug | 7 days |
| Curated test corpus | corpus/scripts/*.yaml (in-repo) | loaded at server start | n/a |
When RxNorm or openFDA is unreachable, grading does not crash — drugs are still scored against ground truth via fuzzy string match, and the response includes a reason field on each validation explaining the fallback.
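A minimal sketch of that fallback, under stated assumptions: `lookup` is a hypothetical injected function standing in for the RxNorm/openFDA call (it resolves to an RxCUI or throws when the service is unreachable), fuzzy matching uses Levenshtein distance with a threshold of 2, and the result shape is illustrative. The README does not say which fuzzy-match technique the grader actually uses.

```javascript
// Classic single-row Levenshtein edit distance.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]; // holds dp[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j]; // dp[i-1][j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(row[j] + 1, row[j - 1] + 1, prev + cost);
      prev = tmp;
    }
  }
  return row[b.length];
}

// Closest ground-truth name within maxDistance edits, or null.
function fuzzyMatch(name, groundTruthNames, maxDistance = 2) {
  let best = null;
  for (const candidate of groundTruthNames) {
    const distance = levenshtein(name.toLowerCase(), candidate.toLowerCase());
    if (distance <= maxDistance && (best === null || distance < best.distance)) {
      best = { name: candidate, distance };
    }
  }
  return best;
}

async function validateDrug(name, groundTruthNames, lookup) {
  // Simplification: this sketch always matches ground truth by fuzzy string;
  // only `rxcui` and `reason` differ between the two paths.
  const matched = fuzzyMatch(name, groundTruthNames)?.name ?? null;
  try {
    const rxcui = await lookup(name); // may throw when the service is down
    return { name, rxcui, matched, reason: null };
  } catch (err) {
    return {
      name,
      rxcui: null,
      matched,
      reason: `lookup failed (${err.message}); scored by fuzzy string match only`,
    };
  }
}
```

Injecting `lookup` keeps the fallback testable offline: passing a function that throws exercises the degraded path without touching the network.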
Run locally
```bash
npm install
PORT=4779 node server.js
```
Then visit http://localhost:4779/scribe-trials/.
Smoke test
```bash
# Health
curl -s http://localhost:4779/scribe-trials/health
# -> {"ok":true,...}

# List trials
curl -s http://localhost:4779/scribe-trials/api/scripts | jq 'length'
# -> 8

# Grade a deliberately broken submission
curl -s -X POST http://localhost:4779/scribe-trials/api/submit \
  -H 'content-type: application/json' \
  -d '{"script_id":"03-depression-suicidal-ideation","vendor":"test","scribe_output":"Patient on sertaline 50mg daily. Reports insomnia. No SI."}' \
  | jq '{total:.score_total, drug:.score_drug, mental:.score_mental, fab:.score_fabrication}'

# Confirm RxNorm was actually hit
sqlite3 data.db "SELECT source, count(*) FROM drug_cache GROUP BY source;"
```
Out of scope
- Authentication, accounts, sessions
- Audio transcription / EHR / FHIR integration
- HIPAA-compliant storage
- LLM-based grading (deterministic rule-based only)
- Mobile-first UI (desktop-first; doesn't break on mobile)
- Payments / billing
Adding your own trials
Drop a new YAML file in corpus/scripts/ matching the schema in any existing file. Restart the server. The corpus is hashed at load time; the hash appears on every trial card and submission so vendors can prove which version they scored against.
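One way such a version hash could be derived is sketched below. The choice of SHA-256, the 12-character truncation, hashing the whole corpus in one pass, and sorting files by name are all assumptions; the README does not specify the algorithm, and `corpusHash` is a hypothetical name.

```javascript
const { createHash } = require('node:crypto');

// files: [{ name: '01-example.yaml', content: '...' }]
function corpusHash(files) {
  const hash = createHash('sha256');
  // Sort by file name so the result does not depend on directory-listing order.
  const sorted = [...files].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  for (const { name, content } of sorted) {
    // NUL separators keep "ab" + "c" distinct from "a" + "bc".
    hash.update(name).update('\0').update(content).update('\0');
  }
  return hash.digest('hex').slice(0, 12); // short form for display (assumed length)
}
```

Any edit to a trial file changes the hash, which is what lets a vendor show which corpus version a score was earned against.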
Schema (per script): id, title, difficulty (easy|medium|hard), summary, encounter_text, instructions, ground_truth { meds[], allergies[], mental_flags[], deferred_dx[], fabrication_traps[] }, weights { drug, mental, fabrication } (must sum to 1.0).
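Before restarting the server, a new trial can be sanity-checked against the schema with something like the sketch below. It operates on the already-parsed YAML object. Treating every string field as required and non-empty, requiring every ground-truth list to be present (possibly empty), and the floating-point tolerance are assumptions; `validateScript` is a hypothetical helper, not part of the project.

```javascript
const DIFFICULTIES = ['easy', 'medium', 'hard'];
const STRING_FIELDS = ['id', 'title', 'summary', 'encounter_text', 'instructions'];
const GROUND_TRUTH_LISTS = ['meds', 'allergies', 'mental_flags', 'deferred_dx', 'fabrication_traps'];
const WEIGHT_KEYS = ['drug', 'mental', 'fabrication'];

// Returns a list of human-readable problems; an empty list means the script looks valid.
function validateScript(script) {
  const errors = [];
  for (const key of STRING_FIELDS) {
    if (typeof script[key] !== 'string' || script[key].trim() === '') {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  if (!DIFFICULTIES.includes(script.difficulty)) {
    errors.push('difficulty must be easy, medium, or hard');
  }
  for (const key of GROUND_TRUTH_LISTS) {
    if (!Array.isArray(script.ground_truth?.[key])) {
      errors.push(`ground_truth.${key} must be a list`);
    }
  }
  const weights = script.weights ?? {};
  // A missing or non-numeric weight turns the sum into NaN, which fails the check.
  const sum = WEIGHT_KEYS.reduce(
    (acc, k) => acc + (typeof weights[k] === 'number' ? weights[k] : NaN),
    0
  );
  if (!(Math.abs(sum - 1) < 1e-9)) {
    errors.push('weights must sum to 1.0');
  }
  return errors;
}
```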