A 2024 audit by the Ontario Office of the Auditor General reviewed 20 AI medical scribe systems and found that 9/20 fabricated treatment plans and 12/20 inserted wrong drug information. No public benchmark exists for buyers to verify vendor claims. This is that benchmark.
Source: Ontario Auditor General, "Audit of Artificial Intelligence Pilots in Health Care" (2024). Statistics quoted publicly.
Eight hand-authored synthetic patient encounters with structured ground truth: meds with dose/frequency, allergies, mental-health flags, deferred diagnoses, fabrication traps.
Submit whatever your AI scribe produces from the encounter: SOAP note, structured JSON, or free-form summary. We grade what you give us.
Drug names normalised against RxNorm. Dosage forms checked against openFDA NDC. Category-weighted score: Drug Accuracy, Mental-Health Capture, Fabrication Rate.
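As a rough illustration of how a category-weighted score could be combined, here is a minimal sketch. The three category names come from the description above; the weights, the 0 to 1 input scale, and the choice to invert Fabrication Rate (so that a lower rate scores higher) are assumptions made for the example, not the benchmark's published formula.

```python
from typing import Mapping

# Hypothetical weights: the three categories are named on this page,
# but their actual weights are not stated.
ASSUMED_WEIGHTS = {
    "drug_accuracy": 0.5,
    "mental_health_capture": 0.25,
    "fabrication_rate": 0.25,
}


def weighted_score(
    drug_accuracy: float,
    mental_health_capture: float,
    fabrication_rate: float,
    weights: Mapping[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Combine three category values (each in [0, 1]) into a 0-100 score."""
    for value in (drug_accuracy, mental_health_capture, fabrication_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("category values must lie in [0, 1]")

    components = {
        "drug_accuracy": drug_accuracy,
        "mental_health_capture": mental_health_capture,
        # Assumption: a lower fabrication rate is better, so invert it.
        "fabrication_rate": 1.0 - fabrication_rate,
    }
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    combined = sum(weights[name] * components[name] for name in components)
    return 100.0 * combined / total_weight
```

Under these assumed weights, a run with 80% drug accuracy, full mental-health capture, and a 50% fabrication rate would land below the 80% badge threshold.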
Pick a synthetic encounter. The full ground truth is hidden until after grading.
| Found | Dose | Frequency | RxNorm | openFDA forms | Matched expected | Credit |
|---|---|---|---|---|---|---|
Vendors with ≥3 submissions, averaged across all trials. Vendors averaging ≥80% are eligible for the Trial Pass badge.
| # | Vendor | Avg score | Runs | Last run | Badge |
|---|---|---|---|---|---|
A public, deterministic, rule-based benchmark for AI medical scribes. Hand-authored synthetic encounters → you submit scribe output → graded against structured ground truth with live RxNorm and openFDA lookups.
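To make "deterministic, rule-based" concrete, the sketch below grades medications found in a scribe's output against a ground-truth list. It is illustrative only: the `Medication` type, the `grade_medications` function, and the equal one-third credit for name, dose, and frequency are all hypothetical, and plain lowercase matching stands in for the live RxNorm normalisation the real grader uses.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Medication:
    name: str
    dose: str
    frequency: str


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for RxNorm normalisation)."""
    return " ".join(text.lower().split())


def grade_medications(
    expected: Sequence[Medication], found: Sequence[Medication]
) -> List[Tuple[str, bool, float]]:
    """Return one (name, matched_expected, credit) row per found medication.

    Assumed credit rule: one third each for name, dose, and frequency.
    A medication absent from the ground truth gets no credit.
    """
    by_name = {_norm(med.name): med for med in expected}
    rows = []
    for med in found:
        target = by_name.get(_norm(med.name))
        if target is None:
            rows.append((med.name, False, 0.0))
            continue
        dose_ok = _norm(med.dose) == _norm(target.dose)
        freq_ok = _norm(med.frequency) == _norm(target.frequency)
        credit = (1 + int(dose_ok) + int(freq_ok)) / 3
        rows.append((med.name, True, credit))
    return rows
```

Because the function has no randomness and no model in the loop, the same submission always produces the same rows, which is the property that lets vendors reproduce a grade.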
Vendors with average ≥80% across at least 3 submissions earn a hotlinkable SVG badge: /scribe-trials/api/badge/<your-handle>.svg. Clinics can require it in RFPs.
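The eligibility rule above can be sketched in a few lines. The thresholds (at least 3 submissions, average ≥80%) and the badge path come from this page; the function names and the assumption that scores are percentages on a 0 to 100 scale are illustrative.

```python
from statistics import mean
from typing import Sequence

MIN_SUBMISSIONS = 3    # stated on this page
PASS_THRESHOLD = 80.0  # stated on this page, assumed to be a percentage


def badge_eligible(scores: Sequence[float]) -> bool:
    """True when there are enough submissions and their average meets the threshold."""
    if len(scores) < MIN_SUBMISSIONS:
        return False
    return mean(scores) >= PASS_THRESHOLD


def badge_path(handle: str) -> str:
    """Build the hotlinkable badge path for a vendor handle."""
    return f"/scribe-trials/api/badge/{handle}.svg"
```

Note that two strong runs are not enough on their own: `badge_eligible([90, 85])` is false until a third submission is on record.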
Corpus files: corpus/scripts/*.yaml, reloaded at server start.