llm-name-blacklist
A public, searchable database of the fake personas that frontier LLMs converge on when asked to invent people — names like "Elias Thorne the lighthouse keeper" or "Dr. Sarah Chen the oncologist".
Built for trust & safety teams at marketplaces (Amazon, Etsy, Substack), book publishers screening submissions, journalists hunting AI ghosts, and anyone Googling a suspicious Amazon reviewer or self-published author. Paste a name → get back "this is what 8 of 10 LLMs answer when asked to invent a small-business owner".
How it works
- Sample. The seed job runs 20 prompt templates × 6 models × 10 samples = 1,200 generations at temperature 0.7 against four cheap OSS chat models via OpenRouter (Llama-3.1-8B, Qwen-2.5-7B, Mistral-7B, DeepSeek-Chat) plus Anthropic Claude Haiku 4.5 and OpenAI GPT-4o mini. Every raw completion is persisted.
- Extract. A small regex-first extractor pulls candidate `[Capitalised Capitalised]` pairs (with optional `Dr.`/`Captain` prefixes) and filters out a hand-curated stoplist of ~150 false-positive capitalised words ("Monday", "Marketplace", "Lighthouse"). No ML.
- Score. For every distinct normalised name: `agreement_score = distinct_models · ln(1 + distinct_prompts) · ln(1 + total_mentions)`. The leaderboard sorts by this score.
- Cross-reference. When you open a name's detail page, the server fetches Google Books `inauthor:"<name>"` and Brave Search `"<name>"` to surface where the name shows up in real publications and on the web. Cached 24 hours per name + source.
- Refresh. A node-cron job re-runs the full prompt grid every Sunday at 02:00 UTC. `first_seen` is preserved per name; `last_seen` is updated.
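
The Extract and Score steps above can be sketched in a few lines. This is a minimal illustration, not the code in lib/extractor.js or lib/aggregator.js: the regex, the three-word stoplist, the lowercase normalisation, and the function names `extractNames`, `agreementScore`, and `scoreMentions` are all assumptions made for the example. Only the score formula comes from this README.

```javascript
// Illustrative stand-in for the ~150-word stoplist (assumed contents).
const STOPLIST = new Set(["monday", "marketplace", "lighthouse"]);

// Optional honorific, then exactly two capitalised words.
// A real extractor would also need to handle names like "O'Brien" or "McDonald".
const NAME_RE = /\b(?:(Dr\.|Captain)\s+)?([A-Z][a-z]+)\s+([A-Z][a-z]+)\b/g;

// Returns normalised "first last" strings (lowercase, honorific dropped).
function extractNames(text) {
  const found = [];
  for (const match of text.matchAll(NAME_RE)) {
    const first = match[2].toLowerCase();
    const last = match[3].toLowerCase();
    // Drop the pair if either word is a known false positive.
    if (STOPLIST.has(first) || STOPLIST.has(last)) continue;
    found.push(`${first} ${last}`);
  }
  return found;
}

// agreement_score = distinct_models · ln(1 + distinct_prompts) · ln(1 + total_mentions)
function agreementScore({ distinctModels, distinctPrompts, totalMentions }) {
  return distinctModels * Math.log1p(distinctPrompts) * Math.log1p(totalMentions);
}

// mentions: [{ name, model, prompt }] -> [{ name, score }] sorted by score DESC.
function scoreMentions(mentions) {
  const byName = new Map();
  for (const { name, model, prompt } of mentions) {
    const entry = byName.get(name) ?? { models: new Set(), prompts: new Set(), total: 0 };
    entry.models.add(model);
    entry.prompts.add(prompt);
    entry.total += 1;
    byName.set(name, entry);
  }
  return [...byName.entries()]
    .map(([name, e]) => ({
      name,
      score: agreementScore({
        distinctModels: e.models.size,
        distinctPrompts: e.prompts.size,
        totalMentions: e.total,
      }),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Because the model count enters linearly while prompts and mentions enter through logarithms, a name that several models agree on outranks one that a single model repeats many times.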
API
All endpoints live under /llm-name-blacklist.
| Method | Path | What it returns |
|---|---|---|
| GET | /health | { ok, names, runs, last_seed, last_refresh } |
| GET | /api/names?limit=200&archetype=…&minModels=… | Top-N names sorted by agreement_score DESC. Optional archetype filter (profession, review_author, small_biz, character, expert_quote). Optional minModels to filter to high-convergence names. |
| GET | /api/names/:slug | Full record: name row + per-model breakdown + per-prompt breakdown + sample raw outputs + wild findings (lazy-fetches Google Books + Brave on first view, cached 24h). |
| GET | /api/search?q= | Prefix + substring match against name_norm. Returns up to 25 results. |
| GET | /api/stats | Global counters: { totals, last_seed, last_refresh, prompts, models }. |
| GET | /api/archetypes | Distinct archetypes with distinct_names + mentions per archetype. |
| POST | /admin/seed | Kicks the full seed (fire-and-forget). 202 on accept, 409 if a batch is already running. Optional ?samples=N. |
| POST | /admin/refresh | Same shape — runs a refresh pass (preserves first_seen). |
| GET | /admin/status | { running, kind, phase, completed, total, errors, started_at, last_error } — live progress for the seed/refresh UI. |
| GET | / and /name/:slug | Serves the SPA. |
No auth on any endpoint. The product is read-only public data; admin triggers are mutex-guarded, so a second trigger sent while a batch is running gets a 409 and starts nothing.
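
One way to picture the mutex guard behind the admin triggers is the single-flight helper below. It is a hypothetical sketch: `createBatchGuard`, `tryStart`, and `status` are names invented for this example, and the real routes/admin.js may be structured differently. The 202 and 409 status codes are the ones listed in the table above.

```javascript
// Hypothetical single-flight guard; not the project's actual implementation.
function createBatchGuard() {
  let running = null; // "seed" | "refresh" | null

  return {
    // Returns the HTTP status an admin route could send back.
    tryStart(kind, job) {
      if (running !== null) return 409; // a batch is already in flight
      running = kind;
      // Fire-and-forget: release the guard whether the job resolves or rejects.
      // A fuller version would record the failure (e.g. as last_error) instead
      // of swallowing it.
      Promise.resolve()
        .then(job)
        .catch(() => {})
        .finally(() => {
          running = null;
        });
      return 202;
    },
    status() {
      return { running: running !== null, kind: running };
    },
  };
}
```

The guard flips to "running" synchronously, before the job starts, so two requests arriving back to back cannot both be accepted.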
Data sources
Every "live" value in the UI traces to one of these real APIs.
| Source | URL | Refresh |
|---|---|---|
| OpenRouter Chat Completions | https://openrouter.ai/api/v1/chat/completions | Once at seed; weekly (Sun 02:00 UTC) thereafter |
| Anthropic Messages | https://api.anthropic.com/v1/messages (claude-haiku-4-5-20251001) | Once at seed; weekly |
| OpenAI Chat Completions | https://api.openai.com/v1/chat/completions (gpt-4o-mini) | Once at seed; weekly |
| Google Books Volumes | https://www.googleapis.com/books/v1/volumes?q=inauthor:"<name>"&maxResults=5 | Per name, on first detail-page view; cached 24h |
| Brave Search Web | https://api.search.brave.com/res/v1/web/search?q="<name>"&count=10 | Per name, on first detail-page view; cached 24h |
If any provider fails mid-batch, that single call is recorded in runs.error and the batch continues. The aggregator only consumes successful runs.
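
The failure handling described above amounts to wrapping each provider call individually. The sketch below assumes a hypothetical `callProvider(job)` function and an in-memory `runs` array standing in for the runs table; it also runs jobs sequentially for clarity, whereas lib/sampler.js is described as a worker pool.

```javascript
// Sketch only: `callProvider` and the shape of a run row are assumptions.
async function runBatchSketch(jobs, callProvider) {
  const runs = [];
  for (const job of jobs) {
    try {
      const output = await callProvider(job);
      runs.push({ ...job, output, error: null });
    } catch (err) {
      // Record the failure for this single call and keep going.
      runs.push({ ...job, output: null, error: String(err?.message ?? err) });
    }
  }
  return runs;
}

// The aggregator would only look at rows without an error.
function successfulRuns(runs) {
  return runs.filter((run) => run.error === null);
}
```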
Run locally
```bash
npm install
cp .env.example .env
# Fill in OPENROUTER_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY, BRAVE_API_KEY.
# GOOGLE_BOOKS_API_KEY is optional — empty falls back to the unauthenticated tier.
PORT=4797 node server.js
# Server logs at http://localhost:4797/llm-name-blacklist/
```
Then trigger the seed:
```bash
curl -X POST http://localhost:4797/llm-name-blacklist/admin/seed
# Watch progress:
watch -n2 'curl -s http://localhost:4797/llm-name-blacklist/admin/status'
# After ~5 min on real APIs, /api/stats shows non-zero totals.
```
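
Once the seed has finished, the leaderboard can also be queried from Node. The sketch below uses the global `fetch` available in Node 22 and the query parameters documented for /api/names; the helper names `buildNamesUrl` and `fetchTopNames` are made up for this example, and the response is returned as parsed JSON without assuming its exact shape.

```javascript
// Base URL assumed from the local run above.
const BASE = "http://localhost:4797/llm-name-blacklist";

// Builds the /api/names URL from the documented query parameters.
function buildNamesUrl(base, { limit = 200, archetype, minModels } = {}) {
  const url = new URL(`${base}/api/names`);
  url.searchParams.set("limit", String(limit));
  if (archetype) url.searchParams.set("archetype", archetype);
  if (minModels !== undefined) url.searchParams.set("minModels", String(minModels));
  return url.toString();
}

// Fetches the top names; throws on a non-2xx response.
async function fetchTopNames(options) {
  const res = await fetch(buildNamesUrl(BASE, options));
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}

// Example call (needs a running, seeded server):
// fetchTopNames({ limit: 10, archetype: "small_biz", minModels: 4 }).then(console.log);
```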
Stack
- Node 22+, ES modules
- Express for routing
- better-sqlite3 (WAL) for storage
- node-cron for the weekly refresh
- helmet + compression
- Vanilla JS SPA — no build step
File layout
```
server.js             Express bootstrap, mounts router at /llm-name-blacklist
db.js                 better-sqlite3 + WAL + schema migration on boot
config/prompts.json   20 prompt templates by archetype
config/models.json    6 models across 3 providers
routes/api.js         GET /api/names, /api/names/:slug, /api/search, /api/stats, /api/archetypes
routes/admin.js       POST /admin/seed, /admin/refresh; GET /admin/status (mutex-guarded, no auth)
routes/pages.js       SPA fallback for / and /name/:slug
lib/providers/        openrouter.js, anthropic.js, openai.js: real chat completions, retries, 429 backoff
lib/extractor.js      Regex + honorific-aware name extractor
lib/stoplist.js       ~150 capitalised words that are not names
lib/aggregator.js     Recompute names.agreement_score from name_mentions
lib/sampler.js        runBatch(): worker-pool orchestrator with live progress state
lib/books.js          Google Books inauthor lookup
lib/brave.js          Brave web search
lib/wild.js           getOrFetchWildFindings(slug, display), 24h caching
lib/configLoader.js   Loads config/*.json into the prompts + models tables
jobs/seed.js          Initial run + aggregate
jobs/refresh.js       Weekly re-run; preserves first_seen, updates last_seen
jobs/cron.js          Registers node-cron Sunday 02:00 UTC entry
public/index.html     SPA shell
public/app.js         Router + views (home, detail, search)
public/style.css      Dark theme
```
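
The 24h caching attributed to lib/wild.js can be approximated with a small get-or-fetch wrapper keyed by name and source. This is only a sketch: `createTtlCache` is a hypothetical name, the in-memory `Map` is an assumption made to keep the example self-contained (the project may well persist its cache in SQLite), and the injectable `now` clock exists purely to make the expiry logic easy to test.

```javascript
const TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

// Wraps `fetcher(name, source)` so repeat lookups within the TTL reuse the cached value.
function createTtlCache(fetcher, { ttlMs = TTL_MS, now = Date.now } = {}) {
  const entries = new Map(); // "source:name" -> { value, fetchedAt }

  return async function getOrFetch(name, source) {
    const key = `${source}:${name}`;
    const hit = entries.get(key);
    if (hit && now() - hit.fetchedAt < ttlMs) return hit.value;

    const value = await fetcher(name, source);
    entries.set(key, { value, fetchedAt: now() });
    return value;
  };
}
```

Keying on both name and source means a stale Brave result can be refreshed without refetching the Google Books entry for the same name.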
Status
- Server boots, mounts under `/llm-name-blacklist`, health endpoint returns `{ok:true}`.
- 20 prompts × 6 models are loaded into SQLite on boot from `config/*.json`.
- Endpoint surface (`/api/names`, `/api/search`, `/api/stats`, `/api/archetypes`, `/api/names/:slug`) all respond and serve real data once the seed has run.
- The seed itself requires the four API keys present in env at runtime. Without them the seed job will record `runs.error` rows and produce zero mentions; nothing crashes, but you'll have no leaderboard until keys are injected.
- Weekly node-cron entry is registered on boot for Sundays at 02:00 UTC.
Not done & not planned
- No auth, no login, no rate-limiting. This is a public read-only ledger.
- No takedown automation, no NER, no avatar synthesis.
- No Amazon scraping — Brave Search covers the "in the wild" axis without ToS hazards.
- Visitor-submitted prompts are out of scope (abuse / cost sink).
- The "$49/mo API for T&S teams" is a future plan, not in this build.