llms.txt Coach

Audit any site's llms.txt for spec compliance and generate a drop-in replacement

Tags: dev-tools, llms.txt, agent-readable, audit, lighthouse, devrel, seo, ai-discovery

llmstxt-coach

Audits any URL's llms.txt against the llmstxt.org spec, returns an A–F report card with specific fixes, and generates a drop-in replacement built from the site's real sitemap. It applies the Lighthouse / securityheaders.com pattern to the new agent-readable file format.

For marketers, DevRel, and SEO operators who just read the Anna's Archive HN post and asked: "do we have an llms.txt, and is it any good?"

What it does

  1. Audit: fetch /llms.txt, /llms-full.txt, /robots.txt, and the homepage. Parse the file against the llmstxt.org spec. Probe a sample of up to 25 links with HEAD, falling back to GET, 8 at a time. Return a deterministic 9-point checklist, a score from 0 to 100, and a grade from A to F.
  2. Generate: fetch the homepage and /sitemap.xml (following sitemap-index up to one level, max 500 URLs). Bucket URLs by first path segment using a built-in mapping (/docs/ → "Documentation", /blog/ → "Blog", /api/ → "API Reference", etc.). Fetch the top 5 URLs per bucket for <title> + <meta description>. Emit a spec-valid markdown file. If OPENROUTER_API_KEY is set, Claude Haiku rewrites the descriptions only (URLs and titles are never invented).
  3. Leaderboard: every Monday at 04:00 UTC, a cron job pulls the top 1000 domains from the Tranco public list and audits each one with 2-second spacing. On first boot the first 50 are audited immediately so the page has data within ~2 minutes.
  4. Share card: each audit produces a permalink /llmstxt-coach/audit/:slug with server-rendered Open Graph meta tags pointing at a 1200×630 SVG that unfurls on Twitter/LinkedIn/Slack as "stripe.com · GRADE F".
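
The bucketing in step 2 might look roughly like the sketch below. Only the /docs/, /blog/ and /api/ mappings come from this README; the "Other" fallback, the function name `bucketUrls`, and the choice to keep the first five URLs per bucket in sitemap order are assumptions made for illustration.

```javascript
// Hypothetical mapping: only these three entries are named in the README.
const SECTION_BY_SEGMENT = new Map([
  ["docs", "Documentation"],
  ["blog", "Blog"],
  ["api", "API Reference"],
]);

// Group sitemap URLs by their first path segment and keep at most
// `perBucket` URLs in each section.
function bucketUrls(urls, perBucket = 5) {
  const buckets = new Map();
  for (const raw of urls) {
    let segment;
    try {
      segment = new URL(raw).pathname.split("/").filter(Boolean)[0];
    } catch {
      continue; // skip malformed URLs instead of failing the whole run
    }
    // Assumption: unmapped segments and the bare homepage land in "Other".
    const section = SECTION_BY_SEGMENT.get(segment) ?? "Other";
    if (!buckets.has(section)) buckets.set(section, []);
    const list = buckets.get(section);
    if (list.length < perBucket) list.push(raw);
  }
  return buckets;
}
```

Using a `Map` for the lookup rather than a plain object avoids accidental hits on inherited keys such as a `/constructor/` path segment.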

Grader rubric

| # | Check | Weight | Pass condition |
|---|---|---|---|
| 1 | llms.txt returns 200, non-empty | 30 | HTTP 200 + body (auto-F if fails) |
| 2 | Has H1 site name | 10 | First non-blank line starts with # |
| 3 | Has > blockquote summary after H1 | 10 | Next non-blank block is a blockquote |
| 4 | ≥ 2 H2 sections | 10 | Lines starting with ## ≥ 2 |
| 5 | Each section has ≥ 1 link item | 10 | - [name](url): desc parses |
| 6 | Sampled links return 200 | 15 | Prorated (live / sampled) × 15, max 25 sampled |
| 7 | robots.txt does not Disallow: /llms.txt for * | 5 | No matching Disallow rule |
| 8 | llms-full.txt also exists (bonus) | 5 | HTTP 200 |
| 9 | Mean link description ≥ 20 chars | 5 | Content-quality signal |
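
As an illustration of the structural checks (#2 to #5 and #9), here is a minimal sketch that works on the raw file body. The function name, the result property names, and the exact regular expressions are assumptions, not the grader's actual code; the H1 test is deliberately a little stricter than "starts with #" so that an H2 on the first line is not mistaken for the site name.

```javascript
// Matches "- [name](url)" with an optional ": description" tail.
const LINK_ITEM = /^\s*-\s+\[([^\]]+)\]\(([^)\s]+)\)\s*(?::\s*(.*))?$/;

function checkStructure(text) {
  const lines = text.split(/\r?\n/);
  const nonBlank = lines.filter((line) => line.trim() !== "");

  // #2: first non-blank line is a single "#" followed by text.
  const h1 = nonBlank.length > 0 && /^#\s+\S/.test(nonBlank[0]);
  // #3: the next non-blank line opens a blockquote.
  const summary = h1 && nonBlank.length > 1 && nonBlank[1].startsWith(">");

  // Collect H2 sections and the link items that follow each one.
  const sections = [];
  for (const line of lines) {
    if (/^##\s+\S/.test(line)) {
      sections.push({ title: line.replace(/^##\s+/, "").trim(), links: [] });
    } else if (sections.length > 0) {
      const match = LINK_ITEM.exec(line);
      if (match) {
        sections.at(-1).links.push({
          name: match[1],
          url: match[2],
          desc: (match[3] ?? "").trim(),
        });
      }
    }
  }

  const allLinks = sections.flatMap((s) => s.links);
  const meanDesc =
    allLinks.length === 0
      ? 0
      : allLinks.reduce((sum, l) => sum + l.desc.length, 0) / allLinks.length;

  return {
    h1,
    summary,
    sections: sections.length >= 2, // #4
    linkItems: sections.length > 0 && sections.every((s) => s.links.length > 0), // #5
    descriptions: meanDesc >= 20, // #9
  };
}
```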

Grades: A ≥ 90 · B ≥ 80 · C ≥ 70 · D ≥ 60 · F < 60. Auto-F if check #1 fails.
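
The rubric and grade boundaries above can be expressed as one small pure function. This is a sketch under stated assumptions: the property names on `checks` are hypothetical, the score is rounded to the nearest integer, and zero sampled links earns nothing for check #6. None of those details is specified in this README.

```javascript
// Weights copied from the rubric table; the property names are hypothetical.
const WEIGHTS = {
  reachable: 30, // #1
  h1: 10, // #2
  summary: 10, // #3
  sections: 10, // #4
  linkItems: 10, // #5
  robotsAllowed: 5, // #7
  fullFile: 5, // #8
  descriptions: 5, // #9
};
const LINK_WEIGHT = 15; // #6, prorated by live / sampled

function gradeFor(score) {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}

function scoreAudit(checks, liveLinks, sampledLinks) {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    if (checks[name]) score += weight;
  }
  // Assumption: with no sampled links, check #6 contributes 0 points.
  if (sampledLinks > 0) score += (liveLinks / sampledLinks) * LINK_WEIGHT;
  score = Math.round(score); // assumption: nearest-integer rounding
  // Auto-F when check #1 fails, whatever the remaining checks add up to.
  const grade = checks.reachable ? gradeFor(score) : "F";
  return { score, grade };
}
```

For example, a site that passes everything except the llms-full.txt bonus and has 20 of 25 sampled links live scores 80 + 12 = 92, which is an A.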

API endpoints (all under /llmstxt-coach)

| Method | Path | Purpose |
|---|---|---|
| GET | /health | {ok, version, audits_count, last_cron} |
| POST | /api/audit | Body {url}. Returns {slug, grade, score, checks, llms_txt_raw, share_url, og_url, …}. Idempotent: same URL within 60s returns cached row. |
| GET | /api/audit/:slug | Persisted audit JSON. |
| GET | /audit/:slug | SPA shell with server-rendered og:image, og:title, twitter:card meta for social unfurls. |
| POST | /api/generate | Body {url}. Returns {markdown, sources, used_llm, domain}. |
| GET | /api/leaderboard?sort=rank\|score\|grade&limit=&min_grade= | Array of {rank, domain, grade, score, audited_at, slug}. |
| GET | /api/leaderboard/refresh-status | {last:<cron_log row>, audited_count}. |
| GET | /leaderboard, /generate | SPA shell (hash-routed client-side). |
| GET | /og/:slug.svg | 1200×630 SVG OG card. Cache-Control: public, max-age=86400. |
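
A minimal client for POST /api/audit, using the `fetch` that ships with Node ≥ 22, could look like this. The base URL assumes a local instance on port 4851, and the helper name `auditSite` and its injectable `fetchImpl` parameter are hypothetical. Error bodies are not documented here, so the sketch reports only the HTTP status.

```javascript
// Assumption: a local instance started as described under "Run locally".
const BASE = "http://localhost:4851/llmstxt-coach";

// `fetchImpl` is injectable so the helper can be exercised without a server.
async function auditSite(url, fetchImpl = fetch) {
  const res = await fetchImpl(`${BASE}/api/audit`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) {
    // Error responses are not documented, so only the status is surfaced.
    throw new Error(`audit request failed with HTTP ${res.status}`);
  }
  // Pick a few of the documented response fields.
  const { slug, grade, score, share_url } = await res.json();
  return { slug, grade, score, share_url };
}
```

Against a running instance, `auditSite("https://docs.anthropic.com")` should resolve to an object carrying the slug, grade, score, and share URL; a repeat call for the same URL within 60 seconds returns the cached row.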

Data sources

Every claim of "live data" maps to a real URL, fetched at runtime. No seeds, no fakes, no Math.random.

| Source | URL | Frequency | On failure |
|---|---|---|---|
| Target llms.txt | https://{target}/llms.txt | On demand per audit (5s timeout) | has_llms_txt=false, auto-F |
| Target llms-full.txt | https://{target}/llms-full.txt | On demand per audit | Bonus check fails, −5 pts |
| Target robots.txt | https://{target}/robots.txt | On demand per audit | Robots check skipped |
| Target homepage HTML | https://{target}/ | On demand per audit + generate | If 4xx/5xx: audit aborts; generate falls back to sitemap-only meta |
| Target sitemap | https://{target}/sitemap.xml and /sitemap_index.xml | On demand per generate (max 500 URLs, max 5 nested) | 422 "no sitemap found" |
| Sampled link probes | each <url> inside parsed llms.txt | On demand per audit (HEAD with GET fallback, concurrency 8, 4s timeout, 25 sampled) | Each marked dead, prorated points |
| Tranco top-1M | https://tranco-list.eu/top-1m-id then https://tranco-list.eu/download/{id}/1000 | Weekly cron 0 4 * * 1 UTC; first boot also seeds first 50 | Logged to cron_log table; previous week's roster retained |
| OpenRouter (optional) | https://openrouter.ai/api/v1/chat/completions (anthropic/claude-haiku-4.5) | On demand per generate, only if OPENROUTER_API_KEY set | Falls back silently to deterministic template |
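
The link-probe row (HEAD with GET fallback, concurrency 8, 4s timeout, 25 sampled) can be sketched with a small worker pool. The function names are hypothetical, and taking the first 25 links as the sample is an assumption, since this README does not say how the sample is drawn.

```javascript
// Probe one URL: try HEAD first, then GET if HEAD errors or is not a 200.
async function probe(url, fetchImpl = fetch, timeoutMs = 4000) {
  for (const method of ["HEAD", "GET"]) {
    try {
      const res = await fetchImpl(url, {
        method,
        redirect: "follow",
        signal: AbortSignal.timeout(timeoutMs),
      });
      if (res.status === 200) return true;
    } catch {
      // network error or timeout: fall through to the next method
    }
  }
  return false; // marked dead
}

// Probe up to `sample` URLs with at most `concurrency` requests in flight.
async function probeAll(urls, { sample = 25, concurrency = 8, fetchImpl = fetch } = {}) {
  const sampled = urls.slice(0, sample); // assumption: first N links
  const results = new Array(sampled.length).fill(false);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed index until none are left.
    while (next < sampled.length) {
      const i = next++;
      results[i] = await probe(sampled[i], fetchImpl);
    }
  }
  const workers = Array.from({ length: Math.min(concurrency, sampled.length) }, () => worker());
  await Promise.all(workers);
  return { live: results.filter(Boolean).length, sampled: sampled.length };
}
```

The returned `live` and `sampled` counts are the two numbers the rubric prorates for check #6.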

Cloudflare Radar was dropped because its top-domains feed requires a paid API token. Tranco is fully open and authoritative; one real source beats two flaky ones.

Run locally

```bash
npm install
PORT=4851 node server.js
# open http://localhost:4851/llmstxt-coach/
```

Smoke tests against the locally running server:

```bash
# Real audit of a site that ships llms.txt
curl -s -X POST http://localhost:4851/llmstxt-coach/api/audit \
  -H "Content-Type: application/json" \
  -d '{"url":"https://docs.anthropic.com"}' | jq '{grade, score, has_llms_txt, link_count}'

# Missing case
curl -s -X POST http://localhost:4851/llmstxt-coach/api/audit \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}' | jq '{grade, has_llms_txt}'

# Generate from real sitemap
curl -s -X POST http://localhost:4851/llmstxt-coach/api/generate \
  -H "Content-Type: application/json" \
  -d '{"url":"https://nextjs.org"}' | jq '{used_llm, sitemap_urls_count:.sources.sitemap_urls_count, sections_built:.sources.sections_built}'

# Leaderboard (populated by Tranco seed on first boot)
curl -s "http://localhost:4851/llmstxt-coach/api/leaderboard?limit=10" | jq '.[0]'
```

Tech

Node ≥ 22 · Express · better-sqlite3 (WAL) · node-cron · helmet · compression · xml2js. SQLite file at ./data.db (override with DB_PATH). Vanilla-JS SPA frontend (no build step). Dark theme.

Out of scope

No auth, no accounts, no scheduled re-audits per user, no hosting the generated file, no PNG OG cards (SVG only — Twitter/LinkedIn render SVG og:image fine), no multi-language generation, no deep-crawl beyond homepage + sitemap, no payments, no admin pages.

Environment

```
PORT=4851
NODE_ENV=production
OPENROUTER_API_KEY=__INJECT_FROM_VAULT__   # optional; only used to polish generate output
BRAVE_API_KEY=__INJECT_FROM_VAULT__        # unused today; reserved for future search-enriched generate
```

OpenRouter is optional and the product works fully without it; the LLM only rewrites scraped descriptions and never invents URLs or section names.