grep-tax

Score any public GitHub repo on how many tokens an AI coding agent burns to
navigate it. Letter grade + shareable README badge.

What it does

Paste a public GitHub URL → grep-tax shallow-clones the repo (git clone --depth 1,
90-second timeout), enumerates source files with an extension whitelist, then runs
a fixed five-question navigation benchmark using two strategies:

Naive (what Claude Code / Cursor do): pick a high-IDF keyword, grep across
every file, sort matches by hit count, read each full file until 95% of the
ground-truth chunks are covered.
Smart (what a Semble-style retriever does): chunk every file into 50-line
windows (5-line overlap), BM25-rank chunks against the question, read chunks in
rank order until 95% coverage.

Token counts come from gpt-tokenizer (cl100k_base). The ratio of the two numbers
is the "grep tax" — the cost an AI agent pays when it doesn't have semantic search.

Ground truth is regex-based and deterministic. Every scorecard lists the exact
patterns used, so the grade is reproducible.

Endpoints

All routes mount under /grep-tax. Health endpoint has no auth.

| Method | Path | Purpose |
|--------|------|---------|
| GET | /grep-tax/health | {ok:true} 200 |
| GET | /grep-tax/ | SPA: submit form, leaderboards |
| GET | /grep-tax/r/:owner/:name | Scorecard HTML with OG meta |
| POST | /grep-tax/api/scan | {url} → {id,status,cached} |
| GET | /grep-tax/api/scan/:id | Scan status + partial results |
| GET | /grep-tax/api/repo/:owner/:name | Latest complete scan for a repo (404 if never scanned) |
| GET | /grep-tax/api/leaderboard?sort=grade\|ratio&order=asc\|desc&limit=20 | Top repos |
| GET | /grep-tax/api/recent?limit=10 | Last N completed scans |
| GET | /grep-tax/badge/:owner/:name.svg | Shields-style SVG, 24h Cache-Control |

Data sources

Every fact in this product is fetched from a real public endpoint at scan time.
No seeded leaderboards. No mock benchmarks.

| Source | URL | When |
|--------|-----|------|
| GitHub REST API — repo metadata (stars, primary language, size, default branch) | https://api.github.com/repos/{owner}/{repo} | Once per scan submit. 60 req/hr unauthenticated; set GITHUB_TOKEN for 5000 req/hr. |
| GitHub Git protocol — shallow clone | https://github.com/{owner}/{repo}.git via git clone --depth 1 --filter=blob:limit=256k | Once per scan (skipped on 24 h cache hit). 90 s timeout. Working tree deleted after scan. |

There is no embedding API call, no LLM judge, no third-party scoring service.
BM25 and tokenization run locally.

Refresh / cache

Per-repo cache: the latest completed scan is served for 24 h. To force a
fresh scan, POST with {"url":"…","force":true} or use the rescan button on
any scorecard.
Badge cache: the SVG endpoint sets Cache-Control: public, max-age=86400
so README badges don't hammer the server when a popular repo's page loads.

Running locally

npm install
PORT=4806 npm start
# → http://127.0.0.1:4806/grep-tax/

GITHUB_TOKEN in .env (or in your shell) is optional — it only raises the
metadata rate limit from 60/hr to 5000/hr. Cloning works unauthenticated.

File layout

grep-tax/
├── server.js                 Express boot, routes mount, queue worker tick
├── db.js                     better-sqlite3 init (WAL) + prepared statements
├── routes/
│   ├── scan.js               POST /api/scan, GET /api/scan/:id
│   ├── repo.js               GET /api/repo/:owner/:name
│   ├── leaderboard.js        GET /api/leaderboard, GET /api/recent
│   ├── badge.js              GET /badge/:owner/:name.svg
│   └── report.js             GET /r/:owner/:name (HTML shell + OG meta)
├── lib/
│   ├── github.js             URL parsing, REST metadata fetch, size guards
│   ├── clone.js              shallow clone, file enumeration, extension filter
│   ├── benchmark.js          the 5 fixed queries + ground-truth regex
│   ├── grep.js               naive grep+read simulator
│   ├── retriever.js          chunker (50-line / 5-overlap) + BM25 ranker
│   ├── tokens.js             gpt-tokenizer wrapper + per-scan cache
│   ├── grade.js              avg tokens → letter grade + color
│   ├── queue.js              single-worker scan pipeline
│   └── badge-svg.js          shields-style SVG generator
└── public/
    ├── index.html            home SPA (form + leaderboards)
    ├── app.js                submit + live-poll + leaderboard render
    ├── report.js             scorecard hydration
    └── style.css             dark theme

What's out of scope

Auth, login, API keys (every endpoint is public).
Private repos.
Real embedding models — BM25 over chunks is the smart strategy for the MVP.
Custom queries — the five questions are fixed.
Payments, tiers, accounts, webhooks.
Tree-sitter / LSP / AST parsing — regex on file contents only.

Stack

Node ≥ 22, Express 5, better-sqlite3 (WAL), helmet, compression, nanoid,
gpt-tokenizer. SQLite is the only persistence. No worker queue infra.