grep-tax
Score any public GitHub repo on how many tokens an AI coding agent burns to
navigate it. Letter grade + shareable README badge.
What it does
Paste a public GitHub URL → grep-tax shallow-clones the repo (git clone --depth 1,
90-second timeout), enumerates source files with an extension whitelist, then runs
a fixed five-question navigation benchmark using two strategies:
- Naive (what Claude Code / Cursor do): pick a high-IDF keyword, grep across
  every file, sort matches by hit count, read each full file until 95% of the
  ground-truth chunks are covered.
- Smart (what a Semble-style retriever does): chunk every file into 50-line
  windows (5-line overlap), BM25-rank chunks against the question, read chunks in
  rank order until 95% coverage.
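The smart strategy can be sketched roughly as follows. This is an illustration, not the code in lib/retriever.js: the names `chunkLines`, `tokenize`, and `bm25Rank`, the word-splitting regex, and the BM25 parameters (k1 = 1.2, b = 0.75, with the non-negative IDF variant) are assumptions. Only the 50-line window and 5-line overlap come from the description above.

```javascript
// Split text into windows of `size` lines; each window overlaps the previous
// one by `overlap` lines. Line numbers in the result are 1-based and inclusive.
function chunkLines(text, size = 50, overlap = 5) {
  const lines = text.split("\n");
  const step = size - overlap;
  const chunks = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + size, lines.length);
    chunks.push({ start: start + 1, end, text: lines.slice(start, end).join("\n") });
    if (end === lines.length) break; // this window already reached the end of the file
  }
  return chunks;
}

// ASSUMPTION: a simple lowercase identifier tokenizer for ranking purposes only.
const tokenize = (s) => s.toLowerCase().match(/[a-z0-9_]+/g) ?? [];

// Score every chunk against the query with BM25 and return them best-first.
function bm25Rank(chunks, query, k1 = 1.2, b = 0.75) {
  const docs = chunks.map((c) => tokenize(c.text));
  const avgLen = docs.reduce((n, d) => n + d.length, 0) / Math.max(docs.length, 1);
  const terms = [...new Set(tokenize(query))];
  // Document frequency: how many chunks contain each query term.
  const df = new Map(terms.map((t) => [t, docs.filter((d) => d.includes(t)).length]));
  const scored = chunks.map((chunk, i) => {
    let score = 0;
    for (const t of terms) {
      const tf = docs[i].filter((w) => w === t).length;
      if (tf === 0) continue;
      const idf = Math.log(1 + (docs.length - df.get(t) + 0.5) / (df.get(t) + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * docs[i].length) / avgLen));
    }
    return { chunk, score };
  });
  return scored.sort((x, y) => y.score - x.score);
}
```

With these defaults a 100-line file yields three windows (lines 1-50, 46-95, 91-100), so a match near a window boundary still appears whole in at least one chunk.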
Token counts come from gpt-tokenizer (cl100k_base). The ratio of the two numbers
is the "grep tax": the cost an AI agent pays when it greps and reads whole files
instead of retrieving ranked chunks.
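As a small worked example of the ratio, the sketch below sums tokens over the benchmark questions and divides. Two things here are assumptions, not confirmed by this README: that the ratio is naive divided by smart (so larger means worse), and that per-question counts are summed before dividing rather than averaged as ratios. The field names `naiveTokens` and `smartTokens` are hypothetical.

```javascript
// results: one entry per benchmark question, e.g. { naiveTokens, smartTokens }.
// Returns naive / smart over the whole scan, or null when the ratio is undefined.
function grepTax(results) {
  const naive = results.reduce((n, r) => n + r.naiveTokens, 0);
  const smart = results.reduce((n, r) => n + r.smartTokens, 0);
  return smart > 0 ? naive / smart : null;
}
```

Under these assumptions, a repo that costs 4,000 tokens to navigate with grep and 400 with ranked chunks has a grep tax of 10.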
Ground truth is regex-based and deterministic. Every scorecard lists the exact
patterns used, so the grade is reproducible.
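One way to picture regex-based ground truth and the 95% stopping rule is the sketch below. The function names and the example patterns in the usage note are hypothetical; the real queries and patterns live in lib/benchmark.js and are listed on each scorecard.

```javascript
// Indices of the chunks whose text matches at least one ground-truth pattern.
// Patterns should not carry the `g` flag, because RegExp.prototype.test is
// stateful for global regexes.
function groundTruth(chunks, patterns) {
  const hits = new Set();
  chunks.forEach((c, i) => {
    if (patterns.some((re) => re.test(c.text))) hits.add(i);
  });
  return hits;
}

// Walk chunk indices in reading order and return how many had to be read
// before `threshold` of the ground-truth chunks were covered.
function readsUntilCovered(order, truth, threshold = 0.95) {
  if (truth.size === 0) return 0;
  let covered = 0;
  for (let n = 0; n < order.length; n++) {
    if (truth.has(order[n])) covered++;
    if (covered / truth.size >= threshold) return n + 1;
  }
  return order.length; // threshold never reached: everything was read
}
```

For instance, with patterns such as `/login/` and `/\bauth\b/` matching chunks 0 and 2, a reading order of `[2, 1, 0, 3]` needs three reads to reach full coverage. Because matching is plain regex over text, the same repo snapshot always yields the same ground truth.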
Endpoints
All routes mount under /grep-tax. No route requires auth, including the health endpoint.
| Method | Path | Purpose |
|--------|------|---------|
| GET | /grep-tax/health | Health check; returns {ok:true} with HTTP 200 |
| GET | /grep-tax/ | SPA: submit form, leaderboards |
| GET | /grep-tax/r/:owner/:name | Scorecard HTML with OG meta |
| POST | /grep-tax/api/scan | {url} → {id,status,cached} |
| GET | /grep-tax/api/scan/:id | Scan status + partial results |
| GET | /grep-tax/api/repo/:owner/:name | Latest complete scan for a repo (404 if never scanned) |
| GET | /grep-tax/api/leaderboard?sort=grade\|ratio&order=asc\|desc&limit=20 | Top repos |
| GET | /grep-tax/api/recent?limit=10 | Last N completed scans |
| GET | /grep-tax/badge/:owner/:name.svg | Shields-style SVG, 24h Cache-Control |
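A client for these routes might look like the sketch below. The helper names, the `base` parameter (wherever the service is deployed), the default sort values, and the injectable `fetchImpl` are assumptions for illustration; the paths, query parameters, and the `{id,status,cached}` response shape come from the table above.

```javascript
// Markdown for a README badge that links to the repo's scorecard.
function badgeMarkdown(base, owner, name) {
  const repo = `${encodeURIComponent(owner)}/${encodeURIComponent(name)}`;
  return `[![grep-tax](${base}/grep-tax/badge/${repo}.svg)](${base}/grep-tax/r/${repo})`;
}

// Leaderboard URL with the three documented query parameters.
// ASSUMPTION: these defaults are this helper's own, not necessarily the server's.
function leaderboardUrl(base, { sort = "ratio", order = "desc", limit = 20 } = {}) {
  const url = new URL("/grep-tax/api/leaderboard", base);
  url.search = new URLSearchParams({ sort, order, limit: String(limit) }).toString();
  return url.toString();
}

// Submit a repo for scanning. `fetchImpl` defaults to the global fetch
// available in Node 18+ and can be replaced with a stub in tests.
async function submitScan(base, repoUrl, fetchImpl = fetch) {
  const res = await fetchImpl(`${base}/grep-tax/api/scan`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: repoUrl }),
  });
  if (!res.ok) throw new Error(`scan request failed: ${res.status}`);
  return res.json(); // { id, status, cached } per the table above
}
```

The returned `id` can then be polled at GET /grep-tax/api/scan/:id until the scan finishes.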
Data sources
Every fact in this product is fetched from a real public endpoint at scan time.
No seeded leaderboards. No mock benchmarks.
| Source | URL | When |
|--------|-----|------|
| GitHub REST API — repo metadata (stars, primary language, size, default branch) | https://api.github.com/repos/{owner}/{repo} | Once per scan submit. 60 req/hr unauthenticated; set GITHUB_TOKEN for 5000 req/hr. |
| GitHub Git protocol — shallow clone | https://github.com/{owner}/{repo}.git via git clone --depth 1 --filter=blob:limit=256k | Once per scan (skipped on 24 h cache hit). 90 s timeout. Working tree deleted after scan. |
There is no embedding API call, no LLM judge, no third-party scoring service.
BM25 and tokenization run locally.
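The two sources above could be derived from a submitted URL along these lines. This is a sketch, not the contents of lib/github.js or lib/clone.js: the function names are hypothetical, and rejecting non-https URLs or URLs with extra path segments (such as /tree/main) is an assumption about validation policy. The API URL, clone flags, and 90 s timeout are taken from the table.

```javascript
// Parse https://github.com/owner/repo (optional trailing slash or ".git").
// Returns { owner, name } or null when the URL is not a plain repo URL.
function parseGitHubUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return null; // not a URL at all
  }
  if (url.protocol !== "https:" || url.hostname !== "github.com") return null;
  const [owner, rawName, ...rest] = url.pathname.split("/").filter(Boolean);
  if (!owner || !rawName || rest.length > 0) return null;
  const name = rawName.replace(/\.git$/, "");
  return name ? { owner, name } : null;
}

// Build the metadata URL and the git arguments for one scan.
// `dest` is the directory the working tree is cloned into.
function scanSources({ owner, name }, dest) {
  return {
    metadataUrl: `https://api.github.com/repos/${owner}/${name}`,
    cloneArgs: [
      "clone", "--depth", "1", "--filter=blob:limit=256k",
      `https://github.com/${owner}/${name}.git`, dest,
    ],
    cloneTimeoutMs: 90_000,
  };
}
```

The arguments are kept as an array so they can be passed to something like `child_process.execFile("git", cloneArgs, { timeout: cloneTimeoutMs })` without going through a shell, which avoids quoting problems with user-supplied repo names.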
Refresh / cache
- Per-repo cache: the latest completed scan is served for 24 h. To force a
  fresh scan, POST with {"url":"…","force":true} or use the rescan button on
  any scorecard.
- Badge cache: the SVG endpoint sets Cache-Control: public, max-age=86400
  so README badges don't hammer the server when a popular repo's page loads.
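The two cache rules above reduce to a timestamp comparison and a header value. The sketch below assumes timestamps are stored as epoch milliseconds and uses hypothetical names; the 24 h window, the `force` flag, and the header string come from the list above.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// True when the last completed scan may be served instead of re-cloning.
// `lastCompletedAt` and `now` are epoch milliseconds; null means never scanned.
function canServeCached(lastCompletedAt, now, force = false) {
  if (force || lastCompletedAt == null) return false;
  return now - lastCompletedAt < DAY_MS;
}

// Header value for the badge endpoint: 86400 seconds is 24 hours.
const BADGE_CACHE_CONTROL = `public, max-age=${DAY_MS / 1000}`;
```

Passing `now` in explicitly, rather than calling `Date.now()` inside the function, keeps the check easy to test.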
Running locally
npm install
PORT=4806 npm start
# → http://127.0.0.1:4806/grep-tax/
GITHUB_TOKEN in .env (or in your shell) is optional — it only raises the
metadata rate limit from 60/hr to 5000/hr. Cloning works unauthenticated.
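An optional token typically just means adding one header when it is present. The sketch below shows that pattern; the function name and the `User-Agent` value are assumptions, and how the project actually loads .env is not stated here.

```javascript
// Build request headers for the GitHub REST API. The Authorization header is
// added only when GITHUB_TOKEN is set, so unauthenticated use keeps working.
function githubHeaders(env = process.env) {
  const headers = {
    Accept: "application/vnd.github+json",
    "User-Agent": "grep-tax", // GitHub asks every API client to identify itself
  };
  if (env.GITHUB_TOKEN) headers.Authorization = `Bearer ${env.GITHUB_TOKEN}`;
  return headers;
}
```

These headers apply only to the metadata request; the clone goes over the public Git endpoint and needs no credentials.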
File layout
grep-tax/
├── server.js Express boot, routes mount, queue worker tick
├── db.js better-sqlite3 init (WAL) + prepared statements
├── routes/
│ ├── scan.js POST /api/scan, GET /api/scan/:id
│ ├── repo.js GET /api/repo/:owner/:name
│ ├── leaderboard.js GET /api/leaderboard, GET /api/recent
│ ├── badge.js GET /badge/:owner/:name.svg
│ └── report.js GET /r/:owner/:name (HTML shell + OG meta)
├── lib/
│ ├── github.js URL parsing, REST metadata fetch, size guards
│ ├── clone.js shallow clone, file enumeration, extension filter
│ ├── benchmark.js the 5 fixed queries + ground-truth regex
│ ├── grep.js naive grep+read simulator
│ ├── retriever.js chunker (50-line / 5-overlap) + BM25 ranker
│ ├── tokens.js gpt-tokenizer wrapper + per-scan cache
│ ├── grade.js avg tokens → letter grade + color
│ ├── queue.js single-worker scan pipeline
│ └── badge-svg.js shields-style SVG generator
└── public/
    ├── index.html home SPA (form + leaderboards)
    ├── app.js submit + live-poll + leaderboard render
    ├── report.js scorecard hydration
    └── style.css dark theme
What's out of scope
- Auth, login, API keys (every endpoint is public).
- Private repos.
- Real embedding models — BM25 over chunks is the smart strategy for the MVP.
- Custom queries — the five questions are fixed.
- Payments, tiers, accounts, webhooks.
- Tree-sitter / LSP / AST parsing — regex on file contents only.
Stack
Node ≥ 22, Express 5, better-sqlite3 (WAL), helmet, compression, nanoid,
gpt-tokenizer. SQLite is the only persistence. No worker queue infra.
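A single-worker pipeline without queue infrastructure can be as small as an in-memory array and a busy flag. The sketch below is one possible shape, not the code in lib/queue.js: `createQueue` and `runJob` are hypothetical names, and it is event-driven, whereas the real worker may run on a timer tick.

```javascript
// In-process FIFO that runs at most one job at a time.
function createQueue(runJob) {
  const pending = [];
  let busy = false;

  async function tick() {
    if (busy || pending.length === 0) return;
    busy = true;
    const job = pending.shift();
    try {
      await runJob(job);
    } catch (err) {
      console.error("scan failed:", err); // one bad repo must not stall the queue
    } finally {
      busy = false;
      tick(); // pick up the next job, if any
    }
  }

  return {
    push(job) {
      pending.push(job);
      tick();
    },
    get size() {
      return pending.length;
    },
    get busy() {
      return busy;
    },
  };
}
```

An in-memory queue like this loses pending jobs on restart, so in practice the scan rows in SQLite would need to record enough status to re-enqueue unfinished work at boot.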