context-tax

Token-cost leaderboard for AI coding agents. A live, public dashboard that measures the tokenizer-counted context tax that popular open-source repos impose on a Claude / GPT / Cursor / Cline session, using only real GitHub fetches and an on-server tokenizer at runtime. No mocks. No seed data. No fake numbers.

Live: https://holyai.me/context-tax/
Stack: Node.js · Express · better-sqlite3 (WAL) · node-cron · helmet · compression · gpt-tokenizer (o200k_base)
Port: 4778 · BASE_PATH: /context-tax · Auth: none — every endpoint is public.

---

Why

Mid-2026 telemetry from Vantage, Stanford DEL and Augment Code's "70% Token Reduction" playbook all point to the same fact: roughly 70% of the tokens an AI coding agent burns are waste — re-reading the same source files, ingesting committed lock files, parsing generated bundles, and re-sending the full conversation history on every turn. Every API call ships the entire active context as input, so at scale the shape of the repo dominates the bill.

A question that nobody is answering publicly:

If two OSS libraries solve the same problem, which one is cheaper for my AI agent to understand?

context-tax is that answer. For every tracked repo it computes:

- a real o200k_base token count over a 20-file source sample (gpt-tokenizer npm package, runs in-process, no external API),
- a lock-file & generated-code bloat ratio over the full Git tree,
- an agent-doc presence score (AGENTS.md, CLAUDE.md, .cursorrules, .windsurfrules, .clinerules, README),
- a navigation difficulty score (avg + max path depth, top-level chaos),
- a source-to-noise ratio and a doc density score,
- a single composite Context Tax score 0–100 (lower = cheaper / more agent-friendly).
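The bloat ratio in the list above can be sketched as a byte share over the Git tree. This is a minimal illustration, not the project's classifier: the `bloatRatio` name and the exact path patterns are assumptions, loosely based on the categories this README names (lock files, dist, build, .min, vendor, committed node_modules).

```javascript
// Hypothetical noise patterns; the project's actual list may differ.
const NOISE_PATTERNS = [
  /(^|\/)(package-lock\.json|yarn\.lock|pnpm-lock\.yaml|Cargo\.lock|poetry\.lock|Gemfile\.lock|composer\.lock|go\.sum)$/,
  /(^|\/)(dist|build|vendor|node_modules)\//,
  /\.min\.(js|css)$/,
];

// entries: [{ path: string, size: number }], one per blob in the Git tree.
function bloatRatio(entries) {
  let total = 0;
  let noise = 0;
  for (const { path, size } of entries) {
    total += size;
    if (NOISE_PATTERNS.some((re) => re.test(path))) noise += size;
  }
  // Treat an empty tree as having no bloat.
  return total === 0 ? 0 : noise / total;
}

const ratio = bloatRatio([
  { path: 'src/index.js', size: 6000 },
  { path: 'package-lock.json', size: 3000 },
  { path: 'dist/bundle.min.js', size: 1000 },
]);
console.log(ratio.toFixed(2)); // 0.40
```

Here 4,000 of 10,000 bytes come from a lock file and a generated bundle, so 40% of the tree is noise an agent would pay for without learning anything.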

The product is the public leaderboard, the per-repo detail page, the share card SVG (/api/card/owner/name.svg), and the embeddable SVG badge (/api/badge/owner/name.svg) that maintainers can paste in their README.

---

Data sources (real, public, no fabrication)

| Source | URL | Auth | Frequency | Used for |
|---|---|---|---|---|
| GitHub repo metadata | https://api.github.com/repos/{owner}/{name} | optional GITHUB_TOKEN | every 12h, in waves | stars, forks, language, license, default branch, pushed_at |
| GitHub recursive tree | https://api.github.com/repos/{owner}/{name}/git/trees/{branch}?recursive=1 | optional | every 12h | full file list w/ blob sizes; identifies agent-doc files |
| GitHub raw content | https://raw.githubusercontent.com/{owner}/{name}/{branch}/{path} | none | every 12h, capped 20 files/repo | sampled source files for tokenization |
| GitHub repository search | https://api.github.com/search/repositories | optional | daily 03:13 UTC | discover new high-star repos across 10 languages |
| GitHub releases | https://api.github.com/repos/{owner}/{name}/releases | optional | daily, optional | latest release timestamp (display only) |

GITHUB_TOKEN is optional. Without it the service runs against the unauthenticated GitHub API (60 req/hr per IP) and clamps the discovery pass to ~5 new repos per cycle. With a token (5,000 req/hr) it comfortably refreshes 100+ repos every 12 h. In .env.example the token appears only as a __INJECT_FROM_VAULT__ placeholder, which the RNDLAB deploy vault fills in at deploy time. Never hardcode keys.
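As a rough sketch of how the optional token changes a request, the helper below builds the URL and headers for the recursive-tree call from the table above. `buildTreeRequest` is a hypothetical name, and the `User-Agent` value is an assumption; the real client lives in fetchers/github.js and may be structured differently.

```javascript
// Hypothetical helper: builds the request for the recursive-tree endpoint.
function buildTreeRequest(owner, name, branch, token) {
  const url =
    `https://api.github.com/repos/${encodeURIComponent(owner)}/` +
    `${encodeURIComponent(name)}/git/trees/${encodeURIComponent(branch)}?recursive=1`;
  const headers = {
    Accept: 'application/vnd.github+json',
    'User-Agent': 'context-tax', // assumed value; GitHub requires some User-Agent
  };
  // Only authenticated requests carry the header, so the tokenless path
  // falls back to the unauthenticated rate limit.
  if (token) headers.Authorization = `Bearer ${token}`;
  return { url, headers };
}

const req = buildTreeRequest('facebook', 'react', 'main', process.env.GITHUB_TOKEN);
console.log(req.url);
// https://api.github.com/repos/facebook/react/git/trees/main?recursive=1
// On Node 18+, fetch(req.url, { headers: req.headers }) would perform the call.
```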

There are no seed numbers in this codebase. data/seed-repos.json contains only a list of repo identifiers (owner/name) used to seed the discovery loop. These are not data; they are GitHub addresses we ask the cron to consider. If a fetch fails, the failure is recorded in fetch_runs and the previous snapshot remains visible. We never substitute synthetic values, never call Math.random, never invent token counts.

---

API

All under BASE_PATH (default /context-tax). All public, all read-only except the rate-limited write endpoints below. CORS is open so badges can embed anywhere.

| Method | Path | Returns |
|---|---|---|
| GET | /health | {ok:true, repo_count, snapshot_count, last_snapshot_at, uptime_sec} |
| GET | /api/leaderboard?sort=&order=&lang=&min_stars=&has_agent_docs=&q=&page=&limit= | paginated list of latest snapshots |
| GET | /api/repo/:owner/:name | full detail: repo meta, latest snapshot, axis breakdown, history, sampled files |
| GET | /api/repo/:owner/:name/files | sampled-file list with per-file token counts |
| GET | /api/repo/:owner/:name/history | last 30 snapshots (sparkline source) |
| GET | /api/stats | aggregate stats incl. language distribution + axis averages |
| GET | /api/axes | machine-readable axis definitions + weights + tokenizer label |
| GET | /api/sources | data-source transparency: per-source success rate, avg latency, recent fetches |
| GET | /api/badge/:owner/:name.svg | 200×40 embeddable badge |
| GET | /api/card/:owner/:name.svg | 450×260 share-card |
| POST | /api/refresh/:owner/:name | manual refresh, 1 req / minute / IP |
| POST | /api/submit | body {owner, repo}, requires ≥ 50 stars, 5 req / hour / IP |
| POST | /api/prune-rate | janitor, removes rate-limit rows older than 24 h |
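A small client-side sketch for the leaderboard endpoint in the table above. The parameter names come from that table; the values passed here (`TypeScript`, `1000`, and so on) are assumptions, since the accepted values for each parameter are not documented in this README. `leaderboardUrl` is a hypothetical helper.

```javascript
// Public instance plus BASE_PATH, as given in this README.
const BASE = 'https://holyai.me/context-tax';

// Builds a leaderboard URL, skipping parameters that are unset.
function leaderboardUrl(params = {}) {
  const qs = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null && value !== '') {
      qs.set(key, String(value));
    }
  }
  const query = qs.toString();
  return `${BASE}/api/leaderboard${query ? `?${query}` : ''}`;
}

const url = leaderboardUrl({ lang: 'TypeScript', min_stars: 1000, page: 1, limit: 20 });
console.log(url);
// https://holyai.me/context-tax/api/leaderboard?lang=TypeScript&min_stars=1000&page=1&limit=20
```

Because every GET endpoint is public and CORS is open, the resulting URL can be fetched directly from a browser or script without credentials.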

---

Scoring

Each axis is 0–100 (higher = more agent-friendly). The composite tax score is 100 − weighted_average(axes), so lower tax = cheaper.

| Axis | Weight | What it measures |
|---|---:|---|
| Size sanity | 25% | total sampled tokens; tiered (≤30k=100, …, >1.5M=5) |
| Lock & generated bloat | 20% | share of tree bytes from lock files / dist / build / .min / vendor / committed node_modules |
| Agent-doc presence | 20% | +25 each for AGENTS.md, CLAUDE.md, .cursorrules, .windsurfrules, .clinerules. Missing README = -10. |
| Navigation friendliness | 15% | avg + max path depth + top-level file count |
| Source-to-noise ratio | 10% | source-code tokens / total sampled tokens |
| Doc density | 10% | docs tokens / total sampled tokens. Sweet spot 5–15% |

Weights and thresholds are surfaced via GET /api/axes for full transparency.
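The composite formula can be sketched as follows, using the weights from the table above. The key names, the clamping of each axis to 0–100, and the rounding to an integer are assumptions for illustration; score.js is the authoritative implementation.

```javascript
// Weights in percent, taken from the scoring table; key names are hypothetical.
const WEIGHTS = {
  size: 25,
  bloat: 20,
  agentDocs: 20,
  navigation: 15,
  sourceToNoise: 10,
  docDensity: 10,
};

// Keep an axis inside its documented 0-100 range.
const clamp = (n) => Math.min(100, Math.max(0, n));

// axes: one 0-100 score per key in WEIGHTS (higher = more agent-friendly).
function contextTax(axes) {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [key, weight] of Object.entries(WEIGHTS)) {
    weightedSum += weight * clamp(axes[key]);
    totalWeight += weight;
  }
  const weightedAverage = weightedSum / totalWeight;
  // Tax is the inverse of friendliness; integer rounding is an assumption.
  return Math.round(100 - weightedAverage);
}

const tax = contextTax({
  size: 80,
  bloat: 60,
  agentDocs: 50,
  navigation: 60,
  sourceToNoise: 90,
  docDensity: 40,
});
console.log(tax); // 36
```

In this example the weighted average of the axes is 64, so the tax is 100 − 64 = 36. A repo that scored 100 on every axis would have a tax of 0.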

---

Embed the badge

![context-tax](https://holyai.me/context-tax/api/badge/facebook/react.svg)

Color grades green → red as the score rises. The share card is at …/api/card/owner/name.svg and is 450×260, dark-themed.
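For maintainers who generate README snippets in a script, a minimal sketch of building the embed line for any repo. `badgeMarkdown` is a hypothetical helper; the URL pattern is the one shown in the example above.

```javascript
// Builds the markdown image line for a repo's badge.
function badgeMarkdown(owner, name, base = 'https://holyai.me/context-tax') {
  const src = `${base}/api/badge/${encodeURIComponent(owner)}/${encodeURIComponent(name)}.svg`;
  return `![context-tax](${src})`;
}

console.log(badgeMarkdown('facebook', 'react'));
// ![context-tax](https://holyai.me/context-tax/api/badge/facebook/react.svg)
```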

---

Run it locally

cp .env.example .env
# Optional: set GITHUB_TOKEN to a GitHub PAT to raise the rate limit
npm install
node server.js
# → http://localhost:4778/context-tax/

On first boot, the cron kicker fires an initial refresh of 3 repos so the leaderboard is non-empty within ~60 seconds. The `7 */6 * * *` cron then refreshes the 25 stalest repos every 6 hours, and a daily 03:13 UTC discovery pass adds new high-star repos.
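Picking the stalest repos for a refresh cycle could look roughly like this. The field names (`fullName`, `lastSnapshotAt`) and the choice to treat never-refreshed repos as the stalest are assumptions; the real schema and query are not shown in this README.

```javascript
// repos: [{ fullName: string, lastSnapshotAt: number | null }]
// lastSnapshotAt is epoch milliseconds; null means never refreshed.
function pickStalest(repos, limit = 25) {
  return [...repos] // copy so the caller's array is not reordered
    .sort((a, b) => (a.lastSnapshotAt ?? 0) - (b.lastSnapshotAt ?? 0))
    .slice(0, limit);
}

const batch = pickStalest(
  [
    { fullName: 'a/fresh', lastSnapshotAt: 3000 },
    { fullName: 'b/never', lastSnapshotAt: null },
    { fullName: 'c/old', lastSnapshotAt: 1000 },
  ],
  2,
);
console.log(batch.map((r) => r.fullName)); // [ 'b/never', 'c/old' ]
```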

---

Architecture

server.js              # Express boot, BASE_PATH mount, cron init
db.js                  # better-sqlite3 (WAL) init + prepared statements
score.js               # pure axis + composite scoring
refresh.js             # orchestrator: meta → tree → sample → tokenize → score → snapshot
fetchers/github.js     # thin REST + raw client, logs every call to fetch_runs
fetchers/tokenize.js   # gpt-tokenizer + path classification
fetchers/sampler.js    # which files to fetch (largest source, cap 20)
routes/api.js          # JSON endpoints + IP rate limiting
routes/badges.js       # SVG badge + share card
data/seed-repos.json   # ~90 owner/name strings used for discovery seeding
public/index.html      # vanilla SPA shell
public/app.js          # tab switching, fetch wrappers, modal detail view
public/style.css       # dark theme

No build step, no transpiler, no framework. Total LOC > 1,500.
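To make the sampler step concrete, here is a minimal sketch of "largest source, cap 20" as described for fetchers/sampler.js. The extension list and the `sampleFiles` name are assumptions; the entry shape (`path`, `size`, `type`) follows GitHub's tree API, where files have type `blob`.

```javascript
// Hypothetical source-file extensions; the project's real classification may differ.
const SOURCE_EXT = /\.(js|jsx|ts|tsx|py|go|rs|java|rb|c|cc|cpp|h)$/i;

// tree: [{ path, size, type }] entries from a recursive tree listing.
function sampleFiles(tree, cap = 20) {
  return tree
    .filter((e) => e.type === 'blob' && SOURCE_EXT.test(e.path))
    .sort((a, b) => b.size - a.size) // largest first
    .slice(0, cap)
    .map((e) => e.path);
}

const sample = sampleFiles(
  [
    { path: 'src/big.ts', size: 900, type: 'blob' },
    { path: 'README.md', size: 5000, type: 'blob' },
    { path: 'src', type: 'tree' },
    { path: 'src/small.ts', size: 100, type: 'blob' },
    { path: 'src/mid.py', size: 400, type: 'blob' },
  ],
  2,
);
console.log(sample); // [ 'src/big.ts', 'src/mid.py' ]
```

The returned paths would then be fetched from raw.githubusercontent.com and passed to the tokenizer, following the meta → tree → sample → tokenize → score → snapshot order in refresh.js.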

---

License

MIT.