license-pulse
Live dashboard tracking the license posture of the top open-weight AI models and datasets on Hugging Face. Detects permissive→restrictive shifts and "open-washing" the moment they happen.
Why it exists: Hugging Face has no native surface for "is the top of the open-weight catalog still permissive?" or "did this model just change its license?" — and license rug-pulls are accelerating in 2026 (MiniMax M2.7 flipping to non-commercial, Meta tightening Llama community terms, Mistral splitting into research-only variants). license-pulse closes that gap with a public, cron-snapshotted, append-only event log.
What it does
Every six hours the server fetches:
- The top 500 most-downloaded models on Hugging Face:
  https://huggingface.co/api/models?sort=downloads&direction=-1&limit=500
- The top 200 most-downloaded datasets:
  https://huggingface.co/api/datasets?sort=downloads&direction=-1&limit=200
Each entry's license:<id> tag is normalised into one of seven classes (permissive, weak_copyleft, strong_copyleft, responsible_ai, noncommercial, custom, unspecified) and snapshotted into SQLite. If a repo's license tag or class changed since the previous snapshot, an immutable row is appended to drift_events with the direction (tighter / looser / lateral / first_seen).
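The drift rule described above can be sketched as a pure comparison between two snapshots of the same repo. This is a minimal illustration, not the code in lib/drift.js: the RESTRICTIVENESS ranking, the detectDrift name, and the snapshot field names (repoId, licenseTag, licenseClass) are assumptions, as is treating a repo with no previous snapshot as first_seen.

```javascript
// Hypothetical restrictiveness ranks. The ordering is an assumption made for
// this sketch; the project may rank the classes differently.
const RESTRICTIVENESS = {
  permissive: 0,
  weak_copyleft: 1,
  strong_copyleft: 2,
  responsible_ai: 3,
  custom: 4,
  unspecified: 4,
  noncommercial: 5,
};

/**
 * Compare the previous and current snapshot of one repo.
 * Returns null when nothing changed, otherwise a drift event object.
 */
function detectDrift(prev, curr) {
  if (!prev) {
    // No earlier snapshot: record the repo as newly observed.
    return { repoId: curr.repoId, direction: "first_seen", from: null, to: curr.licenseTag };
  }
  if (prev.licenseTag === curr.licenseTag && prev.licenseClass === curr.licenseClass) {
    return null; // unchanged, nothing to append
  }
  const delta = RESTRICTIVENESS[curr.licenseClass] - RESTRICTIVENESS[prev.licenseClass];
  // Same rank but a different tag (e.g. mit -> apache-2.0) counts as lateral.
  const direction = delta > 0 ? "tighter" : delta < 0 ? "looser" : "lateral";
  return { repoId: curr.repoId, direction, from: prev.licenseTag, to: curr.licenseTag };
}
```

Keeping the comparison free of database access makes it easy to unit-test; the caller would then append any non-null result to drift_events.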
The SPA renders five tabs:
- Distribution — license-class share across the top 500 (or top 100, or models created in the last 30 days), with a stacked bar, legend, and per-class top examples.
- Drift Feed — append-only timeline of detected license changes. Filter by model/dataset/tighter-only.
- Orgs — leaderboard of Hugging Face organisations sorted by openness score (% of their public top-tier models under a fully permissive license). Click a row for the full model list and a 30-day openness sparkline.
- Datasets — same distribution view + drift feed, scoped to datasets.
- Lookup — paste any repo id, get its current classification, drift history (where we have one), and 5 permissive alternatives sharing the same pipeline_tag.
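As an illustration of the Orgs tab, the openness score (share of an organisation's models classed as permissive) could be computed roughly as follows. The function names, the input shape ({ repoId, licenseClass }), the one-decimal rounding, and the defaults mirroring the limit and min_models query parameters are assumptions of this sketch, not the project's actual aggregate code.

```javascript
/** Percentage of models classed as permissive, rounded to one decimal. */
function opennessScore(models) {
  if (models.length === 0) return null; // no models, no meaningful score
  const permissive = models.filter((m) => m.licenseClass === "permissive").length;
  return Math.round((permissive / models.length) * 1000) / 10;
}

/** Group models by org (the part of the repo id before "/") and rank by score. */
function orgLeaderboard(models, { minModels = 3, limit = 40 } = {}) {
  const byOrg = new Map();
  for (const m of models) {
    if (!m.repoId.includes("/")) continue; // skip repos without a namespace
    const org = m.repoId.split("/")[0];
    if (!byOrg.has(org)) byOrg.set(org, []);
    byOrg.get(org).push(m);
  }
  return [...byOrg.entries()]
    .filter(([, list]) => list.length >= minModels)
    .map(([org, list]) => ({ org, models: list.length, score: opennessScore(list) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

The minimum-model threshold keeps an organisation with a single permissive model from topping the board at 100%.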
Every endpoint is public, and all GET endpoints are read-only; the one exception is POST /api/refresh, which triggers a snapshot. There is no auth: the underlying data is public, and viewing it shouldn't require credentials.
Stack
- Node.js 20+
- Express 4
- better-sqlite3 (WAL mode)
- node-cron
- helmet, compression
- Vanilla JS SPA, dark theme, English UI
No bundler. No frontend framework. No Tailwind. The SPA is three files in public/.
Endpoints
All routes are prefixed with BASE_PATH=/license-pulse.
| Method | Path | Description |
|--------|--------------------------------------------------|------------------------------------------------------------|
| GET | /license-pulse/ | SPA shell |
| GET | /license-pulse/health | {status:"ok",ts} — no auth, used by orchestrator |
| GET | /license-pulse/api/distribution?scope=top500 | License-class share across top 500 / top 100 / last 30 days |
| GET | /license-pulse/api/drift?limit=50&since=ISO | Drift event feed |
| GET | /license-pulse/api/orgs?limit=40&min_models=3 | Org openness leaderboard |
| GET | /license-pulse/api/orgs/:org | Org detail (model list + 30d sparkline) |
| GET | /license-pulse/api/datasets?scope=top200 | Dataset distribution + drift |
| GET | /license-pulse/api/lookup?id=<org/repo> | Live lookup with HF passthrough + alternatives |
| GET | /license-pulse/api/stats | Counts + last-run summary |
| POST | /license-pulse/api/refresh | Manual snapshot trigger (no auth) |
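A small client for the routes in the table might look like the sketch below. It relies only on the standard URL class and the global fetch available in Node.js 20+. The helper names endpointUrl and getJson are hypothetical, and the response bodies are not described here, so the sketch returns the parsed JSON without assuming its shape.

```javascript
const BASE_PATH = "/license-pulse";

/** Build an absolute URL for one of the routes, skipping empty parameters. */
function endpointUrl(origin, path, params = {}) {
  const url = new URL(`${BASE_PATH}${path}`, origin);
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

/** GET a route and parse the JSON body, failing loudly on non-2xx responses. */
async function getJson(origin, path, params) {
  const res = await fetch(endpointUrl(origin, path, params));
  if (!res.ok) throw new Error(`GET ${path} failed with HTTP ${res.status}`);
  return res.json();
}

// Example (requires a running server):
//   const feed = await getJson("http://localhost:4739", "/api/drift", { limit: 50 });
```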
Data sources
Public, unauthenticated, no API key required:
- https://huggingface.co/api/models (with sort, direction, limit, pipeline_tag filters)
- https://huggingface.co/api/datasets
- https://huggingface.co/api/models/<repo_id> (single repo detail, used by /lookup)
- https://huggingface.co/api/datasets/<repo_id>
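For illustration, fetching one of these lists and pulling out the license:<id> tag could be sketched as below. The helper names (listUrl, licenseTagOf, fetchTop) are hypothetical and not taken from lib/hf.js; the sketch also assumes each list entry carries an id and a tags array, and lower-cases the license id for convenience.

```javascript
const HF_API = "https://huggingface.co/api";

/** Build the "top N by downloads" list URL for "models" or "datasets". */
function listUrl(kind, limit) {
  const url = new URL(`${HF_API}/${kind}`);
  url.searchParams.set("sort", "downloads");
  url.searchParams.set("direction", "-1");
  url.searchParams.set("limit", String(limit));
  return url.toString();
}

/** Return the id from a "license:<id>" tag, or null when no such tag exists. */
function licenseTagOf(entry) {
  const tag = (entry.tags || []).find(
    (t) => typeof t === "string" && t.startsWith("license:")
  );
  return tag ? tag.slice("license:".length).toLowerCase() : null;
}

/** Fetch a list and reduce each entry to the fields a snapshot needs. */
async function fetchTop(kind, limit) {
  const res = await fetch(listUrl(kind, limit));
  if (!res.ok) throw new Error(`HF API returned HTTP ${res.status}`);
  const entries = await res.json();
  return entries.map((e) => ({ repoId: e.id, licenseTag: licenseTagOf(e) }));
}
```

A null tag from licenseTagOf corresponds to the unspecified class in the taxonomy.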
Cron cadence:
- 0 */6 * * *: full snapshot of top 500 models + top 200 datasets
- 30 4 * * *: daily summary tick (re-snapshots; future hook for aggregate tables)
- On boot: kick a fresh run if the last successful run was more than 12 hours ago.
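The boot-time rule can be expressed as a small pure function, sketched here. The cron expressions are this sketch's reading of the cadence (every six hours on the hour, and daily at 04:30); the names SCHEDULES, shouldKickOnBoot, runSnapshot, and lastSuccessfulRunAt are hypothetical, and treating a missing or unparseable timestamp as stale is an assumption.

```javascript
// Assumed five-field cron expressions for the two scheduled jobs.
const SCHEDULES = {
  fullSnapshot: "0 */6 * * *",
  dailySummary: "30 4 * * *",
};

const STALE_AFTER_MS = 12 * 60 * 60 * 1000; // 12 hours

/** Decide whether to start a snapshot immediately at boot. */
function shouldKickOnBoot(lastSuccessIso, now = new Date()) {
  if (!lastSuccessIso) return true; // no successful run recorded yet
  const last = Date.parse(lastSuccessIso);
  if (Number.isNaN(last)) return true; // unreadable timestamp: treat as stale
  return now.getTime() - last > STALE_AFTER_MS;
}

// With node-cron, the wiring could look roughly like this:
//   const cron = require("node-cron");
//   cron.schedule(SCHEDULES.fullSnapshot, runSnapshot);
//   cron.schedule(SCHEDULES.dailySummary, runSnapshot);
//   if (shouldKickOnBoot(lastSuccessfulRunAt)) runSnapshot();
```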
Running locally
cp .env.example .env # default port 4739
npm install # better-sqlite3 builds against your local Node
npm start
# open http://localhost:4739/license-pulse/
The very first request after boot will show "no data yet — first cron run pending" until the boot-kick run finishes (~5–15 s depending on network).
Real-data discipline
- No mock data. No Math.random() jitter. No hard-coded model lists.
- Every chart, table and number originates from a fetch() to huggingface.co/api/... resolved at runtime.
- If the HF API returns nothing, the UI says so honestly.
- DB starts empty. The first cron run populates it.
License classification taxonomy
Implemented in lib/classify.js. Buckets:
- permissive: apache-2.0, mit, bsd-*, cc0-1.0, cc-by-* (no NC), unlicense, wtfpl, 0bsd, isc, zlib, artistic-2.0, ODC/PDDL, EPL.
- weak_copyleft: lgpl-*, mpl-*.
- strong_copyleft: gpl-*, agpl-3.0, osl-3.0.
- responsible_ai: anything containing openrail, rail, bigscience-bloom-rail, etc.
- noncommercial: cc-by-nc-*, cc-by-nc-sa-*, mistral-research-license, mistral-ai-non-production-license, flux-1-dev, stable-cascade-nc-community.
- custom: other, llama2/3/4, gemma, qwen, tongyi-qianwen, deepseek, falcon-180b, cohere-openweights, apple-amlr, cdla-*, etc.
- unspecified: missing/null tag.
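A condensed sketch of how such a classifier might be structured is shown below. It is not the contents of lib/classify.js: the function name, the table layout, the concrete ids chosen for ODC/PDDL and EPL (odc-by, pddl, epl-), and the fallback of every unrecognised tag to custom are assumptions made for illustration.

```javascript
// Exact-match ids per class (a subset of the buckets, for brevity).
const EXACT = {
  permissive: ["apache-2.0", "mit", "cc0-1.0", "unlicense", "wtfpl", "0bsd",
    "isc", "zlib", "artistic-2.0", "odc-by", "pddl"],
  strong_copyleft: ["agpl-3.0", "osl-3.0"],
  noncommercial: ["mistral-research-license", "mistral-ai-non-production-license",
    "flux-1-dev", "stable-cascade-nc-community"],
};

// Prefix rules. Order matters: cc-by-nc- must be tested before cc-by-.
const PREFIX = [
  ["cc-by-nc-", "noncommercial"],
  ["cc-by-", "permissive"],
  ["bsd-", "permissive"],
  ["epl-", "permissive"],
  ["lgpl-", "weak_copyleft"],
  ["mpl-", "weak_copyleft"],
  ["gpl-", "strong_copyleft"],
];

/** Map a license tag id to one of the seven classes. */
function classifyLicense(tag) {
  if (tag === null || tag === undefined) return "unspecified";
  const id = String(tag).trim().toLowerCase();
  if (id === "") return "unspecified";
  for (const [cls, ids] of Object.entries(EXACT)) {
    if (ids.includes(id)) return cls;
  }
  if (id.includes("rail")) return "responsible_ai"; // openrail, bigscience-bloom-rail, ...
  for (const [prefix, cls] of PREFIX) {
    if (id.startsWith(prefix)) return cls;
  }
  return "custom"; // assumption: anything unrecognised needs a manual read
}
```

Testing the non-commercial prefix first is the important design point: without it, cc-by-nc-4.0 would match the generic cc-by- rule and be reported as permissive.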
commercial_ok:
- permissive, weak_copyleft, strong_copyleft → 1
- noncommercial → 0
- everything else → -1 ("read the terms before shipping")
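The mapping above is small enough to express directly. The function and constant names below are hypothetical; the return values follow the three rules as listed.

```javascript
const COMMERCIAL_OK_CLASSES = new Set(["permissive", "weak_copyleft", "strong_copyleft"]);

/**
 * Map a license class to the commercial_ok flag:
 *   1  = commercial use allowed
 *   0  = non-commercial only
 *  -1  = unknown, read the terms before shipping
 */
function commercialOk(licenseClass) {
  if (COMMERCIAL_OK_CLASSES.has(licenseClass)) return 1;
  if (licenseClass === "noncommercial") return 0;
  return -1;
}
```

A three-valued flag rather than a boolean keeps responsible_ai, custom, and unspecified licenses from being silently counted as either safe or unsafe.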
File layout
license-pulse/
├── server.js
├── db.js
├── package.json
├── .env.example
├── .gitignore
├── CLAUDE.md
├── README.md ← you are here
├── SPEC.md
├── lib/
│ ├── classify.js
│ ├── hf.js
│ ├── drift.js
│ └── aggregates.js
├── routes/
│ ├── index.js
│ ├── distribution.js
│ ├── drift.js
│ ├── orgs.js
│ ├── datasets.js
│ ├── lookup.js
│ └── stats.js
└── public/
  ├── index.html
  ├── app.js
  └── style.css
Roadmap (post-MVP)
- v1.1: license text full-text search + filterable export (CSV/JSON).
- v1.2: webhook/RSS export of the drift feed.
- v1.3: dataset → benchmark contamination crosswalk.
- v2: account-gated alerts ("notify me when any model in this org becomes non-commercial").
Author: Cowork (Claude Opus 4.7) · 2026-05-08.