license-pulse

Live dashboard tracking the license posture of the top open-weight AI models and datasets on Hugging Face. Detects permissive→restrictive shifts and "open-washing" at the next six-hourly snapshot after they happen.

Why it exists: Hugging Face has no native surface for "is the top of the open-weight catalog still permissive?" or "did this model just change its license?" — and license rug-pulls are accelerating in 2026 (MiniMax M2.7 flipping to non-commercial, Meta tightening Llama community terms, Mistral splitting into research-only variants). license-pulse closes that gap with a public, cron-snapshotted, append-only event log.

What it does

Every six hours the server fetches:

The top 500 most-downloaded models on Hugging Face: https://huggingface.co/api/models?sort=downloads&direction=-1&limit=500
The top 200 most-downloaded datasets: https://huggingface.co/api/datasets?sort=downloads&direction=-1&limit=200
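
As a rough illustration of the two list fetches above, here is a minimal sketch. The helper names buildListUrl and fetchTop are hypothetical and are not taken from lib/hf.js; the sketch assumes the global fetch() that ships with Node.js 18+.

```javascript
// Sketch only: buildListUrl and fetchTop are hypothetical helper names.
const HF_API = 'https://huggingface.co/api';

// Build the list URL for 'models' or 'datasets', sorted by downloads, descending.
function buildListUrl(kind, limit) {
  const params = new URLSearchParams({
    sort: 'downloads',
    direction: '-1',
    limit: String(limit),
  });
  return `${HF_API}/${kind}?${params}`;
}

// Fetch one list and fail loudly on a non-2xx response.
async function fetchTop(kind, limit) {
  const res = await fetch(buildListUrl(kind, limit));
  if (!res.ok) throw new Error(`HF API responded ${res.status} for ${kind}`);
  return res.json(); // JSON array of repo objects
}

// Usage (performs real network requests):
// const [models, datasets] = await Promise.all([
//   fetchTop('models', 500),
//   fetchTop('datasets', 200),
// ]);
```

Throwing on a non-OK status keeps a failed fetch from being recorded as an empty snapshot, which fits the "no mock data" rule further down.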

Each entry's license:<id> tag is normalised into one of seven classes (permissive, weak_copyleft, strong_copyleft, responsible_ai, noncommercial, custom, unspecified) and snapshotted into SQLite. If a repo's license tag or class changed since the previous snapshot, an immutable row is appended to drift_events with the direction (tighter / looser / lateral / first_seen).
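
The drift step described above can be sketched as follows. This is not the code in lib/drift.js: the function names, the snapshot shape ({ repo, tag, cls }) and especially the restrictiveness ranking are assumptions made for illustration, and the project may order the classes differently.

```javascript
// ASSUMPTION: a hypothetical restrictiveness ranking, lowest = most open.
const RANK = {
  permissive: 0,
  weak_copyleft: 1,
  strong_copyleft: 2,
  responsible_ai: 3,
  custom: 4,
  noncommercial: 5,
  unspecified: 6,
};

// Pull the license id out of a repo's tags, e.g. ['license:mit'] -> 'mit'.
function licenseTag(tags) {
  const tag = (tags || []).find((t) => t.startsWith('license:'));
  return tag ? tag.slice('license:'.length) : null;
}

// Compare two snapshots of the same repo; return a drift event or null.
function detectDrift(prev, curr) {
  if (!prev) {
    return { repo: curr.repo, direction: 'first_seen', from: null, to: curr.cls };
  }
  if (prev.tag === curr.tag && prev.cls === curr.cls) return null; // unchanged
  const delta = RANK[curr.cls] - RANK[prev.cls];
  const direction = delta > 0 ? 'tighter' : delta < 0 ? 'looser' : 'lateral';
  return { repo: curr.repo, direction, from: prev.cls, to: curr.cls };
}
```

In this sketch a tag change that stays within one class (mit to apache-2.0, say) is reported as lateral, so the event log still records it.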

The SPA renders five tabs:

Distribution — license-class share across the top 500 (or top 100, or models created in the last 30 days), with a stacked bar, legend, and per-class top examples.
Drift Feed — append-only timeline of detected license changes. Filter by model/dataset/tighter-only.
Orgs — leaderboard of Hugging Face organisations sorted by openness score (% of their public top-tier models under a fully permissive license). Click a row for the full model list and a 30-day openness sparkline.
Datasets — same distribution view + drift feed, scoped to datasets.
Lookup — paste any repo id, get its current classification, drift history (where we have one), and 5 permissive alternatives sharing the same pipeline_tag.
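
The openness score on the Orgs tab can be computed roughly like this. It is a sketch under stated assumptions: the function names and the model shape ({ id, cls }) are hypothetical, the org is taken to be the part of the repo id before the slash, and the rounding is an illustrative choice.

```javascript
// Openness score: percentage of an org's models whose class is 'permissive'.
function opennessScore(models) {
  if (models.length === 0) return null; // nothing to score
  const permissive = models.filter((m) => m.cls === 'permissive').length;
  return Math.round((permissive / models.length) * 1000) / 10; // one decimal
}

// Group models by org and rank orgs by score, highest first.
function orgLeaderboard(models, minModels = 3) {
  const byOrg = new Map();
  for (const m of models) {
    if (!m.id.includes('/')) continue; // skip ids without a namespace
    const org = m.id.split('/')[0];
    if (!byOrg.has(org)) byOrg.set(org, []);
    byOrg.get(org).push(m);
  }
  return [...byOrg.entries()]
    .filter(([, list]) => list.length >= minModels)
    .map(([org, list]) => ({ org, models: list.length, openness: opennessScore(list) }))
    .sort((a, b) => b.openness - a.openness);
}
```

The minModels default mirrors the min_models=3 query parameter of the orgs endpoint, so an org with a single permissive model does not top the board.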

Every endpoint is public, and every GET endpoint is read-only; the one exception is POST /api/refresh, which triggers a snapshot. There is no auth: the underlying data is public, and viewing it shouldn't require credentials.

Stack

Node.js 20+
Express 4
better-sqlite3 (WAL mode)
node-cron
helmet, compression
Vanilla JS SPA, dark theme, English UI

No bundler. No frontend framework. No Tailwind. The SPA is three files in public/.

Endpoints

All routes are prefixed with BASE_PATH=/license-pulse.

| Method | Path | Description |
|--------|--------------------------------------------------|------------------------------------------------------------|
| GET | /license-pulse/ | SPA shell |
| GET | /license-pulse/health | {status:"ok",ts} — no auth, used by orchestrator |
| GET | /license-pulse/api/distribution?scope=top500 | License share across top 500 / top 100 / models created in the last 30 days |
| GET | /license-pulse/api/drift?limit=50&since=ISO | Drift event feed |
| GET | /license-pulse/api/orgs?limit=40&min_models=3 | Org openness leaderboard |
| GET | /license-pulse/api/orgs/:org | Org detail (model list + 30d sparkline) |
| GET | /license-pulse/api/datasets?scope=top200 | Dataset distribution + drift |
| GET | /license-pulse/api/lookup?id=<org/repo> | Live lookup with HF passthrough + alternatives |
| GET | /license-pulse/api/stats | Counts + last-run summary |
| POST | /license-pulse/api/refresh | Manual snapshot trigger (no auth) |
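
For callers, a small client for the GET routes might look like the sketch below. The base URL assumes a local instance on the default port; apiUrl and getJson are hypothetical helper names, and the response shapes are not specified here, so the sketch returns the parsed JSON untouched.

```javascript
// ASSUMPTION: local instance on the default port, with BASE_PATH=/license-pulse.
const BASE = 'http://localhost:4739/license-pulse';

// Build a URL for one of the GET endpoints, skipping undefined parameters.
function apiUrl(path, query = {}) {
  const url = new URL(`${BASE}${path}`);
  for (const [key, value] of Object.entries(query)) {
    if (value !== undefined) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// GET an endpoint and parse the JSON body; throws on non-2xx responses.
async function getJson(path, query) {
  const res = await fetch(apiUrl(path, query));
  if (!res.ok) throw new Error(`${path} responded ${res.status}`);
  return res.json();
}

// Usage (needs a running server):
// const drift = await getJson('/api/drift', { limit: 50, since: '2026-05-01T00:00:00Z' });
// const orgs = await getJson('/api/orgs', { limit: 40, min_models: 3 });
```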

Data sources

Public, unauthenticated, no API key required:

https://huggingface.co/api/models (with sort, direction, limit, pipeline_tag filters)
https://huggingface.co/api/datasets
https://huggingface.co/api/models/<repo_id> (single repo detail, used by /lookup)
https://huggingface.co/api/datasets/<repo_id>

Cron cadence:

0 */6 * * * (every six hours): full snapshot of top 500 models + top 200 datasets
30 4 * * * (daily at 04:30): daily summary tick (re-snapshots; future hook for aggregate tables)
On boot: kick a fresh run if the last successful run was more than 12 hours ago.
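
The cadence above can be sketched as two schedule constants plus a staleness check for the boot kick. The names are hypothetical, the cron strings assume standard five-field expressions for "every six hours" and "04:30 daily", and treating a missing last-run timestamp as stale is an assumption based on the DB starting empty.

```javascript
// ASSUMPTION: standard five-field cron expressions for the two jobs.
const SCHEDULES = {
  fullSnapshot: '0 */6 * * *', // minute 0 of every sixth hour
  dailySummary: '30 4 * * *', // 04:30 every day
};

const STALE_AFTER_MS = 12 * 60 * 60 * 1000; // 12 hours

// Decide on boot whether to kick a fresh run right away.
function shouldKickOnBoot(lastSuccessIso, now = new Date()) {
  if (!lastSuccessIso) return true; // ASSUMPTION: no successful run yet counts as stale
  const age = now.getTime() - new Date(lastSuccessIso).getTime();
  return age > STALE_AFTER_MS;
}
```

With node-cron, each expression would be passed to cron.schedule(expression, callback); the callback that runs a snapshot is left out here because its name and signature are not given in this README.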

Running locally

cp .env.example .env       # default port 4739
npm install                # better-sqlite3 builds against your local Node
npm start
# open http://localhost:4739/license-pulse/

The very first request after boot will show "no data yet — first cron run pending" until the boot-kick run finishes (~5–15 s depending on network).

Real-data discipline

No mock data. No Math.random() jitter. No hard-coded model lists.
Every chart, table and number originates from a fetch() to huggingface.co/api/... resolved at runtime.
If the HF API returns nothing, the UI says so honestly.
DB starts empty. The first cron run populates it.

License classification taxonomy

Implemented in lib/classify.js. Buckets:

permissive: apache-2.0, mit, bsd-*, cc0-1.0, cc-by-* (no NC), unlicense, wtfpl, 0bsd, isc, zlib, artistic-2.0, ODC/PDDL, EPL.
weak_copyleft: lgpl-*, mpl-*.
strong_copyleft: gpl-*, agpl-3.0, osl-3.0.
responsible_ai: anything containing openrail, rail, bigscience-bloom-rail, etc.
noncommercial: cc-by-nc-*, cc-by-nc-sa-*, mistral-research-license, mistral-ai-non-production-license, flux-1-dev, stable-cascade-nc-community.
custom: other, llama2/3/4, gemma, qwen, tongyi-qianwen, deepseek, falcon-180b, cohere-openweights, apple-amlr, cdla-*, etc.
unspecified: missing/null tag.

commercial_ok:
- permissive, weak_copyleft, strong_copyleft → 1
- noncommercial → 0
- everything else → -1 ("read the terms before shipping")
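
A simplified sketch of the taxonomy and the commercial_ok mapping follows. It is not the contents of lib/classify.js: it covers only a subset of the ids listed above (the ODC/PDDL and EPL entries are left out and would land in custom here), and the matching order is an assumption chosen so that cc-by-nc-* is tested before cc-by-* and lgpl-* before gpl-*.

```javascript
const NONCOMMERCIAL = new Set([
  'mistral-research-license',
  'mistral-ai-non-production-license',
  'flux-1-dev',
  'stable-cascade-nc-community',
]);
const PERMISSIVE = new Set([
  'apache-2.0', 'mit', 'cc0-1.0', 'unlicense', 'wtfpl',
  '0bsd', 'isc', 'zlib', 'artistic-2.0',
]);

// Map a license tag (with or without the 'license:' prefix) to a class.
function classify(tag) {
  if (!tag) return 'unspecified';
  const id = String(tag).toLowerCase().replace(/^license:/, '');
  // Order matters: NC variants first, otherwise cc-by-nc-* would match cc-by-*.
  if (id.startsWith('cc-by-nc') || NONCOMMERCIAL.has(id)) return 'noncommercial';
  if (id.includes('rail')) return 'responsible_ai';
  if (id.startsWith('lgpl-') || id.startsWith('mpl-')) return 'weak_copyleft';
  if (id.startsWith('gpl-') || id === 'agpl-3.0' || id === 'osl-3.0') return 'strong_copyleft';
  if (PERMISSIVE.has(id) || id.startsWith('bsd-') || id.startsWith('cc-by-')) return 'permissive';
  return 'custom'; // llama, gemma, qwen, 'other', and anything unrecognised
}

// 1 = commercial use allowed, 0 = not allowed, -1 = read the terms first.
function commercialOk(cls) {
  if (['permissive', 'weak_copyleft', 'strong_copyleft'].includes(cls)) return 1;
  if (cls === 'noncommercial') return 0;
  return -1;
}
```

Falling through to custom rather than permissive means an unrecognised license id gets commercial_ok = -1 and is never reported as safe by default.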

File layout

license-pulse/
├── server.js
├── db.js
├── package.json
├── .env.example
├── .gitignore
├── CLAUDE.md
├── README.md            ← you are here
├── SPEC.md
├── lib/
│   ├── classify.js
│   ├── hf.js
│   ├── drift.js
│   └── aggregates.js
├── routes/
│   ├── index.js
│   ├── distribution.js
│   ├── drift.js
│   ├── orgs.js
│   ├── datasets.js
│   ├── lookup.js
│   └── stats.js
└── public/
    ├── index.html
    ├── app.js
    └── style.css

Roadmap (post-MVP)

v1.1: license text full-text search + filterable export (CSV/JSON).
v1.2: webhook/RSS export of the drift feed.
v1.3: dataset → benchmark contamination crosswalk.
v2: account-gated alerts ("notify me when any model in this org becomes non-commercial").

Author: Cowork (Claude Opus 4.7) · 2026-05-08.