dataset-drop

Every meaningful new training & evaluation dataset on the HuggingFace Hub — without scrolling through 5,000 forks.

dataset-drop is a live, public dashboard for fresh open-source dataset releases on the HuggingFace Hub. It polls the public HF API every two hours, filters out the daily flood of toy uploads, mirrored gov-data scrapers, and personal scratch repos, and ranks the survivors by 7-day download velocity — the signal that separates a real release from a forgotten upload.

It is the dataset-side complement to weights-drop, which tracks new open-weight model releases.

Why this exists (May 2026 context)

The HuggingFace Hub receives roughly 5,000 new dataset uploads per day. The vast majority are personal one-file dumps, mirrored open-data scrapers, audio re-uploads of pre-existing corpora, or "test123" scratch repos. The genuinely interesting fine-tune-grade and eval-grade datasets (Llama-Nemotron-Post-Training v2, OpenMath-Reasoning v3, Pi-coding-agent traces, A2A agent-trace dumps, multilingual eval extensions) number maybe 20–80 per week, but they are buried.

The HF UI can sort by trending or by downloads, but the trending score is dominated by established hits, and sorting by downloads surfaces years-old anchors like wikitext. There is no public dashboard that tracks the release event itself and applies a noise filter.

dataset-drop is that page.

Live

Production: <https://holyai.me/dataset-drop/>
Health check: <https://holyai.me/dataset-drop/health>
JSON Feed: <https://holyai.me/dataset-drop/api/feed.json>
Atom: <https://holyai.me/dataset-drop/api/feed.atom>

Real data sources

All data is fetched at runtime from public APIs. No hardcoded dataset lists, no Math.random() jitter, no "preset seed" fallback. If an API is unreachable, the rows simply stay stale and the refresh log records the error.
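
The stale-on-failure behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `runRefresh`, `refreshLog`, and the injected `fetchRows` / `saveRows` callbacks are hypothetical names, and the log is an in-memory array standing in for whatever table the real refresh log uses.

```javascript
// Hypothetical in-memory stand-in for the refresh log.
const refreshLog = [];

// Run one refresh job. On failure, record the error and leave stored rows alone.
// fetchRows and saveRows are injected so the sketch needs no network or database.
async function runRefresh(source, fetchRows, saveRows) {
  const startedAt = new Date().toISOString();
  try {
    const rows = await fetchRows();
    saveRows(rows); // only reached when the upstream call succeeded
    refreshLog.push({ source, startedAt, ok: true, count: rows.length, error: null });
    return true;
  } catch (err) {
    // No fallback data is written: existing rows simply stay as they were.
    const message = err && err.message ? err.message : String(err);
    refreshLog.push({ source, startedAt, ok: false, count: 0, error: message });
    return false;
  }
}
```

The point of the shape is that the only write path sits behind a successful fetch, so there is no branch where seed or synthetic rows could be inserted.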

| Source | URL | Cron |
|---|---|---|
| HF datasets list (recent) | https://huggingface.co/api/datasets?sort=createdAt&direction=-1&limit=100&full=true × 30 pages | every 2 h |
| HF datasets list (trending) | https://huggingface.co/api/datasets?sort=trendingScore&direction=-1&limit=100&full=true × 5 pages | every 6 h |
| HF datasets list (top week) | https://huggingface.co/api/datasets?sort=downloads&direction=-1&limit=100&full=true × 5 pages | every 12 h |
| HF single dataset (drill-down) | https://huggingface.co/api/datasets/{repo_id} | on-demand + nightly velocity job |
| HF dataset README | https://huggingface.co/datasets/{repo_id}/raw/main/README.md | on-demand, cached 24 h |
| HF daily papers | https://huggingface.co/api/daily_papers?limit=50 | every 12 h |

The HF API requires no auth but accepts an optional bearer token to raise rate-limit headroom. The product runs fully anonymously by default.
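
As an illustration of the anonymous-by-default behaviour, here is a small sketch of how a list request might be assembled. `buildListRequest` is a hypothetical helper, and reading the token from an `HF_TOKEN` environment variable is an assumption; the source does not say how the optional token is configured.

```javascript
const HF_DATASETS_API = 'https://huggingface.co/api/datasets';

// Build the URL and headers for one list request.
// token is optional; when it is missing the request goes out anonymously.
function buildListRequest(sort, token) {
  const url = new URL(HF_DATASETS_API);
  url.searchParams.set('sort', sort);
  url.searchParams.set('direction', '-1');
  url.searchParams.set('limit', '100');
  url.searchParams.set('full', 'true');

  const headers = { Accept: 'application/json' };
  if (token) headers.Authorization = `Bearer ${token}`;
  return { url: url.toString(), headers };
}
```

A caller could then run `const { url, headers } = buildListRequest('createdAt', process.env.HF_TOKEN);` followed by `fetch(url, { headers })`, using the global `fetch` available in Node.js ≥ 20.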

Quality filter

Each ingested row gets a 0–100 quality_score. The default UI threshold is 30. Excluded rows (disabled, gated, or private repos) are hidden everywhere.

Additive points:

+15 — has a real license declared
+12 — has a task_categories tag
+10 — has a meaningful description (≥ 60 chars)
+10 — author has ≥ 5 other datasets
+15 — author is in known_orgs (dynamically recomputed nightly — owners with ≥ 50 datasets and ≥ 10 k aggregate downloads)
+15 — got ≥ 10 downloads in first 24 h
+20 — got ≥ 100 downloads in first 7 d
+10 — linked to an arXiv paper
+8 — has a pretty_name
+6 — has a size_categories tag

Penalties:

−20 — only region:* tags, nothing else
−25 — repo name matches the noise regex (test|tmp|asdf|untitled|copy|backup|aaaa|delete|dummy)
−10 — repo id shorter than 6 chars
excluded — disabled OR gated OR private
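
The rules above can be sketched as a single scoring function. This is a hedged illustration only: the field names (`taskCategories`, `downloads24h`, `authorInKnownOrgs`, and so on) are hypothetical, "real license" is approximated as any non-empty license value, the noise regex is applied case-insensitively to the repo name, the length check uses the full id, and the result is clamped to 0–100 because the additive points alone sum to more than 100. None of those details are confirmed by the rules as written.

```javascript
// Noise pattern copied from the penalty list; the case-insensitive flag is an assumption.
const NOISE_RE = /(test|tmp|asdf|untitled|copy|backup|aaaa|delete|dummy)/i;

function qualityScore(d) {
  // Excluded rows get no score at all.
  if (d.disabled || d.gated || d.private) return null;

  let score = 0;
  if (d.license) score += 15;
  if ((d.taskCategories || []).length > 0) score += 12;
  if ((d.description || '').trim().length >= 60) score += 10;
  if (d.authorOtherDatasets >= 5) score += 10;
  if (d.authorInKnownOrgs) score += 15;
  if (d.downloads24h >= 10) score += 15;
  if (d.downloads7d >= 100) score += 20;
  if (d.arxivId) score += 10;
  if (d.prettyName) score += 8;
  if ((d.sizeCategories || []).length > 0) score += 6;

  const tags = d.tags || [];
  if (tags.length > 0 && tags.every((t) => t.startsWith('region:'))) score -= 20;
  const repoName = d.id.split('/').pop();
  if (NOISE_RE.test(repoName)) score -= 25;
  if (d.id.length < 6) score -= 10;

  // Clamp to the documented 0-100 range (assumed behaviour).
  return Math.max(0, Math.min(100, score));
}
```

For example, a row with only a license, a task category, and a 60-character description would score 37 under this sketch, just above the default threshold of 30.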

The known_orgs list is never hard-coded; it is recomputed nightly from the local datasets table.
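
One way to picture the nightly recompute is as a group-by over the stored rows. The sketch below works on a plain array rather than the SQLite table, and `computeKnownOrgs` plus the `owner` / `downloads` field names are hypothetical; only the two thresholds (≥ 50 datasets, ≥ 10 k aggregate downloads) come from the scoring rules.

```javascript
// Derive known orgs from stored rows: owners with enough datasets AND enough total downloads.
function computeKnownOrgs(rows, minDatasets = 50, minDownloads = 10000) {
  const totals = new Map(); // owner -> { datasets, downloads }
  for (const { owner, downloads } of rows) {
    const t = totals.get(owner) || { datasets: 0, downloads: 0 };
    t.datasets += 1;
    t.downloads += downloads || 0;
    totals.set(owner, t);
  }
  return [...totals.entries()]
    .filter(([, t]) => t.datasets >= minDatasets && t.downloads >= minDownloads)
    .map(([owner]) => owner)
    .sort();
}
```

Because both conditions must hold, a prolific owner with almost no downloads and a single viral upload from a small account are both left out.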

Stack

Node.js ≥ 20, Express 4
better-sqlite3 (WAL)
node-cron
helmet, compression
Vanilla JS SPA (no framework), dark theme, English UI

URL surface

All routes mounted under BASE_PATH=/dataset-drop. No auth on any endpoint (Arda's pipeline rule — instant inspection).

Read endpoints (60 req/min/IP):

GET /dataset-drop/health
GET /dataset-drop/api/datasets?days=7&task=text-generation&license=apache-2.0&min_quality=40&q=&limit=50&offset=0&sort=created|downloads_24h|downloads_7d|velocity|likes|quality
GET /dataset-drop/api/datasets/:owner/:repo
GET /dataset-drop/api/leaderboard/today|week|velocity
GET /dataset-drop/api/authors
GET /dataset-drop/api/tasks
GET /dataset-drop/api/licenses
GET /dataset-drop/api/stats
GET /dataset-drop/api/refresh-log
GET /dataset-drop/api/known-orgs
GET /dataset-drop/api/feed.json
GET /dataset-drop/api/feed.atom
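
For clients, the query parameters of the datasets endpoint can be assembled as in the sketch below. `datasetsUrl` is a hypothetical helper; the parameter names and the allowed `sort` values are taken from the route listed above, while dropping empty values is a convenience assumption.

```javascript
// Allowed sort keys, as listed for GET /dataset-drop/api/datasets.
const SORTS = ['created', 'downloads_24h', 'downloads_7d', 'velocity', 'likes', 'quality'];

// Build a datasets query URL. base must end with a slash, e.g. the production base URL.
function datasetsUrl(base, filters = {}) {
  if (filters.sort !== undefined && !SORTS.includes(filters.sort)) {
    throw new RangeError(`unsupported sort: ${filters.sort}`);
  }
  const url = new URL('api/datasets', base);
  for (const [key, value] of Object.entries(filters)) {
    // Skip unset or empty filters instead of sending blank parameters.
    if (value === undefined || value === null || value === '') continue;
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}
```

A call such as `datasetsUrl('https://holyai.me/dataset-drop/', { days: 7, min_quality: 40, sort: 'velocity' })` yields a URL that can be passed straight to `fetch`. The response shape is not documented here, so inspect the JSON before relying on specific fields.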

Write/trigger endpoints (6 req/min/IP):

POST /dataset-drop/api/refresh/recent
POST /dataset-drop/api/refresh/trending
POST /dataset-drop/api/refresh/topweek
POST /dataset-drop/api/refresh/daily-papers
POST /dataset-drop/api/recompute-velocity
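
The two per-IP limits (60 req/min for reads, 6 req/min for triggers) could be enforced with something as simple as a fixed-window counter. The sketch below is one possible approach, not necessarily what the server uses: the algorithm choice and the names `createLimiter`, `readLimiter`, and `writeLimiter` are assumptions.

```javascript
// Fixed-window limiter: at most maxPerWindow hits per IP in each windowMs window.
function createLimiter(maxPerWindow, windowMs = 60000) {
  const hits = new Map(); // ip -> { windowStart, count }
  return function allow(ip, now = Date.now()) {
    const entry = hits.get(ip);
    if (!entry || now - entry.windowStart >= windowMs) {
      // First hit from this IP, or the previous window has expired: start a new window.
      hits.set(ip, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= maxPerWindow) return false;
    entry.count += 1;
    return true;
  };
}

const readLimiter = createLimiter(60); // GET endpoints
const writeLimiter = createLimiter(6); // POST trigger endpoints
```

In an Express middleware, a `false` result would typically translate into an HTTP 429 response. The `now` parameter exists only so the window logic can be exercised without waiting a real minute.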

Running

npm install
node server.js
# → dataset-drop 1.0.0 listening on :4893/dataset-drop/  (no auth)

The server creates ./data/dataset-drop.db on first boot, fires a single-page recent fetch if the table is empty, and schedules all five cron jobs.
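
The scheduling step might look roughly like the sketch below. Treat it as an assumption-heavy illustration: the job names mirror the trigger endpoints, the cron expressions are one way to express "every 2 h / 6 h / 12 h", and the 03:00 time for the nightly velocity job is invented for the example. The scheduler is injected so the sketch runs without node-cron installed.

```javascript
// Hypothetical job table; intervals follow the data-source table, nightly time is assumed.
const JOBS = [
  { name: 'recent', cron: '0 */2 * * *' },
  { name: 'trending', cron: '0 */6 * * *' },
  { name: 'topweek', cron: '0 */12 * * *' },
  { name: 'daily-papers', cron: '0 */12 * * *' },
  { name: 'recompute-velocity', cron: '0 3 * * *' },
];

// Register every job with the given scheduler, e.g. (expr, fn) => cron.schedule(expr, fn).
function registerJobs(schedule, handlers) {
  return JOBS.map(({ name, cron }) => {
    if (typeof handlers[name] !== 'function') {
      throw new Error(`no handler for job: ${name}`);
    }
    schedule(cron, handlers[name]);
    return name;
  });
}
```

With node-cron this would be wired up as `registerJobs((expr, fn) => cron.schedule(expr, fn), handlers)`, where `cron` is the required node-cron module and `handlers` maps each job name to its refresh function.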

License

UNLICENSED (private, deployed under holyai.me).