dataset-drop
Every meaningful new training & evaluation dataset on the HuggingFace Hub — without scrolling through 5,000 forks.
dataset-drop is a live, public dashboard for fresh open-source dataset releases on the HuggingFace Hub. It polls the public HF API every two hours, filters out the daily flood of toy uploads, mirrored gov-data scrapers, and personal scratch repos, and ranks the survivors by 7-day download velocity — the signal that separates a real release from a forgotten upload.
It is the dataset-side complement to weights-drop, which tracks new open-weight model releases.
Why this exists (May 2026 context)
The HuggingFace Hub receives roughly 5,000 new dataset uploads per day. The vast majority are personal one-file dumps, mirrored open-data scrapers, audio re-uploads of pre-existing corpora, or "test123" scratch repos. The genuinely interesting fine-tune-grade and eval-grade datasets (Llama-Nemotron-Post-Training v2, OpenMath-Reasoning v3, Pi-coding-agent traces, A2A agent-trace dumps, multilingual eval extensions) number maybe 20–80 per week, but they're buried.
The HF UI sorts by trending or downloads, but the trending score is dominated by old hits, and the default downloads sort surfaces decade-old anchors like wikitext. There is no public dashboard that tracks the release event itself and applies a noise filter.
dataset-drop is that page.
Live
- Production: <https://holyai.me/dataset-drop/>
- Health check: <https://holyai.me/dataset-drop/health>
- JSON Feed: <https://holyai.me/dataset-drop/api/feed.json>
- Atom: <https://holyai.me/dataset-drop/api/feed.atom>
Real data sources
All data is fetched at runtime from public APIs. No hardcoded dataset lists, no Math.random() jitter, no "preset seed" fallback. If an API is unreachable, the rows simply stay stale and the refresh log records the error.
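The stale-on-error behaviour described above could be sketched as follows. This is a minimal illustration, not the project's code: `runRefresh`, `fetchRows`, `saveRows`, and the shape of the log entries are all hypothetical names chosen for the example.

```javascript
// Hypothetical sketch: run one refresh job. On failure, stored rows are left
// untouched (they simply go stale) and the error lands in the refresh log.
async function runRefresh(source, fetchRows, saveRows, refreshLog) {
  const startedAt = new Date().toISOString();
  try {
    const rows = await fetchRows(); // may throw if the API is unreachable
    saveRows(rows);                 // only reached when the fetch succeeded
    refreshLog.push({ source, startedAt, ok: true, count: rows.length, error: null });
    return true;
  } catch (err) {
    // No seed or fallback data is written here.
    refreshLog.push({ source, startedAt, ok: false, count: 0, error: String(err.message || err) });
    return false;
  }
}
```

The key design point is that `saveRows` is never called with substitute data, so a failed poll can only make the dashboard older, never wrong.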
| Source | URL | Cron |
|---|---|---|
| HF datasets list (recent) | https://huggingface.co/api/datasets?sort=createdAt&direction=-1&limit=100&full=true × 30 pages | every 2 h |
| HF datasets list (trending) | https://huggingface.co/api/datasets?sort=trendingScore&direction=-1&limit=100&full=true × 5 pages | every 6 h |
| HF datasets list (top week) | https://huggingface.co/api/datasets?sort=downloads&direction=-1&limit=100&full=true × 5 pages | every 12 h |
| HF single dataset (drill-down) | https://huggingface.co/api/datasets/{repo_id} | on-demand + nightly velocity job |
| HF dataset README | https://huggingface.co/datasets/{repo_id}/raw/main/README.md | on-demand, cached 24 h |
| HF daily papers | https://huggingface.co/api/daily_papers?limit=50 | every 12 h |
The HF API requires no auth but accepts an optional bearer token to raise rate-limit headroom. The product runs fully anonymously by default.
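A paged, optionally authenticated fetch against the list endpoints might look like the sketch below. It assumes that the Hub exposes the next page through a standard `Link` response header with `rel="next"`, and that the token is supplied in an environment variable named `HF_TOKEN`; the helper names `buildHeaders`, `nextPageUrl`, and `fetchPages` are hypothetical.

```javascript
// Build request headers; the Authorization header is added only when a token is given.
function buildHeaders(token) {
  const headers = { Accept: "application/json" };
  if (token) headers.Authorization = `Bearer ${token}`;
  return headers;
}

// Extract the rel="next" URL from an HTTP Link header, or null if there is none.
function nextPageUrl(linkHeader) {
  if (!linkHeader) return null;
  const match = linkHeader.match(/<([^>]+)>\s*;\s*rel="?next"?/i);
  return match ? match[1] : null;
}

// Fetch up to maxPages pages with Node's built-in fetch (Node >= 18).
// HF_TOKEN is an assumed variable name; leaving it unset keeps requests anonymous.
async function fetchPages(firstUrl, maxPages, token = process.env.HF_TOKEN) {
  const rows = [];
  let url = firstUrl;
  for (let page = 0; url && page < maxPages; page++) {
    const res = await fetch(url, { headers: buildHeaders(token) });
    if (!res.ok) throw new Error(`HF API responded ${res.status} for ${url}`);
    rows.push(...(await res.json()));
    url = nextPageUrl(res.headers.get("link"));
  }
  return rows;
}
```

Under these assumptions, the "recent" job would call something like `fetchPages(recentUrl, 30)` with the first URL from the table above.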
Quality filter
Each ingested row gets a 0–100 quality_score. Default UI threshold is 30. Anything excluded (disabled, gated, private) is hidden everywhere.
Additive points:
- +15: has a real license declared
- +12: has a task_categories tag
- +10: has a meaningful description (≥ 60 chars)
- +10: author has ≥ 5 other datasets
- +15: author is in known_orgs (dynamically recomputed nightly; owners with ≥ 50 datasets and ≥ 10 k aggregate downloads)
- +15: got ≥ 10 downloads in first 24 h
- +20: got ≥ 100 downloads in first 7 d
- +10: linked to an arXiv paper
- +8: has a pretty_name
- +6: has a size_categories tag
Penalties:
- −20: only region:* tags, nothing else
- −25: repo name matches the noise regex (test|tmp|asdf|untitled|copy|backup|aaaa|delete|dummy)
- −10: repo id shorter than 6 chars
- excluded: disabled OR gated OR private
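The point table above can be read as a single scoring function. The sketch below is one possible reading, with several assumptions that are not stated in this README: the row field names are hypothetical, a "real" license is taken to mean anything other than `unknown` or `other`, the noise regex is matched case-insensitively against the part of the id after the slash, and the total is clamped to 0–100 (the additive points alone sum to 121).

```javascript
// Assumption: case-insensitive match; the README only gives the alternation.
const NOISE_RE = /test|tmp|asdf|untitled|copy|backup|aaaa|delete|dummy/i;

// Returns null for excluded rows, otherwise a score clamped to 0..100.
// All field names on `row` are hypothetical.
function qualityScore(row) {
  if (row.disabled || row.gated || row.isPrivate) return null;
  const tags = row.tags || [];
  const repoName = row.repoId.split("/").pop();
  let score = 0;
  if (row.license && !["unknown", "other"].includes(row.license)) score += 15;
  if ((row.taskCategories || []).length > 0) score += 12;
  if ((row.description || "").trim().length >= 60) score += 10;
  if (row.authorOtherDatasets >= 5) score += 10;
  if (row.authorInKnownOrgs) score += 15;
  if (row.downloads24h >= 10) score += 15;
  if (row.downloads7d >= 100) score += 20;
  if (row.arxivLinked) score += 10;
  if (row.prettyName) score += 8;
  if ((row.sizeCategories || []).length > 0) score += 6;
  // Penalties
  if (tags.length > 0 && tags.every((t) => t.startsWith("region:"))) score -= 20;
  if (NOISE_RE.test(repoName)) score -= 25;
  if (row.repoId.length < 6) score -= 10;
  return Math.max(0, Math.min(100, score));
}
```

With the default UI threshold of 30, a brand-new dataset that declares a license, a task category, and a 60-character description already scores 37 and is shown, while a `test123` repo carrying only a region tag bottoms out at 0.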
The known_orgs list is never hard-coded; it is recomputed nightly from the local datasets table.
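As an illustration of that nightly recompute, the sketch below derives the list from one entry per dataset. The input shape (`owner`, `downloads`) and the function name `computeKnownOrgs` are hypothetical; in the real service this is presumably a grouped query over the datasets table, and plain arrays are used here only so the example runs without a database.

```javascript
// rows: one entry per dataset, e.g. { owner: "acme", downloads: 1234 } (assumed shape).
// Returns owners with at least minDatasets datasets and minDownloads total downloads.
function computeKnownOrgs(rows, minDatasets = 50, minDownloads = 10_000) {
  const totals = new Map(); // owner -> { datasets, downloads }
  for (const { owner, downloads } of rows) {
    const t = totals.get(owner) || { datasets: 0, downloads: 0 };
    t.datasets += 1;
    t.downloads += downloads || 0;
    totals.set(owner, t);
  }
  return [...totals.entries()]
    .filter(([, t]) => t.datasets >= minDatasets && t.downloads >= minDownloads)
    .map(([owner]) => owner)
    .sort();
}
```

Both thresholds must hold at once: an owner with many datasets but negligible downloads, or one viral dataset and nothing else, stays off the list.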
Stack
- Node.js ≥ 20, Express 4
- better-sqlite3 (WAL)
- node-cron
- helmet, compression
- Vanilla JS SPA (no framework), dark theme, English UI
URL surface
All routes mounted under BASE_PATH=/dataset-drop. No auth on any endpoint (Arda's pipeline rule — instant inspection).
Read endpoints (60 req/min/IP):
- GET /dataset-drop/health
- GET /dataset-drop/api/datasets?days=7&task=text-generation&license=apache-2.0&min_quality=40&q=&limit=50&offset=0&sort=created|downloads_24h|downloads_7d|velocity|likes|quality
- GET /dataset-drop/api/datasets/:owner/:repo
- GET /dataset-drop/api/leaderboard/today|week|velocity
- GET /dataset-drop/api/authors
- GET /dataset-drop/api/tasks
- GET /dataset-drop/api/licenses
- GET /dataset-drop/api/stats
- GET /dataset-drop/api/refresh-log
- GET /dataset-drop/api/known-orgs
- GET /dataset-drop/api/feed.json
- GET /dataset-drop/api/feed.atom
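A client could assemble a query against the datasets endpoint as in the sketch below. The helper `datasetsUrl` is hypothetical; the parameter names and sort keys are the ones listed above, and the response body is not documented here, so the example stops at building the URL.

```javascript
// Sort keys accepted by the datasets endpoint, as listed in this README.
const SORT_KEYS = ["created", "downloads_24h", "downloads_7d", "velocity", "likes", "quality"];

// Build a /api/datasets URL relative to the deployment base, skipping empty params.
function datasetsUrl(base, params = {}) {
  if (params.sort !== undefined && !SORT_KEYS.includes(params.sort)) {
    throw new RangeError(`unsupported sort key: ${params.sort}`);
  }
  const url = new URL("api/datasets", base); // base must end with a slash
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== null && value !== "") {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

const example = datasetsUrl("https://holyai.me/dataset-drop/", {
  days: 7, task: "text-generation", min_quality: 40, sort: "velocity", limit: 50,
});
// example === "https://holyai.me/dataset-drop/api/datasets?days=7&task=text-generation&min_quality=40&sort=velocity&limit=50"
```

The resulting string can be passed straight to `fetch`; since no endpoint requires auth, no headers are needed.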
Write/trigger endpoints (6 req/min/IP):
- POST /dataset-drop/api/refresh/recent
- POST /dataset-drop/api/refresh/trending
- POST /dataset-drop/api/refresh/topweek
- POST /dataset-drop/api/refresh/daily-papers
- POST /dataset-drop/api/recompute-velocity
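The two per-IP budgets (60 req/min for reads, 6 req/min for triggers) could be enforced with a small in-memory limiter like the one sketched here. This README does not say which limiter the service actually uses, so the fixed-window approach and the names `createLimiter` and `rateLimit` are assumptions for illustration.

```javascript
// Fixed-window, in-memory limiter keyed by client IP (hypothetical helper).
// `now` is injectable so the window logic can be exercised without waiting.
function createLimiter(maxPerMinute, now = () => Date.now()) {
  const windows = new Map(); // ip -> { start, count }
  return function allow(ip) {
    const t = now();
    const w = windows.get(ip);
    if (!w || t - w.start >= 60_000) {
      windows.set(ip, { start: t, count: 1 }); // open a fresh one-minute window
      return true;
    }
    w.count += 1;
    return w.count <= maxPerMinute;
  };
}

const allowRead = createLimiter(60);
const allowWrite = createLimiter(6);

// Express-style middleware: GET requests use the read budget, the rest the trigger budget.
function rateLimit(req, res, next) {
  const allow = req.method === "GET" ? allowRead : allowWrite;
  if (allow(req.ip)) return next();
  res.status(429).json({ error: "rate limit exceeded" });
}
```

A fixed window is the simplest option but allows short bursts at window boundaries, and this sketch never evicts old entries from the map; a long-running deployment would want periodic cleanup.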
Running
npm install
node server.js
# → dataset-drop 1.0.0 listening on :4893/dataset-drop/ (no auth)
The server creates ./data/dataset-drop.db on first boot, fires a single-page recent fetch if the table is empty, and schedules all five cron jobs.
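The five schedules could be registered roughly as sketched below. The intervals come from the data-sources table; the job names, the exact minute each job fires, and the 03:00 slot for the nightly velocity job are assumptions. The `cron` argument is expected to be the node-cron module, passed in rather than required so the sketch has no dependencies.

```javascript
// Cron expressions for the five jobs. Intervals follow the data-sources table;
// the minute offsets and the 03:00 nightly slot are assumed.
const SCHEDULES = {
  recent: "0 */2 * * *",          // every 2 h
  trending: "0 */6 * * *",        // every 6 h
  topweek: "0 */12 * * *",        // every 12 h
  "daily-papers": "0 */12 * * *", // every 12 h
  velocity: "0 3 * * *",          // nightly
};

// Register every job with node-cron's schedule(expression, fn).
// `jobs` maps the names above to functions; returns the names that were scheduled.
function scheduleAll(cron, jobs) {
  const scheduled = [];
  for (const [name, expression] of Object.entries(SCHEDULES)) {
    if (typeof jobs[name] !== "function") throw new Error(`missing job: ${name}`);
    cron.schedule(expression, jobs[name]);
    scheduled.push(name);
  }
  return scheduled;
}
```

In a server this would be called once at boot, for example `scheduleAll(require("node-cron"), jobs)`, after the first-boot fetch has been kicked off.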
License
UNLICENSED (private, deployed under holyai.me).