
Dataset Drop

Every meaningful new HuggingFace dataset, filtered for signal and ranked by 7-day velocity

Tags: ai, huggingface, datasets, fine-tuning, open-source, ml, discovery, leaderboard

dataset-drop

Every meaningful new training & evaluation dataset on the HuggingFace Hub — without scrolling through 5,000 forks.

dataset-drop is a live, public dashboard for fresh open-source dataset releases on the HuggingFace Hub. It polls the public HF API every two hours, filters out the daily flood of toy uploads, mirrored gov-data scrapers, and personal scratch repos, and ranks the survivors by 7-day download velocity — the signal that separates a real release from a forgotten upload.
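As a rough illustration of the ranking step, the sketch below derives a 7-day velocity from stored download snapshots and sorts rows by it. The snapshot shape (`at`, `downloads`), the function names, and the choice of baseline are assumptions made for this example; the project's actual velocity job may compute the figure differently.

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// snapshots: [{ at: epochMs, downloads: number }] in any order (hypothetical shape).
// Returns the downloads gained between a baseline snapshot and the latest one.
function sevenDayVelocity(snapshots, now = Date.now()) {
  if (snapshots.length < 2) return 0;
  const sorted = [...snapshots].sort((a, b) => a.at - b.at);
  const latest = sorted[sorted.length - 1];
  const cutoff = now - WEEK_MS;
  // Baseline: the last snapshot taken at or before the cutoff,
  // falling back to the earliest snapshot for repos younger than a week.
  let baseline = sorted[0];
  for (const s of sorted) {
    if (s.at <= cutoff) baseline = s;
  }
  // Clamp at zero in case the upstream counter ever decreases.
  return Math.max(0, latest.downloads - baseline.downloads);
}

// rows: [{ id, snapshots }] -> new array sorted by descending velocity.
function rankByVelocity(rows, now = Date.now()) {
  return rows
    .map((r) => ({ ...r, velocity: sevenDayVelocity(r.snapshots, now) }))
    .sort((a, b) => b.velocity - a.velocity);
}
```

Falling back to the earliest snapshot means a repo created three days ago is ranked on everything it has gained so far, which keeps brand-new releases comparable with week-old ones.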

It is the dataset-side complement to weights-drop, which tracks new open-weight model releases.

Why this exists (May 2026 context)

The HuggingFace Hub receives roughly 5,000 new dataset uploads per day. The vast majority are personal one-file dumps, mirrored open-data scrapers, audio re-uploads of pre-existing corpora, or "test123" scratch repos. The genuinely interesting fine-tune-grade and eval-grade datasets (Llama-Nemotron-Post-Training v2, OpenMath-Reasoning v3, Pi-coding-agent traces, A2A agent-trace dumps, multilingual eval extensions) number perhaps 20–80 per week, and they are buried under that noise.

The HF UI sorts by trending or by downloads, but the trending score is dominated by established hits, and sorting by downloads surfaces years-old anchors like wikitext. There is no public dashboard that tracks the release event itself and applies a noise filter.

dataset-drop is that page.

Live

Real data sources

All data is fetched at runtime from public APIs. No hardcoded dataset lists, no Math.random() jitter, no "preset seed" fallback. If an API is unreachable, the rows simply stay stale and the refresh log records the error.
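The stale-on-failure behaviour described above could look roughly like the following. This is a minimal sketch, not the project's code: `refreshSource`, the in-memory `Map` standing in for the datasets table, and the log entry fields are all hypothetical.

```javascript
// Run one refresh. On success, upsert the fetched rows; on failure, leave the
// existing rows untouched and record the error in the refresh log.
// `table` is a Map keyed by repo id, used here as a stand-in for the real store.
async function refreshSource(name, fetchRows, table, refreshLog) {
  const startedAt = new Date().toISOString();
  try {
    const rows = await fetchRows();
    for (const row of rows) table.set(row.id, row);
    refreshLog.push({ source: name, startedAt, ok: true, count: rows.length });
  } catch (err) {
    // No fallback data is written: the rows simply stay stale.
    const message = err && err.message ? err.message : String(err);
    refreshLog.push({ source: name, startedAt, ok: false, error: message });
  }
}
```

Because the catch branch never writes to the table, a failed poll cannot replace real rows with placeholder data; the only trace it leaves is the log entry.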

| Source | URL | Pages | Schedule |
|---|---|---|---|
| HF datasets list (recent) | https://huggingface.co/api/datasets?sort=createdAt&direction=-1&limit=100&full=true | 30 | every 2 h |
| HF datasets list (trending) | https://huggingface.co/api/datasets?sort=trendingScore&direction=-1&limit=100&full=true | 5 | every 6 h |
| HF datasets list (top week) | https://huggingface.co/api/datasets?sort=downloads&direction=-1&limit=100&full=true | 5 | every 12 h |
| HF single dataset (drill-down) | https://huggingface.co/api/datasets/{repo_id} | - | on-demand + nightly velocity job |
| HF dataset README | https://huggingface.co/datasets/{repo_id}/raw/main/README.md | - | on-demand, cached 24 h |
| HF daily papers | https://huggingface.co/api/daily_papers?limit=50 | - | every 12 h |
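The multi-page list fetches in the table can be sketched as a loop that follows the `Link: <...>; rel="next"` response header, which is how the Hub's list endpoints expose the next page. The helper names and the `fetchImpl` parameter are assumptions for this example, and the header parser is deliberately simplified (it expects `rel` to be the first parameter after the URL).

```javascript
// Extract the rel="next" target from a Link header, or null if there is none.
function parseNextLink(linkHeader) {
  if (!linkHeader) return null;
  const match = linkHeader.match(/<([^>]+)>\s*;\s*rel="?next"?/i);
  return match ? match[1] : null;
}

// Fetch up to maxPages pages starting at firstUrl and return the merged rows.
// fetchImpl defaults to the global fetch available in Node 18+.
async function fetchPages(firstUrl, maxPages, fetchImpl = fetch) {
  const rows = [];
  let url = firstUrl;
  for (let page = 0; url && page < maxPages; page++) {
    const res = await fetchImpl(url);
    if (!res.ok) throw new Error(`HF API returned ${res.status} for ${url}`);
    rows.push(...(await res.json()));
    url = parseNextLink(res.headers.get('link'));
  }
  return rows;
}
```

With this shape, the "recent" job would call `fetchPages(recentUrl, 30)` and stop early if the API runs out of pages before the cap is reached.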

The HF API requires no auth but accepts an optional bearer token to raise rate-limit headroom. The product runs fully anonymously by default.
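One way to keep the anonymous default while allowing an optional token is to build request headers from an environment variable. The variable name `HF_TOKEN` and the helper `hfHeaders` are assumptions for this sketch; the server may read its token from somewhere else.

```javascript
// Build request headers for the HF API. With no token, the request stays
// anonymous; with a token, a standard bearer Authorization header is added.
function hfHeaders(token = process.env.HF_TOKEN) {
  const headers = { Accept: 'application/json' };
  if (token) headers.Authorization = `Bearer ${token}`;
  return headers;
}
```

A call such as `fetch(url, { headers: hfHeaders() })` then behaves identically in both modes, so no other code needs to know whether a token is configured.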

Quality filter

Each ingested row gets a quality_score from 0 to 100. The default UI threshold is 30. Datasets that are disabled, gated, or private are excluded and hidden everywhere, regardless of score.
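The visibility rule can be expressed as a small predicate. In this sketch the field names `disabled`, `gated`, and `private` follow the HF API response, while `visibleRows`, `isExcluded`, and the inclusive comparison against the threshold are assumptions.

```javascript
const DEFAULT_THRESHOLD = 30;

// A row is excluded outright if the Hub marks it disabled, gated, or private.
// `gated` may be false or a string such as "auto" or "manual", hence Boolean().
function isExcluded(row) {
  return Boolean(row.disabled) || Boolean(row.gated) || Boolean(row.private);
}

// Keep rows that are not excluded and meet the score threshold (inclusive here).
function visibleRows(rows, threshold = DEFAULT_THRESHOLD) {
  return rows.filter((r) => !isExcluded(r) && r.quality_score >= threshold);
}
```

Checking exclusion before the score means a gated dataset with a score of 95 is still hidden.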

Additive points:

Penalties:

The known_orgs list is never hard-coded; it is recomputed nightly from the local datasets table.

Stack

URL surface

All routes are mounted under BASE_PATH=/dataset-drop. There is no auth on any endpoint (Arda's pipeline rule: instant inspection).

Read endpoints (60 req/min/IP):

Write/trigger endpoints (6 req/min/IP):
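The two per-IP limits above (60 requests per minute for reads, 6 for writes) could be enforced with a simple fixed-window counter. This is only an illustrative sketch; `createRateLimiter` is a hypothetical helper, and the server may use a different algorithm or an off-the-shelf middleware.

```javascript
// Fixed-window limiter: allow at most `limit` calls per IP per window.
function createRateLimiter(limit, windowMs = 60000) {
  const hits = new Map(); // ip -> { windowStart, count }
  return function allow(ip, now = Date.now()) {
    const entry = hits.get(ip);
    // Start a fresh window on first contact or once the old window has elapsed.
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(ip, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}

const allowRead = createRateLimiter(60); // read endpoints
const allowWrite = createRateLimiter(6); // write/trigger endpoints
```

A request handler would call `allowRead(req.ip)` or `allowWrite(req.ip)` and answer with HTTP 429 when the result is `false`. The map grows with the number of distinct IPs, so a long-running process would also want to evict expired entries periodically.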

Running

```shell
npm install
node server.js
# → dataset-drop 1.0.0 listening on :4893/dataset-drop/  (no auth)
```

The server creates ./data/dataset-drop.db on first boot, fires a single-page recent fetch if the table is empty, and schedules all five cron jobs.

License

UNLICENSED (private, deployed under holyai.me).