needle-board

The public leaderboard of sub-1B function-calling models — the slice
of the LLM universe that fits on phones, smart glasses, and background
agents. Berkeley's BFCL is dominated by frontier models; needle-board
filters down to the tiny models that builders shipping on-device AI
actually evaluate.

Every score is fetched live from a real source. No mocks. No synthesized
numbers. If a model has no published BFCL score and no self-reported
score in its model card, the BFCL column renders "—" rather than a guess.

What it does

- Tracks ~40 curated HuggingFace repositories of function-calling
  specialists, mostly sub-1B, plus a few larger reference models.
- Pulls model metadata (params, downloads, license, tags, README) from
  the HuggingFace public API every 6 hours.
- Pulls the Berkeley Function-Calling Leaderboard (BFCL v4) CSV daily
  at 03:00 UTC and cross-references it against the watchlist by
  normalized model name.
- Extracts self-reported BFCL / function-calling accuracy numbers from
  each model's README with conservative regexes. These are flagged as
  "self-reported" in the UI so readers can tell them apart from
  official leaderboard scores.
- Surfaces derived columns: estimated q4_K_M size (params × 0.55 bytes,
  so a 500M-parameter model comes to about 275 MB) and an on_device_ok
  flag (q4 size ≤ 1 GB).
- Renders a sortable, filterable leaderboard plus a per-model detail
  drawer and a screenshot-ready share card.
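
The derived columns can be sketched in a few lines. This is a minimal illustration, not the contents of lib/derive.js: the function name `deriveColumns`, the rounding to whole megabytes, and reading "1 GB" as 1024 MB are all assumptions.

```javascript
// Assumption: 0.55 bytes per parameter approximates q4_K_M weights.
const Q4_BYTES_PER_PARAM = 0.55;
// Assumption: "1 GB" is read as 1024 MB; the project may use 1000 instead.
const ON_DEVICE_LIMIT_MB = 1024;

// paramsM is the parameter count in millions (e.g. 500 for a 0.5B model).
// Millions of params × bytes per param gives megabytes directly.
function deriveColumns(paramsM) {
  if (!Number.isFinite(paramsM) || paramsM <= 0) {
    // Unknown parameter count: leave both derived columns empty.
    return { size_mb_q4: null, on_device_ok: null };
  }
  const sizeMb = Math.round(paramsM * Q4_BYTES_PER_PARAM);
  return { size_mb_q4: sizeMb, on_device_ok: sizeMb <= ON_DEVICE_LIMIT_MB };
}

console.log(deriveColumns(500));  // a 0.5B model
console.log(deriveColumns(3000)); // a 3B reference model
```

Returning null for an unknown parameter count keeps the table consistent with the "no guesses" rule: a missing input yields a missing derived value.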

Data sources

| Source | URL | Frequency |
|---|---|---|
| HuggingFace model metadata | GET https://huggingface.co/api/models/{repo_id}?full=true | every 6 h per model |
| HuggingFace README | GET https://huggingface.co/{repo_id}/raw/main/README.md (fallback to /master/) | once per HF refresh |
| Berkeley BFCL leaderboard | GET https://gorilla.cs.berkeley.edu/data_overall.csv (canonical CSV the leaderboard HTML page fetches at runtime) | daily 03:00 UTC |
| HF discovery search (log only) | GET https://huggingface.co/api/models?search=function-calling&limit=100 | every 24 h (does not auto-add to watchlist) |

The list of repos to fetch is hardcoded in lib/watchlist.js.
Everything else about each repo is fetched live.
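
As a rough sketch of the two HuggingFace requests in the table, the following builds the URLs and tries the README on main before falling back to master. The helper names (`modelApiUrl`, `readmeUrls`, `fetchModel`) are hypothetical and are not taken from scrapers/huggingface.js. It relies on the global `fetch` that ships with Node.js 22.

```javascript
const HF_BASE = "https://huggingface.co";

// Metadata endpoint from the table above.
function modelApiUrl(repoId) {
  return `${HF_BASE}/api/models/${repoId}?full=true`;
}

// README candidates in priority order: main first, then master.
function readmeUrls(repoId) {
  return ["main", "master"].map(
    (branch) => `${HF_BASE}/${repoId}/raw/${branch}/README.md`
  );
}

// Fetch metadata plus README for one repo. A missing README is not an
// error; the caller simply gets readme === null.
async function fetchModel(repoId) {
  const metaRes = await fetch(modelApiUrl(repoId));
  if (!metaRes.ok) {
    throw new Error(`HF metadata request failed (${metaRes.status}) for ${repoId}`);
  }
  const meta = await metaRes.json();

  let readme = null;
  for (const url of readmeUrls(repoId)) {
    const res = await fetch(url);
    if (res.ok) {
      readme = await res.text();
      break;
    }
  }
  return { meta, readme };
}
```

Calling `fetchModel("some-org/some-model")` with a real repo id resolves to `{ meta, readme }`; the repo id shown here is a placeholder.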

API

All endpoints are mounted at /needle-board and require no auth.

| Method | Path | Returns |
|---|---|---|
| GET | /needle-board/health | { ok: true, service: "needle-board" } |
| GET | /needle-board/api/models | Array of every tracked model. Query params: sort (bfcl_overall, params_m, downloads, likes, size_mb_q4), dir (asc/desc), on_device=1, has_bfcl=1, license=<slug>, max_params=<int>, search=<substring> |
| GET | /needle-board/api/models/:id | Single model with full detail + matched raw BFCL row. :id is the repo id with / replaced by --. |
| GET | /needle-board/api/stats | { tracked_count, with_bfcl, on_device_count, top_by_bfcl, top_by_downloads, last_hf_refresh, last_bfcl_refresh } |
| GET | /needle-board/api/bfcl-raw | Most recent BFCL snapshot rows (transparency endpoint). |
| GET | /needle-board/api/fetch-log?limit=N | Last N fetch attempts with {source, target, status, message, duration_ms, fetched_at}. |
| GET | /needle-board/api/licenses | License → model count groups, for the filter dropdown. |
| POST | /needle-board/api/refresh | Triggers an async re-scrape. Returns { queued: true } immediately or { already_running: true }. Idempotent. Optional ?mode=hf or ?mode=bfcl to refresh just one half. |
| GET | /needle-board/card/:id | Standalone shareable HTML card (no nav, screenshot-ready, OG tags set). |
| GET | /needle-board/ | The SPA. |
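
A small client sketch for the endpoints above, assuming the server is running locally on port 4766 as in "Running locally". The helper names are hypothetical; the query parameters and the `/` to `--` id rule come from the table.

```javascript
// Assumption: local server on port 4766.
const BASE = "http://localhost:4766/needle-board";

// :id is the repo id with every "/" replaced by "--".
function toModelId(repoId) {
  return repoId.replace(/\//g, "--");
}

// Build a /api/models URL from a plain object of filters.
function modelsUrl(filters = {}) {
  const url = new URL(`${BASE}/api/models`);
  for (const [key, value] of Object.entries(filters)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Example: on-device models that have a BFCL score, best score first.
async function topOnDevice() {
  const res = await fetch(
    modelsUrl({ sort: "bfcl_overall", dir: "desc", on_device: 1, has_bfcl: 1 })
  );
  if (!res.ok) throw new Error(`needle-board returned ${res.status}`);
  return res.json();
}

console.log(modelsUrl({ sort: "downloads", dir: "desc", max_params: 600 }));
console.log(`${BASE}/api/models/${toModelId("some-org/some-model")}`);
```

The repo id `some-org/some-model` is a placeholder, not an entry from the watchlist.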

How the BFCL ↔ HuggingFace join works

BFCL row names like "Hammer2.1-0.5B-Instruct (FC)" are normalized by
dropping the parenthesized suffix, lowercasing, and removing
non-alphanumeric characters (giving hammer2105binstruct). Two candidate
variants are tried per side: the raw normalized form and the form with
trailing suffixes like -instruct, -fc, -it, -chat, -base stripped. If
a model's candidate set and a row's candidate set share any entry, they
match. When a model has both an official leaderboard score and a
self-reported one, the official score always wins.

The match is intentionally conservative: a false positive is worse
than a missed match, because a wrong BFCL score does more harm than a
"—".
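
The matching rule can be sketched as follows. This is an illustration of the description above, not the code in lib/normalize.js: the function names, the exact suffix regex, and the choice to compare only the model name (the part of the repo id after the slash) are assumptions.

```javascript
// Suffixes named in the text; the real list may be longer.
const SUFFIXES = ["instruct", "fc", "it", "chat", "base"];
const TRAILING_SUFFIXES = new RegExp(`(?:[-_](?:${SUFFIXES.join("|")}))+$`);

// Remove everything except lowercase letters and digits.
function squash(s) {
  return s.replace(/[^a-z0-9]/g, "");
}

// Candidate forms for one name: raw normalized, and suffix-stripped.
function candidates(name) {
  const base = name.replace(/\([^)]*\)/g, "").trim().toLowerCase();
  const stripped = base.replace(TRAILING_SUFFIXES, "");
  // Drop empty strings so two unusable names never match each other.
  return new Set([squash(base), squash(stripped)].filter((c) => c.length > 0));
}

// Two names match when their candidate sets intersect.
function namesMatch(a, b) {
  const other = candidates(b);
  for (const c of candidates(a)) {
    if (other.has(c)) return true;
  }
  return false;
}

console.log([...candidates("Hammer2.1-0.5B-Instruct (FC)")]);
console.log(namesMatch("Hammer2.1-0.5B-Instruct (FC)", "Hammer2.1-0.5b"));
console.log(namesMatch("Hammer2.1-0.5B-Instruct (FC)", "Hammer2.1-1.5b"));
```

Because only exact equality of a candidate counts, a 0.5B row cannot be attached to a 1.5B model of the same family, which is the kind of false positive the conservative design is meant to avoid.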

Running locally

npm install
PORT=4766 node server.js
# open http://localhost:4766/needle-board/

On first boot, when the database is empty, the server kicks off a full
refresh in the background. It takes ~30 s to pull all HuggingFace
metadata (4 parallel fetches with 250 ms jitter between batches) and
~1 s to pull the BFCL CSV. After that, refreshes are driven by
node-cron on the schedule above, or by a manual POST to
/needle-board/api/refresh.
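
One way to get "4 parallel fetches with 250 ms jitter between batches" is sketched below. It is an assumption about the shape of the orchestrator, not a copy of scrapers/index.js: the names `chunk` and `runInBatches` are hypothetical, and "jitter" is read here as a random pause of up to 250 ms.

```javascript
// Split a list into consecutive groups of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `worker` over all items, one batch at a time. Items inside a
// batch run in parallel; a random pause separates consecutive batches.
async function runInBatches(items, worker, { batchSize = 4, jitterMs = 250 } = {}) {
  const results = [];
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    // allSettled keeps one failed repo from aborting the whole refresh.
    const settled = await Promise.allSettled(batches[i].map((item) => worker(item)));
    results.push(...settled);
    if (i < batches.length - 1) {
      await sleep(Math.random() * jitterMs);
    }
  }
  return results;
}
```

With about 40 repos this gives 10 batches, so the pauses add at most a couple of seconds to the refresh.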

Set SKIP_CRON=1 to disable the in-process cron (useful for tests).
Set DB_PATH=/some/path.db to override the SQLite location.
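
The environment variables can be read roughly like this. It is a sketch only: the function names, the default port, the default database path, and the exact cron expressions are assumptions chosen to match the schedule described earlier (HF every 6 hours, BFCL daily at 03:00 UTC).

```javascript
// Resolve runtime settings from the environment.
function resolveConfig(env = process.env) {
  return {
    port: Number(env.PORT) || 4766,              // hypothetical default
    dbPath: env.DB_PATH || "./needle-board.db",  // hypothetical default
    cronEnabled: env.SKIP_CRON !== "1",
  };
}

// Describe the jobs a scheduler such as node-cron would register.
// Returns an empty list when SKIP_CRON=1.
function cronPlan(config) {
  if (!config.cronEnabled) return [];
  return [
    { job: "hf", expression: "0 */6 * * *", timezone: "UTC" },
    { job: "bfcl", expression: "0 3 * * *", timezone: "UTC" },
  ];
}

console.log(cronPlan(resolveConfig({ SKIP_CRON: "1" })));
console.log(cronPlan(resolveConfig({})));
```

Keeping the plan as plain data makes it easy to assert in tests without starting any timers.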

Stack

- Node.js 22 / Express 4
- better-sqlite3 (WAL mode)
- node-cron for scheduling
- cheerio for HTML parsing fallback
- helmet + compression
- Vanilla JS SPA in public/, no build step

What's deliberately out of scope (v1)

- No auth. Every endpoint is public.
- No user submissions, voting, comments, or accounts.
- We do not run evals ourselves. We ingest published numbers and cite
  them.
- No charts beyond the table view (no time-series, no scatter plots).
- No PNG card generation; the HTML card is screenshot-ready.

Files

needle-board/
├── server.js               Express bootstrap, mounts /needle-board, runs first refresh
├── db.js                   better-sqlite3 + WAL + idempotent schema + fetch_log helper
├── cron.js                 node-cron schedules (HF every 6h, BFCL daily 03:00 UTC)
├── routes/
│   ├── health.js           GET /health
│   ├── models.js           /api/models, /api/models/:id, /api/stats, /api/bfcl-raw, /api/fetch-log, /api/licenses
│   ├── refresh.js          POST /api/refresh
│   └── card.js             GET /card/:id — shareable card HTML
├── scrapers/
│   ├── huggingface.js      Per-model metadata + README fetch
│   ├── bfcl.js             leaderboard CSV (with HTML cheerio fallback)
│   └── index.js            Orchestrator, concurrency, retry, DB upsert
├── lib/
│   ├── watchlist.js        Curated repo ID list
│   ├── normalize.js        Name normalization for cross-referencing
│   ├── benchmarks.js       README regex extraction
│   └── derive.js           q4 size + on_device flag + quant tag parsing
└── public/                 Vanilla JS SPA (index.html, app.js, style.css, card.css)