needle-board
The public leaderboard of sub-1B function-calling models — the slice
of the LLM universe that fits on phones, smart glasses, and background
agents. Berkeley's BFCL is dominated by frontier models; needle-board
filters down to the tiny models that builders shipping on-device AI
actually evaluate.
Every score is fetched live from a real source. No mocks. No synthesized
numbers. If a model has no published BFCL score and no self-reported
score in its model card, the BFCL column renders "—" rather than a guess.
What it does
- Tracks ~40 curated HuggingFace repositories of sub-1B function-calling
  specialists (plus a few larger reference models).
- Pulls model metadata (params, downloads, license, tags, README) from
  the HuggingFace public API every 6 hours.
- Pulls the Berkeley Function-Calling Leaderboard (BFCL v4) CSV daily
  at 03:00 UTC and cross-references it against the watchlist by
  normalized model name.
- Extracts self-reported BFCL / function-calling accuracy numbers from
  each model's README with conservative regexes, flagged as
  "self-reported" in the UI so readers can tell the difference.
- Surfaces derived columns: estimated q4_K_M size (params × 0.55) and an
  on_device_ok flag (q4 size ≤ 1 GB).
- Renders a sortable, filterable leaderboard plus a per-model detail
  drawer and a screenshot-ready share card.
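The derived columns come down to a few lines of arithmetic. A minimal sketch, assuming parameter counts are stored in millions (as the params_m sort key suggests), that the estimate is therefore in megabytes, and that "1 GB" means 1024 MB. The function and constant names are illustrative and may not match what lib/derive.js exports:

```javascript
// Hypothetical helper; only the 0.55 factor and the 1 GB cutoff come from this README.
const Q4_FACTOR = 0.55;          // estimated q4_K_M size = params × 0.55
const ON_DEVICE_LIMIT_MB = 1024; // assumption: "1 GB" read as 1024 MB

function deriveColumns(paramsM) {
  if (!Number.isFinite(paramsM) || paramsM <= 0) {
    // Unknown parameter count: leave the derived columns empty instead of guessing.
    return { size_mb_q4: null, on_device_ok: null };
  }
  // Round to one decimal place so the table shows stable values.
  const sizeMb = Math.round(paramsM * Q4_FACTOR * 10) / 10;
  return { size_mb_q4: sizeMb, on_device_ok: sizeMb <= ON_DEVICE_LIMIT_MB };
}
```

Under these assumptions a 500M-parameter model is estimated at 275 MB and passes the on-device check, while a 3B reference model does not.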
Data sources
| Source | URL | Frequency |
|---|---|---|
| HuggingFace model metadata | GET https://huggingface.co/api/models/{repo_id}?full=true | every 6 h per model |
| HuggingFace README | GET https://huggingface.co/{repo_id}/raw/main/README.md (fallback to /master/) | once per HF refresh |
| Berkeley BFCL leaderboard | GET https://gorilla.cs.berkeley.edu/data_overall.csv (canonical CSV the leaderboard HTML page fetches at runtime) | daily 03:00 UTC |
| HF discovery search (log only) | GET https://huggingface.co/api/models?search=function-calling&limit=100 | every 24 h (does not auto-add to watchlist) |
The watchlist of which repos to fetch is hardcoded in lib/watchlist.js. Everything else about each repo is fetched live.
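The README row's branch fallback can be sketched as follows. This is an illustration, not the code in scrapers/huggingface.js: the function names are hypothetical, and the fetch implementation is passed in as a parameter only so the fallback can be exercised without network access.

```javascript
// Sketch only: URLs follow the table above; helper names are assumptions.
const HF_BASE = "https://huggingface.co";

function metadataUrl(repoId) {
  return `${HF_BASE}/api/models/${repoId}?full=true`;
}

async function fetchReadme(repoId, fetchImpl = globalThis.fetch) {
  // Try the main branch first, then fall back to master.
  for (const branch of ["main", "master"]) {
    const res = await fetchImpl(`${HF_BASE}/${repoId}/raw/${branch}/README.md`);
    if (res.ok) return { branch, text: await res.text() };
  }
  return null; // no README on either branch
}
```

Returning null instead of throwing keeps a missing README from failing the whole per-model refresh; the model simply ends up with no self-reported score.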
API
All endpoints are mounted at /needle-board and require no auth.
| Method | Path | Returns |
|---|---|---|
| GET | /needle-board/health | { ok: true, service: "needle-board" } |
| GET | /needle-board/api/models | Array of every tracked model. Query params: sort (bfcl_overall, params_m, downloads, likes, size_mb_q4), dir (asc/desc), on_device=1, has_bfcl=1, license=<slug>, max_params=<int>, search=<substring> |
| GET | /needle-board/api/models/:id | Single model with full detail + matched raw BFCL row. :id is the repo id with / replaced by --. |
| GET | /needle-board/api/stats | { tracked_count, with_bfcl, on_device_count, top_by_bfcl, top_by_downloads, last_hf_refresh, last_bfcl_refresh } |
| GET | /needle-board/api/bfcl-raw | Most recent BFCL snapshot rows (transparency endpoint). |
| GET | /needle-board/api/fetch-log?limit=N | Last N fetch attempts with {source, target, status, message, duration_ms, fetched_at}. |
| GET | /needle-board/api/licenses | License → model count groups, for the filter dropdown. |
| POST | /needle-board/api/refresh | Triggers an async re-scrape. Returns { queued: true } immediately or { already_running: true }. Idempotent. Optional ?mode=hf or ?mode=bfcl to refresh just one half. |
| GET | /needle-board/card/:id | Standalone shareable HTML card (no nav, screenshot-ready, OG tags set). |
| GET | /needle-board/ | The SPA. |
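A client needs two small conventions from the table: the `/` to `--` encoding for :id, and the query parameters on /api/models. A minimal sketch, assuming a local instance on port 4766; the helper names and the repo id some-org/some-model are placeholders:

```javascript
// Hypothetical client-side helpers; only the paths and parameter names come from the table above.
function toModelId(repoId) {
  // "org/name" -> "org--name"
  return repoId.replaceAll("/", "--");
}

function modelsUrl(base, filters = {}) {
  const url = new URL("/needle-board/api/models", base);
  for (const [key, value] of Object.entries(filters)) {
    // Skip unset filters so they do not appear as empty query parameters.
    if (value !== undefined && value !== null) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Example: on-device models that have a BFCL score, best first.
const listUrl = modelsUrl("http://localhost:4766", {
  sort: "bfcl_overall",
  dir: "desc",
  on_device: 1,
  has_bfcl: 1,
});

// Example: detail URL for a placeholder repo id.
const detailUrl = `http://localhost:4766/needle-board/api/models/${toModelId("some-org/some-model")}`;
```

Either URL can be passed straight to fetch() or curl, since no endpoint requires auth.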
How the BFCL ↔ HuggingFace join works
BFCL row names like "Hammer2.1-0.5B-Instruct (FC)" are normalized by
stripping parenthesized tags, lowercasing, and removing non-alphanumeric
characters (hammer2105binstruct). Two candidate variants are tried per
side: the raw normalized form and the form with trailing suffixes such as
-instruct, -fc, -it, -chat, and -base stripped. If any candidate appears
in both a model's set and a row's set, they match. When both exist, the
official leaderboard score always beats a self-reported number.
This is intentionally a conservative fuzzy match — false positives are
worse than missing matches, since a wrong BFCL score is more harmful
than a —.
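The join described above can be sketched as a candidate-set intersection. This is an illustration under assumptions, not the contents of lib/normalize.js: the function names are hypothetical, suffixes are stripped before punctuation is removed, and only the part of the repo id after the org prefix takes part in the match.

```javascript
// Sketch of the candidate-set match; names and exact ordering of steps are assumptions.
const SUFFIX_RE = /-(instruct|fc|it|chat|base)$/;

function candidates(name) {
  // Drop parenthesized tags such as "(FC)", then lowercase.
  const base = name.replace(/\([^)]*\)/g, "").trim().toLowerCase();
  // Strip trailing suffixes repeatedly, e.g. "-instruct-fc".
  let stripped = base;
  while (SUFFIX_RE.test(stripped)) stripped = stripped.replace(SUFFIX_RE, "");
  const squash = (s) => s.replace(/[^a-z0-9]/g, "");
  return new Set([squash(base), squash(stripped)]);
}

function isMatch(repoId, bfclRowName) {
  const modelName = repoId.split("/").pop(); // ignore the org prefix
  const rowSet = candidates(bfclRowName);
  for (const c of candidates(modelName)) {
    if (c && rowSet.has(c)) return true; // any shared candidate counts as a match
  }
  return false;
}

function resolveBfcl(officialScore, selfReportedScore) {
  // The official leaderboard wins; a self-reported number is a flagged fallback.
  if (officialScore != null) return { value: officialScore, source: "bfcl" };
  if (selfReportedScore != null) return { value: selfReportedScore, source: "self-reported" };
  return { value: null, source: null }; // rendered as "—"
}
```

With this sketch, a repo named Hammer2.1-0.5b matches the row "Hammer2.1-0.5B-Instruct (FC)" through the shared candidate hammer2105b, while Hammer2.1-1.5b does not.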
Running locally
npm install
PORT=4766 node server.js
# open http://localhost:4766/needle-board/
On first boot, when the database is empty, the server kicks off a full
refresh in the background. It takes ~30 s to pull all HuggingFace
metadata (4 parallel fetches with 250 ms jitter between batches) and
~1 s to pull the BFCL CSV. After that, refreshes are driven by
node-cron on the schedule above, or by a manual POST /api/refresh.
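The batched HuggingFace pull can be sketched as below. It assumes "250 ms jitter" means a random pause of up to 250 ms between batches; refreshInBatches and fetchOne are hypothetical names, and the real orchestrator in scrapers/index.js also handles retries and DB upserts, which are omitted here.

```javascript
// Sketch: process repos in batches of `concurrency`, pausing with random jitter between batches.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function refreshInBatches(repoIds, fetchOne, { concurrency = 4, jitterMs = 250 } = {}) {
  const results = [];
  for (let i = 0; i < repoIds.length; i += concurrency) {
    const batch = repoIds.slice(i, i + concurrency);
    // allSettled: one failing repo should not abort the rest of the refresh.
    const settled = await Promise.allSettled(batch.map((id) => fetchOne(id)));
    settled.forEach((outcome, j) => results.push({ repoId: batch[j], ...outcome }));
    const isLastBatch = i + concurrency >= repoIds.length;
    if (!isLastBatch) await sleep(Math.random() * jitterMs);
  }
  return results;
}
```

Each result carries the repo id plus either a fulfilled value or a rejection reason, which maps naturally onto one fetch-log row per attempt.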
Set SKIP_CRON=1 to disable the in-process cron (useful for tests).
Set DB_PATH=/some/path.db to override the SQLite location.
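How these two variables might feed the scheduler is sketched below. The buildConfig name, the returned shape, and the exact cron expressions are assumptions chosen to match the schedule described earlier (HF every 6 hours, BFCL daily at 03:00 UTC); the real cron.js may differ.

```javascript
// Sketch: derive runtime settings from the environment (names are illustrative).
function buildConfig(env = process.env) {
  const cronEnabled = env.SKIP_CRON !== "1";
  return {
    // null means "use the server's built-in default location", which this README does not specify.
    dbPath: env.DB_PATH ?? null,
    schedules: cronEnabled
      ? [
          { job: "hf", expression: "0 */6 * * *", timezone: "UTC" }, // every 6 hours
          { job: "bfcl", expression: "0 3 * * *", timezone: "UTC" }, // daily at 03:00 UTC
        ]
      : [], // SKIP_CRON=1: no in-process schedules, e.g. under test
  };
}

// Each entry could then be registered with node-cron, for example:
//   cron.schedule(s.expression, runJob, { timezone: s.timezone });
// where runJob is a hypothetical function that triggers the matching refresh.
```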
Stack
- Node.js 22 / Express 4
- better-sqlite3 (WAL mode)
- node-cron for scheduling
- cheerio for HTML parsing fallback
- helmet + compression
- Vanilla JS SPA in public/ (no build step)
What's deliberately out of scope (v1)
- No auth. Every endpoint is public.
- No user submissions, voting, comments, or accounts.
- We do not run evals ourselves. We ingest published numbers and cite
  them.
- No charts beyond the table view (no time-series, no scatter plots).
- No PNG card generation; the HTML card is screenshot-ready.
Files
needle-board/
├── server.js            Express bootstrap, mounts /needle-board, runs first refresh
├── db.js                better-sqlite3 + WAL + idempotent schema + fetch_log helper
├── cron.js              node-cron schedules (HF every 6h, BFCL daily 03:00 UTC)
├── routes/
│   ├── health.js        GET /health
│   ├── models.js        /api/models, /api/models/:id, /api/stats, /api/bfcl-raw, /api/fetch-log, /api/licenses
│   ├── refresh.js       POST /api/refresh
│   └── card.js          GET /card/:id (shareable card HTML)
├── scrapers/
│   ├── huggingface.js   Per-model metadata + README fetch
│   ├── bfcl.js          Leaderboard CSV (with HTML cheerio fallback)
│   └── index.js         Orchestrator, concurrency, retry, DB upsert
├── lib/
│   ├── watchlist.js     Curated repo ID list
│   ├── normalize.js     Name normalization for cross-referencing
│   ├── benchmarks.js    README regex extraction
│   └── derive.js        q4 size + on_device flag + quant tag parsing
└── public/              Vanilla JS SPA (index.html, app.js, style.css, card.css)