tool-use-arena
Live LLM tool / function-calling leaderboard backed by the public Berkeley Function-Calling Leaderboard (BFCL v4).
Refreshed every 6 hours from gorilla.cs.berkeley.edu. Movers panel, cost-vs-accuracy explorer, per-model history. No mocks. No auth.
Every team shipping an AI agent in May 2026 has the same question: "Which LLM actually calls my tools correctly?" Generic chat ELOs and code benchmarks do not predict tool-use quality. The community-standard answer is the Berkeley Function-Calling Leaderboard (BFCL) — now at v4 with Web Search, Memory, Multi-Turn, and Hallucination categories.
tool-use-arena is a small self-hosted dashboard that:
- Pulls the BFCL data_overall.csv every 6 hours,
- Persists every snapshot in SQLite (WAL),
- Surfaces a sortable / filterable leaderboard with sub-category breakdown,
- Computes 7-day rank/score movers (risers + fallers),
- Plots cost-vs-accuracy as a scatter with a budget calculator ("best model under $X"),
- Tracks per-model history across snapshots.
Live data sources (no mocks)
| Source | URL | Cadence |
|---|---|---|
| BFCL overall scores | https://gorilla.cs.berkeley.edu/data_overall.csv | every 6 h |
| BFCL changelog (Markdown) | https://raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/CHANGELOG.md | every 6 h, cached |
Snapshots are deduplicated by byte-identity of the raw CSV, so unchanged fetches do not bloat the database.
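One way to sketch the deduplication described above is to compare a SHA-256 digest of the raw CSV bytes against digests already stored. Hashing is an assumption here (the project may compare bytes directly), and `digestOf`, `storeIfNew`, and the in-memory `Set` are hypothetical names standing in for the real SQLite-backed logic.

```javascript
const crypto = require('node:crypto');

// Hypothetical in-memory stand-in for the snapshots table.
// A real deployment would keep the digest next to the snapshot row in SQLite.
const seenDigests = new Set();

// Hash the raw bytes, so any byte-level change yields a different digest.
function digestOf(rawCsv) {
  return crypto.createHash('sha256').update(rawCsv).digest('hex');
}

// Returns true if the snapshot was stored, false if it duplicates an earlier one.
function storeIfNew(rawCsv, store = seenDigests) {
  const digest = digestOf(rawCsv);
  if (store.has(digest)) return false; // unchanged fetch, nothing to persist
  store.add(digest);
  return true;
}
```

With a digest column under a unique index, the same check could be pushed into the database instead of application code.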
Stack
- Node.js 20+
- Express 4
- better-sqlite3 (WAL mode)
- node-cron
- helmet + compression
- Vanilla JS SPA + Chart.js v4 from CDN
- Dark theme, English labels
Run locally
npm install
cp .env.example .env
node server.js
# open http://localhost:4814/tool-use-arena/
The first snapshot is fetched 5 seconds after boot. Subsequent snapshots are scheduled via REFRESH_CRON (default 0 */6 * * *, i.e. minute 0 of every sixth hour).
Endpoints
All routes are mounted under BASE_PATH (default /tool-use-arena) and are public — no authentication.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | liveness probe — { ok, lastFetch, modelCount } |
| GET | / | SPA |
| GET | /api/leaderboard | current snapshot rows, sortable / filterable |
| GET | /api/model/:slug | one model + its history across snapshots |
| GET | /api/movers?days=7 | top 10 rank/score risers + fallers vs N days ago |
| GET | /api/orgs | per-organization aggregates |
| GET | /api/licenses | open vs proprietary aggregates |
| GET | /api/budget?max_cost=20&limit=5 | best-accuracy models under cost cap |
| GET | /api/snapshots | list of every snapshot fetched |
| GET | /api/changelog | parsed BFCL changelog (cached 6 h) |
| GET | /api/stats | counts for the header strip |
| POST | /api/refresh | manual refresh, rate-limited to 1/min |
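As an illustration of what /api/budget might compute, the sketch below filters rows by a cost cap and returns the most accurate ones. The field names total_cost and overall_acc come from the sort keys listed in this README. The function name `bestUnderBudget`, the tie-break on lower cost, and the default limit of 5 (borrowed from the example query) are assumptions.

```javascript
// Return up to `limit` rows whose total_cost is within maxCost,
// ordered by overall_acc descending. Ties go to the cheaper model (assumed rule).
function bestUnderBudget(rows, maxCost, limit = 5) {
  return rows
    // Rows with a missing or non-numeric cost are excluded rather than treated as free.
    .filter((r) => Number.isFinite(r.total_cost) && r.total_cost <= maxCost)
    .sort((a, b) => b.overall_acc - a.overall_acc || a.total_cost - b.total_cost)
    .slice(0, limit);
}

// Example usage with made-up rows (not real BFCL data).
const sampleRows = [
  { model: 'a', overall_acc: 80, total_cost: 30 },
  { model: 'b', overall_acc: 70, total_cost: 10 },
  { model: 'c', overall_acc: 75, total_cost: 20 },
];
console.log(bestUnderBudget(sampleRows, 20, 2).map((r) => r.model));
```

Because `filter` returns a new array, the in-place `sort` does not reorder the caller's rows.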
Query parameters for /api/leaderboard
- org: exact organization name (e.g. Anthropic, Google, Zhipu AI)
- license: open | proprietary | exact license string
- sort: rank (default) | overall_acc | multi_turn_acc | web_search_acc | memory_acc | live_acc | total_cost | latency_mean
- order: asc (default) | desc
- limit: 1..500 (default 200)
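A minimal sketch of how these parameters could be normalized before querying, assuming `query` is a plain object of strings such as Express's `req.query`. The function name `parseLeaderboardQuery` is hypothetical, and silently falling back to defaults on invalid input (rather than answering 400) is an assumption about the server's behavior.

```javascript
// Whitelist of sort keys taken from the parameter list above.
const SORT_KEYS = new Set([
  'rank', 'overall_acc', 'multi_turn_acc', 'web_search_acc',
  'memory_acc', 'live_acc', 'total_cost', 'latency_mean',
]);

function parseLeaderboardQuery(query) {
  // Only whitelisted column names pass, so the value is safe to use in ORDER BY.
  const sort = SORT_KEYS.has(query.sort) ? query.sort : 'rank';
  const order = query.order === 'desc' ? 'desc' : 'asc';
  // Clamp limit into 1..500; a missing or non-numeric value falls back to 200.
  const n = Number.parseInt(query.limit, 10);
  const limit = Number.isNaN(n) ? 200 : Math.min(500, Math.max(1, n));
  return {
    org: query.org || null,         // null means "no org filter"
    license: query.license || null, // open | proprietary | exact license string
    sort,
    order,
    limit,
  };
}
```

Checking `sort` against a fixed set matters because SQL placeholders cannot bind column names, so the value would otherwise have to be interpolated into the statement.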
SPA tabs
- Leaderboard — sortable table with mover badges (▲N / ▼N vs 7d).
- Movers — four cards: rank risers/fallers, score risers/fallers.
- Cost vs Accuracy — scatter chart + "best under budget" calculator. Click a point to open model detail.
- BFCL Changelog — last 20 entries from the upstream changelog.
- Model detail (#model/:slug): sub-score breakdown across 5 categories + history chart.
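The four Movers cards could be derived by diffing the current snapshot against a baseline one, roughly as sketched below. The row shape `{ model, rank, overall_acc }` and the name `computeMovers` are assumptions. Skipping models absent from the baseline is also an assumed rule, not something this README specifies.

```javascript
// Compare two snapshots and return the top risers and fallers by rank and by score.
function computeMovers(current, previous, topN = 10) {
  const prevByModel = new Map(previous.map((row) => [row.model, row]));
  const deltas = [];
  for (const row of current) {
    const prev = prevByModel.get(row.model);
    if (!prev) continue; // new since the baseline, so there is no delta to report
    deltas.push({
      model: row.model,
      // A lower rank number is better, so a positive rankDelta means the model climbed.
      rankDelta: prev.rank - row.rank,
      scoreDelta: row.overall_acc - prev.overall_acc,
    });
  }
  // dir = 1 selects positive deltas, largest first; dir = -1 selects negative ones, most negative first.
  const top = (key, dir) =>
    deltas
      .filter((d) => dir * d[key] > 0)
      .sort((a, b) => dir * (b[key] - a[key]))
      .slice(0, topN);
  return {
    rankRisers: top('rankDelta', 1),
    rankFallers: top('rankDelta', -1),
    scoreRisers: top('scoreDelta', 1),
    scoreFallers: top('scoreDelta', -1),
  };
}
```

Choosing the baseline (for example, the newest snapshot that is at least 7 days old) is left to the caller in this sketch.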
Deploy
DEPLOY_MANIFEST.json follows the RNDLAB orchestrator schema. The Mac watcher under cowork-deploy-bridge/ picks up the trigger written to queue.jsonl and runs rsync → systemd → nginx → showcase POST → thumbnail.
No secrets are required.
License
MIT. Upstream data © UC Berkeley Sky Computing Lab, used under their public-data terms (see code & data).