tool-use-arena

Live LLM tool / function-calling leaderboard backed by the public Berkeley Function-Calling Leaderboard (BFCL v4). Refreshed every 6 hours from gorilla.cs.berkeley.edu. Movers panel, cost-vs-accuracy explorer, per-model history. No mocks. No auth.

Every team shipping an AI agent in May 2026 has the same question: "Which LLM actually calls my tools correctly?" Generic chat Elo ratings and code benchmarks do not predict tool-use quality. The community-standard answer is the Berkeley Function-Calling Leaderboard (BFCL), now at v4 with Web Search, Memory, Multi-Turn, and Hallucination categories.

tool-use-arena is a small self-hosted dashboard that:

Pulls the BFCL data_overall.csv every 6 hours,
Persists every snapshot in SQLite (WAL),
Surfaces a sortable / filterable leaderboard with sub-category breakdown,
Computes 7-day rank/score movers (risers + fallers),
Plots cost-vs-accuracy as a scatter with a budget calculator ("best model under $X"),
Tracks per-model history across snapshots.
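
The 7-day movers computation in the list above can be sketched as a comparison of two snapshots. This is a minimal illustration, not the project's actual code: the row shape (`slug`, `rank`, `overall_acc`) and the function name `computeMovers` are assumptions.

```javascript
// Hypothetical row shape: { slug, rank, overall_acc }. Field names are assumptions.
function computeMovers(current, previous, topN = 10) {
  const before = new Map(previous.map((row) => [row.slug, row]));
  const deltas = [];
  for (const row of current) {
    const old = before.get(row.slug);
    if (!old) continue; // models absent from the older snapshot have no baseline
    deltas.push({
      slug: row.slug,
      // Positive rankDelta means the model climbed (rank 5 -> rank 2 is +3).
      rankDelta: old.rank - row.rank,
      scoreDelta: row.overall_acc - old.overall_acc,
    });
  }
  // dir = 1 selects the largest gains, dir = -1 the largest losses.
  const top = (key, dir) =>
    deltas
      .filter((d) => dir * d[key] > 0)
      .sort((a, b) => dir * (b[key] - a[key]))
      .slice(0, topN);
  return {
    rankRisers: top('rankDelta', 1),
    rankFallers: top('rankDelta', -1),
    scoreRisers: top('scoreDelta', 1),
    scoreFallers: top('scoreDelta', -1),
  };
}
```

Models that first appear in the newer snapshot are skipped in this sketch because they have no earlier rank to compare against; a real implementation might choose to surface them separately.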

Live data sources (no mocks)

| Source | URL | Cadence |
|---|---|---|
| BFCL overall scores | https://gorilla.cs.berkeley.edu/data_overall.csv | every 6 h |
| BFCL changelog (Markdown) | https://raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/CHANGELOG.md | every 6 h, cached |

Snapshots are deduplicated by byte-identity of the raw CSV, so unchanged fetches do not bloat the database.
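
One common way to detect byte-identical payloads is to compare a content hash. The sketch below assumes a SHA-256 digest and uses a hypothetical in-memory store (`makeSnapshotStore`) in place of the SQLite table; the project may implement the comparison differently.

```javascript
const { createHash } = require('node:crypto');

// Returns a hex SHA-256 digest of the raw CSV bytes.
function csvDigest(rawCsv) {
  return createHash('sha256').update(rawCsv).digest('hex');
}

// Hypothetical in-memory stand-in for the snapshot table.
function makeSnapshotStore() {
  const seen = new Set();
  const snapshots = [];
  return {
    // Stores the snapshot only when its digest is new; returns true if stored.
    saveIfNew(rawCsv, fetchedAt = new Date().toISOString()) {
      const digest = csvDigest(rawCsv);
      if (seen.has(digest)) return false;
      seen.add(digest);
      snapshots.push({ digest, fetchedAt, rawCsv });
      return true;
    },
    count: () => snapshots.length,
  };
}
```

In a database-backed version, a unique index on the digest column would give the same guarantee without keeping a set in memory.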

Stack

Node.js 20+
Express 4
better-sqlite3 (WAL mode)
node-cron
helmet + compression
Vanilla JS SPA + Chart.js v4 from CDN
Dark theme, English labels

Run locally

npm install
cp .env.example .env
node server.js
# open http://localhost:4814/tool-use-arena/

The first snapshot is fetched 5 seconds after boot. Subsequent snapshots are scheduled via REFRESH_CRON (default `0 */6 * * *`, i.e. every 6 hours).
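
As a rough sketch of how these settings might be read at startup: `REFRESH_CRON` and `BASE_PATH` are named in this README, while `PORT`, the function name `loadConfig`, and the returned field names are assumptions. The cron default is written as `0 */6 * * *`, the standard five-field expression for minute 0 of every sixth hour, which matches the six-hour cadence described above.

```javascript
// Defaults mirror the README: port 4814, base path /tool-use-arena, six-hour cron.
// PORT as a variable name is an assumption; REFRESH_CRON and BASE_PATH come from the README.
function loadConfig(env = process.env) {
  const port = Number.parseInt(env.PORT ?? '4814', 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return {
    port,
    // Strip trailing slashes so routes can be joined as `${basePath}/api/...`.
    basePath: (env.BASE_PATH ?? '/tool-use-arena').replace(/\/+$/, ''),
    refreshCron: env.REFRESH_CRON ?? '0 */6 * * *',
    firstFetchDelayMs: 5000, // first snapshot 5 seconds after boot
  };
}
```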

Endpoints

All routes are mounted under BASE_PATH (default /tool-use-arena) and are public — no authentication.

| Method | Path | Purpose |
|---|---|---|
| GET | /health | liveness probe — { ok, lastFetch, modelCount } |
| GET | / | SPA |
| GET | /api/leaderboard | current snapshot rows, sortable / filterable |
| GET | /api/model/:slug | one model + its history across snapshots |
| GET | /api/movers?days=7 | top 10 rank/score risers + fallers vs N days ago |
| GET | /api/orgs | per-organization aggregates |
| GET | /api/licenses | open vs proprietary aggregates |
| GET | /api/budget?max_cost=20&limit=5 | best-accuracy models under cost cap |
| GET | /api/snapshots | list of every snapshot fetched |
| GET | /api/changelog | parsed BFCL changelog (cached 6 h) |
| GET | /api/stats | counts for the header strip |
| POST | /api/refresh | manual refresh, rate-limited to 1/min |
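
The 1/min limit on `POST /api/refresh` could be enforced with a small in-process gate like the one below. This is a sketch under assumptions: the name `makeRefreshGate` and the return shape are hypothetical, and the limit is treated as global rather than per client.

```javascript
// Returns a function that reports whether a refresh may run now.
// `now` is injectable so the behaviour can be exercised without waiting.
function makeRefreshGate(minIntervalMs = 60_000) {
  let lastAllowed = -Infinity;
  return function tryAcquire(now = Date.now()) {
    const elapsed = now - lastAllowed;
    if (elapsed < minIntervalMs) {
      return { allowed: false, retryAfterMs: minIntervalMs - elapsed };
    }
    lastAllowed = now;
    return { allowed: true, retryAfterMs: 0 };
  };
}
```

A route handler would call the gate first and answer with HTTP 429 (optionally setting a `Retry-After` header) when `allowed` is false.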

Query parameters for `/api/leaderboard`

org — exact organization name (e.g. Anthropic, Google, Zhipu AI)
license — open | proprietary | exact license string
sort — rank (default) | overall_acc | multi_turn_acc | web_search_acc | memory_acc | live_acc | total_cost | latency_mean
order — asc (default) | desc
limit — 1..500 (default 200)
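
The parameters above can be normalized before they reach the database. The following is a minimal sketch, assuming an Express-style `query` object; the function name `parseLeaderboardQuery` and the choice to fall back to defaults (rather than reject bad input) are assumptions.

```javascript
// Sort keys listed in this README; anything else falls back to "rank".
const SORT_KEYS = new Set([
  'rank', 'overall_acc', 'multi_turn_acc', 'web_search_acc',
  'memory_acc', 'live_acc', 'total_cost', 'latency_mean',
]);

function parseLeaderboardQuery(query = {}) {
  const sort = SORT_KEYS.has(query.sort) ? query.sort : 'rank';
  const order = query.order === 'desc' ? 'desc' : 'asc';
  const parsed = Number.parseInt(query.limit, 10);
  // Clamp to the documented 1..500 range; default to 200 when missing or invalid.
  const limit = Number.isNaN(parsed) ? 200 : Math.min(500, Math.max(1, parsed));
  const text = (v) => (typeof v === 'string' && v !== '' ? v : null);
  return { org: text(query.org), license: text(query.license), sort, order, limit };
}
```

Checking `sort` against a fixed set matters if the value ends up in an `ORDER BY` clause, since column names cannot be bound as SQL parameters.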

SPA tabs

Leaderboard — sortable table with mover badges (▲N / ▼N vs 7d).
Movers — four cards: rank risers/fallers, score risers/fallers.
Cost vs Accuracy — scatter chart + "best under budget" calculator. Click a point to open model detail.
BFCL Changelog — last 20 entries from the upstream changelog.
Model detail (#model/:slug) — sub-score breakdown across 5 categories + history chart.
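
The "best under budget" calculator reduces to a filter followed by a sort. The sketch below assumes rows shaped like `{ slug, overall_acc, total_cost }` and a hypothetical function name `bestUnderBudget`; breaking accuracy ties by lower cost is also an assumption.

```javascript
// Hypothetical row shape: { slug, overall_acc, total_cost }.
function bestUnderBudget(rows, maxCost, limit = 5) {
  return rows
    // Drop rows without a usable cost, then keep those within the cap.
    .filter((r) => Number.isFinite(r.total_cost) && r.total_cost <= maxCost)
    // Highest accuracy first; cheaper model wins a tie.
    .sort((a, b) => b.overall_acc - a.overall_acc || a.total_cost - b.total_cost)
    .slice(0, limit);
}
```

Because `filter` returns a new array, the in-place `sort` does not reorder the caller's `rows`.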

Deploy

DEPLOY_MANIFEST.json follows the RNDLAB orchestrator schema. The Mac watcher under cowork-deploy-bridge/ picks up the trigger written to queue.jsonl and runs rsync → systemd → nginx → showcase POST → thumbnail.

No secrets are required.
