
Tool Use Arena

Live LLM function-calling leaderboard with movers and cost-vs-accuracy explorer

Tags: ai, llm, tool-use, function-calling, bfcl, leaderboard, benchmarks

tool-use-arena

Live LLM tool / function-calling leaderboard backed by the public Berkeley Function-Calling Leaderboard (BFCL v4). Refreshed every 6 hours from gorilla.cs.berkeley.edu. Movers panel, cost-vs-accuracy explorer, per-model history. No mocks. No auth.

Every team shipping an AI agent in May 2026 has the same question: "Which LLM actually calls my tools correctly?" Generic chat Elo ratings and code benchmarks do not predict tool-use quality. The community-standard answer is the Berkeley Function-Calling Leaderboard (BFCL), now at v4 with Web Search, Memory, Multi-Turn, and Hallucination categories.

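tool-use-arena is a small self-hosted dashboard that snapshots the public BFCL data on a schedule and serves it as a leaderboard with a movers panel, a cost-vs-accuracy explorer, and per-model history.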

Live data sources (no mocks)

| Source | URL | Cadence |
|---|---|---|
| BFCL overall scores | https://gorilla.cs.berkeley.edu/data_overall.csv | every 6 h |
| BFCL changelog (Markdown) | https://raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/CHANGELOG.md | every 6 h, cached |

Snapshots are deduplicated by byte-identity of the raw CSV, so unchanged fetches do not bloat the database.
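
One way to get this behavior is to hash the raw CSV and skip the write when the hash matches the most recent snapshot. The sketch below is illustrative only: the in-memory `snapshots` array and the `storeIfChanged` helper are hypothetical stand-ins, and comparing SHA-256 digests in place of the raw bytes is an assumption, not necessarily what the project does.

```javascript
const crypto = require("node:crypto");

// Hypothetical in-memory stand-in for the snapshot table.
const snapshots = [];

// SHA-256 of the raw payload: byte-identical CSVs always produce the same digest.
function digest(rawCsv) {
  return crypto.createHash("sha256").update(rawCsv).digest("hex");
}

// Store the snapshot only if its bytes differ from the most recent one.
// Returns true when a new snapshot was stored, false when the fetch was a duplicate.
function storeIfChanged(rawCsv, fetchedAt = new Date()) {
  const hash = digest(rawCsv);
  const latest = snapshots[snapshots.length - 1];
  if (latest && latest.hash === hash) return false;
  snapshots.push({ hash, fetchedAt, rawCsv });
  return true;
}
```

Calling `storeIfChanged` twice with the same CSV text stores one snapshot; a single changed byte produces a second one.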

Stack

Run locally

npm install
cp .env.example .env
node server.js
# open http://localhost:4814/tool-use-arena/

The first snapshot is fetched 5 seconds after boot. Subsequent snapshots are scheduled via REFRESH_CRON (default `0 */6 * * *`, i.e. every 6 hours on the hour).
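
The timing described above can be sketched without a cron library, assuming the default schedule means "minute 0 of every sixth hour". `startScheduler` and `fetchSnapshot` are hypothetical names, and the slots are computed in UTC for simplicity. A real deployment would more likely hand the REFRESH_CRON expression to a cron library, which typically evaluates it in local time.

```javascript
const FIRST_FETCH_DELAY_MS = 5_000; // "5 seconds after boot", from the text above
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// Milliseconds from `now` (epoch ms) until the next 00:00, 06:00, 12:00 or 18:00 UTC.
// A timestamp exactly on a boundary yields a full six hours (the *next* slot).
function msUntilNextSlot(now = Date.now()) {
  return SIX_HOURS_MS - (now % SIX_HOURS_MS);
}

// Hypothetical scheduler: `fetchSnapshot` is whatever function pulls the CSV.
function startScheduler(fetchSnapshot) {
  // One early fetch so the dashboard has data shortly after boot.
  setTimeout(fetchSnapshot, FIRST_FETCH_DELAY_MS);

  // Re-arm a timer for each upcoming six-hour slot.
  const armNext = () => {
    setTimeout(() => {
      fetchSnapshot();
      armNext();
    }, msUntilNextSlot());
  };
  armNext();
}
```

Re-computing the delay before every run keeps the schedule pinned to wall-clock slots instead of drifting the way a fixed `setInterval` would.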

Endpoints

All routes are mounted under BASE_PATH (default /tool-use-arena) and are public — no authentication.

| Method | Path | Purpose |
|---|---|---|
| GET | /health | liveness probe — { ok, lastFetch, modelCount } |
| GET | / | SPA |
| GET | /api/leaderboard | current snapshot rows, sortable / filterable |
| GET | /api/model/:slug | one model + its history across snapshots |
| GET | /api/movers?days=7 | top 10 rank/score risers + fallers vs N days ago |
| GET | /api/orgs | per-organization aggregates |
| GET | /api/licenses | open vs proprietary aggregates |
| GET | /api/budget?max_cost=20&limit=5 | best-accuracy models under cost cap |
| GET | /api/snapshots | list of every snapshot fetched |
| GET | /api/changelog | parsed BFCL changelog (cached 6 h) |
| GET | /api/stats | counts for the header strip |
| POST | /api/refresh | manual refresh, rate-limited to 1/min |
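
As an illustration of what /api/budget?max_cost=20&limit=5 plausibly computes, the sketch below keeps rows at or under the cost cap and returns the most accurate ones. The row shape (`model`, `accuracy`, `cost`), the tie-break on lower cost, and the sample rows are assumptions for the example, not the project's actual schema or data.

```javascript
// Hypothetical row shape: { model, accuracy, cost }.
// Rows with a missing or non-numeric cost are excluded, since they cannot
// be compared against the cap.
function bestUnderBudget(rows, maxCost, limit = 5) {
  return rows
    .filter((r) => Number.isFinite(r.cost) && r.cost <= maxCost)
    // Highest accuracy first; on a tie, prefer the cheaper model.
    .sort((a, b) => b.accuracy - a.accuracy || a.cost - b.cost)
    .slice(0, limit);
}

// Made-up rows, purely illustrative.
const sampleRows = [
  { model: "model-a", accuracy: 70, cost: 30 },
  { model: "model-b", accuracy: 65, cost: 12 },
  { model: "model-c", accuracy: 65, cost: 8 },
  { model: "model-d", accuracy: 50, cost: 2 },
  { model: "model-e", accuracy: 80, cost: null },
];

const picks = bestUnderBudget(sampleRows, 20, 2);
console.log(picks.map((r) => r.model));
```

Here `model-a` is over the cap and `model-e` has no cost, so neither is eligible even though both score higher than the returned rows.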

Query parameters for /api/leaderboard

SPA tabs

  1. Leaderboard — sortable table with mover badges (▲N / ▼N vs 7d).
  2. Movers — four cards: rank risers/fallers, score risers/fallers.
  3. Cost vs Accuracy — scatter chart + "best under budget" calculator. Click a point to open model detail.
  4. BFCL Changelog — last 20 entries from the upstream changelog.
  5. Model detail (#model/:slug) — sub-score breakdown across 5 categories + history chart.
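
The four mover cards can be derived from two snapshots: the current one and a baseline from N days ago. The following is a minimal sketch under assumed field names (`slug`, `rank`, `score`). Picking the baseline snapshot is left out, and `computeMovers` is a hypothetical helper, not the project's code.

```javascript
// current / previous: arrays of hypothetical rows { slug, rank, score }.
// Returns up to `topN` entries for each of the four mover cards.
function computeMovers(current, previous, topN = 10) {
  const before = new Map(previous.map((r) => [r.slug, r]));
  const deltas = [];

  for (const row of current) {
    const old = before.get(row.slug);
    if (!old) continue; // newly listed models have no baseline to compare against
    deltas.push({
      slug: row.slug,
      rankDelta: old.rank - row.rank, // positive = climbed the table
      scoreDelta: row.score - old.score, // positive = score improved
    });
  }

  // dir = 1 selects the largest positive changes, dir = -1 the largest negative ones.
  const top = (key, dir) =>
    deltas
      .filter((d) => dir * d[key] > 0)
      .sort((a, b) => dir * (b[key] - a[key]))
      .slice(0, topN);

  return {
    rankRisers: top("rankDelta", 1),
    rankFallers: top("rankDelta", -1),
    scoreRisers: top("scoreDelta", 1),
    scoreFallers: top("scoreDelta", -1),
  };
}
```

The same `rankDelta` value could feed the ▲N / ▼N badges on the Leaderboard tab, with unchanged models (delta of 0) showing no badge.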

Deploy

DEPLOY_MANIFEST.json follows the RNDLAB orchestrator schema. The Mac watcher under cowork-deploy-bridge/ picks up the trigger written to queue.jsonl and runs rsync → systemd → nginx → showcase POST → thumbnail.

No secrets are required.

License

MIT. Upstream data © UC Berkeley Sky Computing Lab, used under their public-data terms (see code & data).