
Crossover Clock

Every AI benchmark vs human baseline — has the model crossed yet?

crossover-clock

Every AI benchmark, every human baseline, the gap, and the clock.

A live, public dashboard that tracks every major AI benchmark and answers one
question for each: has SOTA crossed the human baseline yet — and if not, when
will it?
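
The "when will it?" part can be illustrated with a simple trend extrapolation. This README does not say how the dashboard computes its ETA, so the sketch below is only one possible approach: an ordinary least-squares line through past snapshots, solved for the time at which it reaches the baseline. The function name `estimateCrossover` and the snapshot shape `{ t, score }` are assumptions made for illustration.

```javascript
// Hypothetical helper, not the project's actual ETA method.
// Assumed snapshot shape: { t: <ms since epoch>, score: <percent> },
// ordered oldest to newest.
function estimateCrossover(snapshots, baseline) {
  const none = { crossed: false, eta: null };
  if (snapshots.length === 0) return none;

  const latest = snapshots[snapshots.length - 1];
  if (latest.score >= baseline) return { crossed: true, eta: null };
  if (snapshots.length < 2) return none; // one point gives no trend

  // Ordinary least-squares fit of score against time.
  const n = snapshots.length;
  const meanT = snapshots.reduce((sum, p) => sum + p.t, 0) / n;
  const meanS = snapshots.reduce((sum, p) => sum + p.score, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of snapshots) {
    num += (p.t - meanT) * (p.score - meanS);
    den += (p.t - meanT) ** 2;
  }
  if (den === 0) return none; // all snapshots share one timestamp

  const slope = num / den;
  if (slope <= 0) return none; // flat or regressing: no crossover predicted

  // Solve baseline = intercept + slope * t for t.
  const intercept = meanS - slope * meanT;
  const etaMs = (baseline - intercept) / slope;
  if (etaMs <= latest.t) return none; // fitted line already passed the baseline

  return { crossed: false, eta: new Date(etaMs) };
}
```

A straight line is a deliberately crude model: benchmark progress tends to arrive in jumps, so an ETA from a fit like this is better read as a rough indicator than a forecast.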

What it tracks

Ten benchmarks at launch, across five categories:

| Benchmark | Category | Human baseline | Source |
|----------------------|---------------|----------------|------------------------------------------------------------------|
| OSWorld | computer-use | 72.36% | Xie et al. NeurIPS 2024 §4.2 |
| SWE-Bench Verified | code | 50.0% | OpenAI Verified release (Aug 2024) |
| SWE-Bench Pro | code | 50.0% | Scale AI Pro release (Sep 2025) |
| ARC-AGI-1 | reasoning | 80.0% | arcprize.org panel |
| ARC-AGI-2 | reasoning | 66.0% | ARC-AGI-2 paper (May 2025) |
| Humanity's Last Exam | knowledge | 65.0% | Phan et al. (CAIS/Scale, Jan 2025) |
| GPQA Diamond | knowledge | 65.0% | Rein et al. 2023 |
| MMLU | knowledge | 89.8% | Hendrycks et al. 2021 |
| FrontierMath | math | 75.0% | Epoch AI (Nov 2024) |
| AIME 2025 | math | ~5/15 (~33%) | AoPS distribution / USAMO qualifying cut |

Honest data policy

There are exactly two classes of numbers on this page, both labeled in the UI:

  1. Static, published human baselines (in lib/baselines.js). These are
     peer-reviewed or arxiv-paper values from the cited source. They never
     change at runtime. Each entry carries value, source_url, and note.
  2. Live SOTA scores (in fetchers/*.js). Every score is fetched from the
     canonical public leaderboard for that benchmark on a 6-hour cron and on
     startup. If a fetcher hits a 403, timeout, or parse error, it throws,
     the failure is logged to the fetch_log table, and **no snapshot row is
     written**. There is no synthesis, no Math.random() jitter, no seeded
     fallback. The corresponding card simply shows the previous snapshot until
     the next successful fetch.

No other data class exists in this project.
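
The policy above can be sketched in a few lines. The contents of lib/baselines.js and fetchers/*.js are not shown in this README, so everything below is an assumption made for illustration: the `runFetcher` and `currentSnapshot` names, the in-memory arrays standing in for the fetch_log and snapshot tables, and the placeholder `source_url`. Only the three baseline fields and the fail-without-writing behavior come from the text.

```javascript
// Assumed shape of one static baseline entry (value, source_url, note).
const baselines = {
  osworld: {
    value: 72.36,
    source_url: 'https://example.invalid/osworld', // placeholder, not the real URL
    note: 'Xie et al. NeurIPS 2024 §4.2',
  },
};

// In-memory stand-ins for the fetch_log and snapshot tables.
const fetchLog = [];
const snapshots = [];

// Runs one fetcher. On success, writes a snapshot row and an "ok" log row.
// On any failure, writes only an "error" log row: no snapshot, no fallback.
async function runFetcher(slug, fetcher) {
  const at = new Date().toISOString();
  try {
    const score = await fetcher(); // expected to throw on 403, timeout, parse error
    if (!Number.isFinite(score)) {
      throw new Error('parse error: score is not a finite number');
    }
    snapshots.push({ slug, score, fetched_at: at });
    fetchLog.push({ slug, status: 'ok', at });
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    fetchLog.push({ slug, status: 'error', error: message, at });
    return false;
  }
}

// What a card would display: the latest successful snapshot, or undefined
// if no fetch has succeeded yet.
function currentSnapshot(slug) {
  return snapshots.filter((s) => s.slug === slug).at(-1);
}
```

Because a failed run never touches `snapshots`, `currentSnapshot` keeps returning the last good value until a later fetch succeeds, which matches the behavior described for the cards.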

Live data sources (refreshed every 6h)

API

All endpoints are public, mounted at /crossover-clock:

| Method | Path | Description |
|--------|-------------------------------|------------------------------------------------------|
| GET | / | SPA |
| GET | /health | { status, uptime, benchmarks_count, last_fetch_at } |
| GET | /api/benchmarks | Array of every benchmark + current snapshot + ETA |
| GET | /api/benchmark/:slug | Single benchmark + 200-snapshot history |
| GET | /api/history/:slug?limit=N | Latest N snapshots |
| GET | /api/crossovers | All crossover_events |
| GET | /api/fetch-log?limit=N | Last N fetcher attempts (ok / error) |
| POST | /api/refresh | Trigger all fetchers (rate-limited to one in-flight) |
| POST | /api/refresh/:slug | Trigger one fetcher |

No body or auth required on any of these.
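
A minimal client sketch for the endpoints above, using the `fetch` built into Node 18 and later. It assumes the local dev server is running on the port shown under Local dev. The helper names `getJson` and `triggerRefresh` are hypothetical, and since the response fields are not documented here, the helpers return the parsed JSON or the HTTP status without interpreting them.

```javascript
// Assumed base URL: the local dev address from this README.
const BASE = 'http://localhost:4891/crossover-clock';

// GET a JSON endpoint such as /api/benchmarks or /api/fetch-log?limit=20.
// fetchImpl is injectable so the helper can be exercised without a server.
async function getJson(path, fetchImpl = fetch) {
  const res = await fetchImpl(`${BASE}${path}`);
  if (!res.ok) throw new Error(`GET ${path} failed with HTTP ${res.status}`);
  return res.json();
}

// POST /api/refresh (all fetchers) or /api/refresh/:slug (one fetcher).
// No body or auth header is sent, matching the endpoint table.
async function triggerRefresh(slug, fetchImpl = fetch) {
  const path = slug
    ? `/api/refresh/${encodeURIComponent(slug)}`
    : '/api/refresh';
  const res = await fetchImpl(`${BASE}${path}`, { method: 'POST' });
  if (!res.ok) throw new Error(`POST ${path} failed with HTTP ${res.status}`);
  return res.status;
}
```

For example, `getJson('/api/benchmarks').then((list) => console.log(list.length))` would print how many benchmarks the server reports, and `triggerRefresh('osworld')` would request a refresh of one fetcher. The slug `osworld` is a guess; the actual slugs are not listed in this README.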

Local dev

npm install
cp .env.example .env
node server.js
# → http://localhost:4891/crossover-clock/

better-sqlite3 needs a native build step. The Cowork sandbox can't reach
nodejs.org for prebuilt binaries, so npm install may print warnings there. On
Linux arm64 and macOS arm64 production hosts, the prebuilt binary downloads
successfully.

Stack

Node 20+, Express 4, better-sqlite3 (WAL), node-cron, helmet, compression,
cheerio. Vanilla JS SPA. Dark theme.

License

MIT