
CTF AI-Trivial Score

How fast Claude one-shots your CTF — a public leaderboard for AI-resistant puzzles.

Tags: dev-tools, ctf, ai-agents, security, claude, benchmark, leaderboard

ctf-ai-solvable

Score CTF challenges by how easily a frontier Claude agent one-shots them — so organizers can design AI-resistant puzzles and recruiters know which scoreboards still measure humans.

Submit a CTF challenge (a statement plus optional file attachments). The server runs Claude in a server-side code_execution_20250522 sandbox loop, looks for a flag, and emits an AI-Trivial Score (0–100, higher = easier for AI) plus a public share card you can drop into Slack/Discord/X.

The board also ingests real upcoming CTFTime events and a real index of CTF writeups harvested from the public ctfs/write-ups-* GitHub repos, so any historical challenge can be sent to the agent with a one-click "Run AI on this" action.

---

What it does

Data sources (all real, fetched at runtime)

| Source | URL | Refresh |
|---|---|---|
| CTFTime events API | https://ctftime.org/api/v1/events/?limit=50&start=<now>&finish=<now+120d> | daily 06:00 UTC + on first boot |
| ctfs/write-ups GitHub trees | https://api.github.com/repos/ctfs/write-ups-{2024,2017,2016,2015}/git/trees/master?recursive=1 — falls back through the list until one succeeds | daily 06:15 UTC + on first boot |
| Writeup README contents | https://raw.githubusercontent.com/<repo>/master/<path> | on demand, cached forever in writeup_contents |
| Anthropic Messages API | https://api.anthropic.com/v1/messages with tools:[{type:"code_execution_20250522"}] and betas:[code-execution-2025-05-22, files-api-2025-04-14] | every submit |
| Anthropic Files API | https://api.anthropic.com/v1/files | every submit that has attachments (file_ids cached on the challenge row) |
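
As a rough illustration of the first two rows, the sketch below builds the CTFTime window URL and walks the write-ups repos in fallback order. The function names and the injected `fetchJson` helper are hypothetical and not taken from scrapers/. It assumes CTFTime's `start` and `finish` parameters are Unix timestamps in seconds.

```javascript
// Build the CTFTime events URL for a window that starts now.
// Assumption: `start` and `finish` are Unix timestamps in seconds.
function ctftimeEventsUrl(nowMs = Date.now(), days = 120, limit = 50) {
  const start = Math.floor(nowMs / 1000);
  const finish = start + days * 24 * 60 * 60;
  const url = new URL("https://ctftime.org/api/v1/events/");
  url.searchParams.set("limit", String(limit));
  url.searchParams.set("start", String(start));
  url.searchParams.set("finish", String(finish));
  return url.toString();
}

// Candidate write-ups repos, tried in the order listed in the table.
const WRITEUP_YEARS = [2024, 2017, 2016, 2015];

function writeupTreeUrl(year) {
  return `https://api.github.com/repos/ctfs/write-ups-${year}/git/trees/master?recursive=1`;
}

// Return the first tree that loads. `fetchJson` is injected (hypothetical
// helper: takes a URL, resolves to parsed JSON, rejects on failure) so the
// fallback order can be exercised without network access.
async function firstAvailableTree(fetchJson, years = WRITEUP_YEARS) {
  for (const year of years) {
    try {
      const tree = await fetchJson(writeupTreeUrl(year));
      return { year, tree };
    } catch (err) {
      // This repo failed; fall through to the next one in the list.
    }
  }
  throw new Error("no write-ups repository could be fetched");
}
```

Injecting the fetcher keeps the fallback logic separate from HTTP details such as the User-Agent header that GitHub's API expects.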

ANTHROPIC_API_KEY is the only secret needed. No GitHub auth — anonymous rate limit (60/hr) is plenty for one daily tree fetch + occasional README fetches.
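
For orientation, here is a minimal sketch of what one Messages API request with the code-execution tool might look like over raw HTTP. The helper name, prompt wording, `maxTokens` default, and the `model` value are placeholders and not the project's actual settings. It assumes the beta flags are sent as a comma-separated `anthropic-beta` header and that uploaded files are attached with `container_upload` content blocks.

```javascript
// Beta flags named in the data-sources table, sent as one comma-separated header.
const BETAS = ["code-execution-2025-05-22", "files-api-2025-04-14"];

// Build the arguments for fetch() for a single agent turn.
// `model`, `maxTokens` and the prompt text are illustrative placeholders.
function buildMessagesRequest({ apiKey, model, statement, fileIds = [], maxTokens = 4096 }) {
  const content = [
    { type: "text", text: `Solve this CTF challenge and print the flag.\n\n${statement}` },
    // Files previously uploaded via the Files API are attached to the sandbox by id.
    ...fileIds.map((file_id) => ({ type: "container_upload", file_id })),
  ];
  return {
    url: "https://api.anthropic.com/v1/messages",
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        "anthropic-beta": BETAS.join(","),
      },
      body: JSON.stringify({
        model,
        max_tokens: maxTokens,
        tools: [{ type: "code_execution_20250522", name: "code_execution" }],
        messages: [{ role: "user", content }],
      }),
    },
  };
}
```

A caller would pass the result straight to `fetch(url, init)` and read the key from `process.env.ANTHROPIC_API_KEY`. Looping until the model stops requesting tool use is left out of this sketch.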

HTTP API (all mounted at /ctf-ai-solvable)

| Method | Path | Purpose |
|---|---|---|
| GET | /health | {ok:true} |
| GET | /api/board?limit=&offset=&status=&category=&sort= | Joined challenges + best run, newest first (or by score) |
| GET | /api/board/stats | Counters: total challenges, solved %, avg score, avg wall, last-fetch timestamps |
| POST | /api/submit | multipart/form-data: title, statement, category?, model? (opus/sonnet), files[]?. Returns {run_id, challenge_id, slug} |
| GET | /api/challenges/:slug | {challenge, runs: [...]} |
| GET | /api/runs/:id | Poll target — status, score, tokens, turns, wall_ms, flag, band |
| GET | /api/runs/:id/transcript | Full structured per-turn transcript |
| GET | /api/events?limit= | CTFTime upcoming events (sorted by start time) |
| GET | /api/writeups?q=&category=&limit= | Writeups index, LIKE-search on name/CTF/path |
| POST | /api/writeups/:id/prefill | Lazily fetch the writeup's README from raw.githubusercontent, trim past the "Solution" header, return text + suggested title/category |
| GET | /c/:slug | HTML permalink with og:* and twitter:card meta tags for link-unfurl previews |
| GET | /c/:slug/card.svg | 1200×630 SVG share card, server-rendered, no external fonts |
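
A client can drive the submit and poll endpoints from the table with a few lines of Node 18+ code. This is a sketch under assumptions: the in-progress status names (`queued`, `running`) are guesses, since the table does not list the values of `status`, and `submitAndWait` with its options is a hypothetical helper. File attachments (`files[]`) are omitted for brevity.

```javascript
const BASE = "http://localhost:4798/ctf-ai-solvable";

// Assumed in-progress status names; the real values live in the server code.
const PENDING = new Set(["queued", "running"]);

// Submit a challenge, then poll /api/runs/:id until the run leaves a pending state.
async function submitAndWait(
  challenge,
  { base = BASE, fetchImpl = fetch, intervalMs = 2000, maxPolls = 300 } = {}
) {
  const form = new FormData();
  form.set("title", challenge.title);
  form.set("statement", challenge.statement);
  if (challenge.category) form.set("category", challenge.category);
  if (challenge.model) form.set("model", challenge.model); // "opus" or "sonnet"

  const submitted = await fetchImpl(`${base}/api/submit`, { method: "POST", body: form });
  if (!submitted.ok) throw new Error(`submit failed: HTTP ${submitted.status}`);
  const { run_id } = await submitted.json();

  for (let i = 0; i < maxPolls; i++) {
    const res = await fetchImpl(`${base}/api/runs/${run_id}`);
    if (!res.ok) throw new Error(`poll failed: HTTP ${res.status}`);
    const run = await res.json();
    if (!PENDING.has(run.status)) return run; // score, flag, band, etc.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`run ${run_id} did not finish after ${maxPolls} polls`);
}
```

The `fetchImpl` option exists only so the polling logic can be exercised with a stub. In normal use the global `fetch` is picked up by default.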

Run locally

npm install
ANTHROPIC_API_KEY=sk-ant-... PORT=4798 node server.js
# open http://localhost:4798/ctf-ai-solvable/

The server starts listening immediately. On first boot it pulls CTFTime events and the writeups index in the background, so the Events and Writeups tabs are populated before any agent run.
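
The daily refresh times in the data-sources table (06:00 and 06:15 UTC) could be scheduled with plain timers, as in the sketch below. This is a stand-in for whatever cron library server.js actually uses, and both function names are hypothetical.

```javascript
// Milliseconds from `now` until the next occurrence of hour:minute UTC.
// If that time has already passed today (or is exactly now), use tomorrow.
function msUntilNextUtc(hour, minute, now = new Date()) {
  const next = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hour, minute, 0, 0)
  );
  if (next.getTime() <= now.getTime()) next.setUTCDate(next.getUTCDate() + 1);
  return next.getTime() - now.getTime();
}

// Run `job` every day at hour:minute UTC, re-arming the timer after each run.
// Errors are logged rather than thrown so one failed refresh does not stop the schedule.
function scheduleDailyUtc(hour, minute, job) {
  const arm = () =>
    setTimeout(() => {
      Promise.resolve()
        .then(job)
        .catch((err) => console.error("scheduled refresh failed:", err));
      arm();
    }, msUntilNextUtc(hour, minute));
  return arm();
}
```

With this shape, `scheduleDailyUtc(6, 0, refreshEvents)` and `scheduleDailyUtc(6, 15, refreshWriteups)` would mirror the table, where both job names are placeholders for the scraper entry points.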

Architecture

server.js           express app + cron + bootstrap
config.js           env, constants, score-band helpers
db.js               better-sqlite3 (WAL), prepared statements
routes/             health · board · challenges · runs · events · writeups
scrapers/           ctftime · writeups (GitHub tree walker)
agent/              runner (Claude loop) · score · flag · files (upload helper)
views/share.js      HTML permalink + SVG OG card renderers
public/             vanilla-JS SPA (index.html, app.js, style.css)
data/               SQLite DB + uploaded attachments (gitignored)

Out of scope

No auth, no payments, no multi-agent, no local bash exec (every Python call goes through Anthropic's sandbox), no live-infra exploitation, no WebSockets, no tests, no Docker. See SPEC.md §6.

Honest limitations