syco-board — live LLM sycophancy leaderboard

#	Model	Provider	PS	MR	AB	DA	Score	Δsys	Run

What this is. A live leaderboard for one specific quality of large language models: sycophancy — the tendency to flatter the user, mirror their position, or play along with delusional statements rather than offer an accurate, neutral, or corrective response.

Why it matters. Sycophancy is what makes a model a bad therapist, a bad executive coach, a bad legal advisor, and a misleading research assistant. Capability benchmarks (MMLU, HumanEval, SWE-bench) do not measure it, and neither do hallucination benchmarks (Vectara HHEM). To our knowledge, there is no other public leaderboard for it.

Data sources. Everything here is fetched live from public APIs with no transformations beyond CSV/XML/JSON parsing and a composite average:

Primary scores: master_results.csv from timfduffy/syco-bench (refresh: 6h)
Papers: arXiv API, all:"sycophancy" (refresh: 6h)
Discussions: HN Algolia search (refresh: 3h)
Repositories: GitHub search, sycophancy benchmark (refresh: 24h)
Upstream changelog: GitHub commits API on the syco-bench repo (refresh: 12h)
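
As an illustration of the sources listed above, here is a minimal sketch of how the request URLs could be assembled. It is not syco-board's actual code. The function name, the max_results parameter, the sort order, and the exact Hacker News search term are assumptions, and the location of master_results.csv inside the upstream repository is deliberately left out because it is not stated here.

```python
from urllib.parse import urlencode

# Upstream repository named in the source list above.
UPSTREAM_REPO = "timfduffy/syco-bench"


def build_query_urls(max_results: int = 20) -> dict[str, str]:
    """Return one request URL per live source. Nothing is fetched here."""
    papers = "http://export.arxiv.org/api/query?" + urlencode({
        "search_query": 'all:"sycophancy"',   # query string quoted from the list above
        "sortBy": "submittedDate",            # assumption: newest papers first
        "sortOrder": "descending",
        "max_results": max_results,
    })
    discussions = "https://hn.algolia.com/api/v1/search?" + urlencode({
        "query": "sycophancy",                # assumption: the exact term is not stated
        "tags": "story",
    })
    repositories = "https://api.github.com/search/repositories?" + urlencode({
        "q": "sycophancy benchmark",
        "per_page": max_results,
    })
    changelog = (
        f"https://api.github.com/repos/{UPSTREAM_REPO}/commits?"
        + urlencode({"per_page": max_results})
    )
    return {
        "papers": papers,
        "discussions": discussions,
        "repositories": repositories,
        "changelog": changelog,
    }


if __name__ == "__main__":
    for name, url in build_query_urls(5).items():
        print(f"{name}: {url}")
```

Keeping URL construction separate from fetching makes each source easy to inspect and to refresh on its own schedule.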

Composite syco score. The unweighted mean of the four axes: Picking Sides (PS), Mirroring (MR), Attribution Bias (AB), and Delusion Acceptance (DA). Each axis is scored from 0 to 5, and lower is better. The four are only weakly correlated, so the composite is best understood as a coarse summary. Always look at the individual axes for a real comparison.
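
The composite described above can be sketched in a few lines. This is an illustrative sketch, not the site's code. The keys PS, MR, AB, and DA follow the leaderboard's column abbreviations; the column names in the upstream CSV may differ, and the range check is an added assumption.

```python
from statistics import mean

# Axis keys follow the leaderboard columns; upstream CSV names may differ.
AXES = ("PS", "MR", "AB", "DA")


def composite_score(row: dict) -> float:
    """Unweighted mean of the four axis scores (each 0-5, lower is better)."""
    values = [float(row[axis]) for axis in AXES]
    for axis, value in zip(AXES, values):
        if not 0.0 <= value <= 5.0:
            raise ValueError(f"{axis} score {value} is outside the 0-5 range")
    return mean(values)


if __name__ == "__main__":
    example = {"PS": 1.0, "MR": 2.0, "AB": 3.0, "DA": 2.0}
    print(composite_score(example))  # 2.0
```

Because the mean is unweighted, a model that is very poor on one axis and excellent on the other three can land on the same composite as a model that is mediocre on all four.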

Δsys. When the upstream benchmark has both a "with system prompt" and a "without system prompt" run for the same model, we show the difference between the two scores (with minus without). Because lower scores are better, a positive Δsys means the provider's own web-interface system prompt is making the model more sycophantic in real-world deployments. This is the single most interesting number on the page.
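
A minimal sketch of the Δsys calculation follows. The sign convention (with minus without) is inferred from the statement that a positive value means more sycophancy. The function name and the use of None for a missing run are assumptions for illustration.

```python
from typing import Optional


def delta_sys(with_prompt: Optional[float],
              without_prompt: Optional[float]) -> Optional[float]:
    """Score with the system prompt minus score without it.

    Returns None when either run is missing, so no delta is shown.
    Positive means the system prompt raised the sycophancy score.
    """
    if with_prompt is None or without_prompt is None:
        return None
    return with_prompt - without_prompt


if __name__ == "__main__":
    print(delta_sys(2.5, 2.0))   # 0.5: more sycophantic with the system prompt
    print(delta_sys(None, 2.0))  # None: only one run available
```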

What this is not. Not a benchmark itself — credit and methodology belong to Tim Duffy and the syco-bench project. Not a replacement for reading the paper. Not a substitute for running your own evaluations on your own use case.

syco-board itself is open source. No accounts, no tracking, no ads. Refreshed automatically — if a source is stale, the badge in the top right will say so.
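
One way the staleness check behind the badge could work is sketched below. The refresh intervals are the ones listed under Data sources. Everything else is an assumption: the source keys, the function name, and the rule that a source counts as stale once its age exceeds its refresh interval times a grace factor.

```python
from datetime import datetime, timedelta, timezone

# Refresh intervals in hours, taken from the data source list.
REFRESH_HOURS = {
    "scores": 6,
    "papers": 6,
    "discussions": 3,
    "repositories": 24,
    "changelog": 12,
}


def stale_sources(last_fetched: dict, now: datetime = None,
                  grace: float = 1.0) -> list:
    """Return the sorted names of sources whose last fetch is too old.

    Assumed rule: stale when age > refresh interval * grace.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for name, fetched_at in last_fetched.items():
        limit = timedelta(hours=REFRESH_HOURS[name] * grace)
        if now - fetched_at > limit:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    fetched = {
        "scores": now - timedelta(hours=7),       # past its 6h interval
        "papers": now - timedelta(hours=1),       # fresh
        "discussions": now - timedelta(hours=4),  # past its 3h interval
    }
    print(stale_sources(fetched, now))  # ['discussions', 'scores']
```

A grace factor above 1.0 would tolerate a late refresh before the badge changes.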

Which models push back?

Does the provider's own system prompt make their model more sycophantic?

Recent sycophancy papers (arXiv)

Hacker News discussions

Other sycophancy benchmarks

Upstream changelog

About syco-board