LIVE BENCHMARKS

AI model benchmarks — plus the one number no one else publishes.

Intelligence Index, GPQA Diamond, MATH-500, LiveCodeBench, MMLU-Pro, AIME 2025 — refreshed as labs publish. Then build a council and see how its ceiling compares to the best individual model.


Full benchmark table

Model (lab)	Intelligence Index	GPQA Diamond	MATH-500	LiveCodeBench	MMLU-Pro	AIME 2025	Context (tokens)	Updated
Claude Opus 4.7 Anthropic	76.0	89.0	96.0	84.0	87.5	91.0	200k	2026-06-03
GPT-5.1 OpenAI	73.0	87.0	96.5	81.0	85.0	92.0	256k	2026-06-03
Claude Sonnet 4.5 Anthropic	70.0	84.0	94.0	79.0	84.0	88.0	200k	2026-06-03
Qwen 3.6 Plus Alibaba	66.0	79.0	91.0	72.0	78.0	82.0	1000k	2026-06-03
Kimi K2.5 Moonshot	64.0	76.0	90.0	70.0	76.0	80.0	200k	2026-06-03
GLM 5.1 Zhipu	60.0	72.0	85.0	65.0	73.0	—	128k	2026-06-03
Command A Cohere	—	—	—	—	—	—	256k	2026-06-04
Gemini 3.1 Pro Google	—	—	—	—	—	—	1049k	2026-06-04
Grok 4.3 xAI	—	—	—	—	—	—	1000k	2026-06-04
DeepSeek V4 Pro DeepSeek	—	—	—	—	—	—	1049k	2026-06-04
Llama 3.3 70B Meta	—	—	—	—	—	—	131k	2026-06-04
Mistral Large 3 Mistral	—	—	—	—	—	—	262k	2026-06-04

What each benchmark measures

Intelligence Index — Artificial Analysis composite. Best single-number proxy for general capability.
GPQA Diamond — graduate-level science Q&A; domain experts score roughly 65% or higher on this hard subset.
MATH-500 — competition math; tests step-by-step reasoning, not pattern matching.
LiveCodeBench — competitive programming problems published after each model's training cutoff; resistant to contamination.
MMLU-Pro — expert-domain reasoning across 14 disciplines; the harder successor to MMLU.
AIME 2025 — American Invitational Mathematics Examination, 2025 edition; recent enough to be unseen by most training corpora.

The Council Ceiling — the number no one else publishes

For each benchmark suite, we compute the maximum score across the council members you select. We treat that number as the ceiling on what a synthesis chairman could produce from the council's responses. The ceiling of a council of four frontier models routinely exceeds the best individual model's score on four or five of the six suites at once; that gap is the value of multi-model deliberation, made empirical.
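
As a rough illustration of the calculation described above, the sketch below takes the per-suite maximum across a selected council and lists the suites where that ceiling exceeds a chosen baseline model. The scores are copied from the table on this page. The names SCORES, council_ceiling, and suites_above_baseline are hypothetical, not part of any published API, and the sketch assumes blank cells are skipped rather than counted as zero.

```python
from typing import Optional

SUITES = ["Intelligence Index", "GPQA Diamond", "MATH-500",
          "LiveCodeBench", "MMLU-Pro", "AIME 2025"]

# Scores copied from the table above, in SUITES order; None marks a blank cell.
SCORES: dict[str, list[Optional[float]]] = {
    "Claude Opus 4.7": [76.0, 89.0, 96.0, 84.0, 87.5, 91.0],
    "GPT-5.1": [73.0, 87.0, 96.5, 81.0, 85.0, 92.0],
    "Claude Sonnet 4.5": [70.0, 84.0, 94.0, 79.0, 84.0, 88.0],
    "GLM 5.1": [60.0, 72.0, 85.0, 65.0, 73.0, None],
}


def council_ceiling(members: list[str]) -> dict[str, Optional[float]]:
    """Per-suite maximum across the selected members, ignoring blank cells."""
    ceiling: dict[str, Optional[float]] = {}
    for i, suite in enumerate(SUITES):
        known = [SCORES[m][i] for m in members if SCORES[m][i] is not None]
        ceiling[suite] = max(known) if known else None
    return ceiling


def suites_above_baseline(members: list[str], baseline: str) -> list[str]:
    """Suites where the council ceiling is strictly above the baseline model."""
    ceiling = council_ceiling(members)
    above = []
    for i, suite in enumerate(SUITES):
        base = SCORES[baseline][i]
        top = ceiling[suite]
        if base is not None and top is not None and top > base:
            above.append(suite)
    return above


council = ["Claude Opus 4.7", "GPT-5.1", "Claude Sonnet 4.5", "GLM 5.1"]
print(council_ceiling(council))
print(suites_above_baseline(council, "Claude Opus 4.7"))
print(suites_above_baseline(council, "GPT-5.1"))
```

With the table values used here, this council's ceiling sits above Claude Opus 4.7 on MATH-500 and AIME 2025, and above GPT-5.1 on the other four suites. Which model counts as the baseline is a choice the caller makes.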

How to read the live numbers

A model that wins Intelligence Index but loses LiveCodeBench is great at reasoning narratives but weak at producing working code under time pressure. A model that wins MATH-500 but loses MMLU-Pro is strong at math-shaped problems and weaker on cross-domain knowledge. No model wins everything.

A model dominating only one suite is suspicious. A model in the top quartile of three suites is real.
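
One way to apply the rule of thumb above is to count, per model, the suites where it ranks in the top quartile. The sketch below does that for the six scored rows in the table. It assumes "top quartile" means the top ceil(n/4) ranks among models with a published score on that suite; that definition and the function name top_quartile_counts are illustrative assumptions, not a description of how this page computes anything.

```python
import math
from typing import Optional

SUITES = ["Intelligence Index", "GPQA Diamond", "MATH-500",
          "LiveCodeBench", "MMLU-Pro", "AIME 2025"]

# Scores copied from the table above, in SUITES order; None marks a blank cell.
SCORES: dict[str, list[Optional[float]]] = {
    "Claude Opus 4.7": [76.0, 89.0, 96.0, 84.0, 87.5, 91.0],
    "GPT-5.1": [73.0, 87.0, 96.5, 81.0, 85.0, 92.0],
    "Claude Sonnet 4.5": [70.0, 84.0, 94.0, 79.0, 84.0, 88.0],
    "Qwen 3.6 Plus": [66.0, 79.0, 91.0, 72.0, 78.0, 82.0],
    "Kimi K2.5": [64.0, 76.0, 90.0, 70.0, 76.0, 80.0],
    "GLM 5.1": [60.0, 72.0, 85.0, 65.0, 73.0, None],
}


def top_quartile_counts(scores: dict[str, list[Optional[float]]]) -> dict[str, int]:
    """Count, per model, the suites where it ranks in the top quartile."""
    counts = {model: 0 for model in scores}
    for i in range(len(SUITES)):
        # Rank only the models that have a score on this suite.
        ranked = sorted(
            (m for m in scores if scores[m][i] is not None),
            key=lambda m: scores[m][i],
            reverse=True,
        )
        # Assumption: top quartile = top ceil(n/4) ranks.
        cutoff = math.ceil(len(ranked) / 4)
        for model in ranked[:cutoff]:
            counts[model] += 1
    return counts


counts = top_quartile_counts(SCORES)
for model, n in counts.items():
    verdict = "three or more suites" if n >= 3 else "fewer than three suites"
    print(f"{model}: top quartile on {n} ({verdict})")
```

With only six scored models the quartile cutoff is two ranks per suite, so the result is coarse; the check becomes more informative as more rows gain scores.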

Why not trust vendor benchmarks

Every lab publishes the suites where it wins, and labs train extensively to win specific benchmarks. The way to spot this is to look at live benchmarks built from problems released after the model's training cutoff (LiveCodeBench, AIME 2025) and to cross-reference multiple suites.

Frequently asked questions

How often are benchmarks updated?

Weekly. We refresh scores from Artificial Analysis, LiveCodeBench, and lab-published reports. Each row shows a last-updated date.

Where do the scores come from?

Third-party sources where they exist (Artificial Analysis, LiveCodeBench leaderboards), and lab-published model cards otherwise. We never average vendor charts.

Why is a cell blank?

If a cell is blank, the lab has not released a score and no third-party benchmark has evaluated the model yet. We do not fabricate numbers.
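
If you copy the table into your own tooling, blank cells are best kept as missing values rather than zeros. The sketch below parses one row under the assumption that the table is available as tab-separated text with a dash character marking blanks, as it appears on this page. The function name parse_row and the returned dictionary layout are hypothetical.

```python
from typing import Optional

SUITES = ["Intelligence Index", "GPQA Diamond", "MATH-500",
          "LiveCodeBench", "MMLU-Pro", "AIME 2025"]

BLANK = "\u2014"  # the dash character the table uses for a blank cell


def parse_score(cell: str) -> Optional[float]:
    """Return the score as a float, or None when the cell is blank."""
    cell = cell.strip()
    return None if cell == BLANK else float(cell)


def parse_row(line: str) -> dict:
    """Split one tab-separated table row into named fields."""
    cells = line.split("\t")
    if len(cells) != 9:
        raise ValueError(f"expected 9 columns, got {len(cells)}")
    return {
        "model": cells[0],
        "scores": {s: parse_score(c) for s, c in zip(SUITES, cells[1:7])},
        "context": cells[7],
        "updated": cells[8],
    }


row = parse_row(
    "GLM 5.1 Zhipu\t60.0\t72.0\t85.0\t65.0\t73.0\t\u2014\t128k\t2026-06-03"
)
print(row["scores"])
```

Keeping None for blanks means later steps, such as a per-suite maximum, can skip the model on that suite instead of ranking it last.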

How is the Council Ceiling calculated?

For each suite, we take the maximum score across the models you selected. It represents the upper bound on what a synthesis chairman could weave from the council's responses.

Can I run my own benchmark on these models?

Yes. Sign up free, build a custom council, and run your own prompts side by side.

Why not trust vendor benchmarks?

Every lab publishes the suites where it wins. The defense is cross-referencing multiple suites and weighting newer, contamination-resistant ones higher.
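
Weighting newer suites higher can be as simple as a weighted mean. The sketch below is one possible version: the weights are hypothetical (LiveCodeBench and AIME 2025 count double), the Intelligence Index is left out because it is already a composite, and blank cells are skipped with the remaining weights renormalized. None of this is a published scoring method.

```python
from typing import Optional

# Hypothetical weights: suites built from recent problems count double.
WEIGHTS = {
    "GPQA Diamond": 1.0,
    "MATH-500": 1.0,
    "LiveCodeBench": 2.0,
    "MMLU-Pro": 1.0,
    "AIME 2025": 2.0,
}


def weighted_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the suites that have a score; None if there are none."""
    total = 0.0
    weight_sum = 0.0
    for suite, weight in WEIGHTS.items():
        value = scores.get(suite)
        if value is None:
            continue  # skip blank cells and renormalize over the rest
        total += weight * value
        weight_sum += weight
    return total / weight_sum if weight_sum else None


# Values copied from the table above.
opus = {"GPQA Diamond": 89.0, "MATH-500": 96.0, "LiveCodeBench": 84.0,
        "MMLU-Pro": 87.5, "AIME 2025": 91.0}
glm = {"GPQA Diamond": 72.0, "MATH-500": 85.0, "LiveCodeBench": 65.0,
       "MMLU-Pro": 73.0, "AIME 2025": None}

print(round(weighted_score(opus), 2))
print(round(weighted_score(glm), 2))
```

Because missing suites are dropped rather than scored as zero, two models with different blank cells are averaged over different suites, so compare their weighted scores with care.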