LIVE BENCHMARKS
Intelligence Index, GPQA Diamond, MATH-500, LiveCodeBench, MMLU-Pro, and AIME 2025 scores, refreshed as labs publish. Then build a council and see how its ceiling compares with the best individual model.
| Model | Intelligence Index | GPQA Diamond | MATH-500 | LiveCodeBench | MMLU-Pro | AIME 2025 | Context | Updated |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 Anthropic | 76.0 | 89.0 | 96.0 | 84.0 | 87.5 | 91.0 | 200k | 2026-06-03 |
| GPT-5.1 OpenAI | 73.0 | 87.0 | 96.5 | 81.0 | 85.0 | 92.0 | 256k | 2026-06-03 |
| Claude Sonnet 4.5 Anthropic | 70.0 | 84.0 | 94.0 | 79.0 | 84.0 | 88.0 | 200k | 2026-06-03 |
| Qwen 3.6 Plus Alibaba | 66.0 | 79.0 | 91.0 | 72.0 | 78.0 | 82.0 | 1000k | 2026-06-03 |
| Kimi K2.5 Moonshot | 64.0 | 76.0 | 90.0 | 70.0 | 76.0 | 80.0 | 200k | 2026-06-03 |
| GLM 5.1 Zhipu | 60.0 | 72.0 | 85.0 | 65.0 | 73.0 | — | 128k | 2026-06-03 |
| Command A Cohere | — | — | — | — | — | — | 256k | 2026-06-04 |
| Gemini 3.1 Pro | — | — | — | — | — | — | 1049k | 2026-06-04 |
| Grok 4.3 xAI | — | — | — | — | — | — | 1000k | 2026-06-04 |
| DeepSeek V4 Pro DeepSeek | — | — | — | — | — | — | 1049k | 2026-06-04 |
| Llama 3.3 70B Meta | — | — | — | — | — | — | 131k | 2026-06-04 |
| Mistral Large 3 Mistral | — | — | — | — | — | — | 262k | 2026-06-04 |
For each benchmark suite, we compute the maximum score across the council members you select. That number is the upper bound on what a synthesis chairman could produce from the council's responses. A council of 4 frontier models routinely beats the best individual model on 4 or 5 of the 6 suites simultaneously. That gap is the value of multi-model deliberation, made empirical.
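The per-suite maximum described above can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: the `SCORES` layout and the `council_ceiling` function are hypothetical names, the scores are copied from a few rows of the table above, and `None` stands in for a blank cell.

```python
from typing import Optional

SUITES = ["Intelligence Index", "GPQA Diamond", "MATH-500",
          "LiveCodeBench", "MMLU-Pro", "AIME 2025"]

# Scores copied from the table above, in SUITES order; None marks a blank cell.
SCORES: dict[str, list[Optional[float]]] = {
    "Claude Opus 4.7":   [76.0, 89.0, 96.0, 84.0, 87.5, 91.0],
    "GPT-5.1":           [73.0, 87.0, 96.5, 81.0, 85.0, 92.0],
    "Claude Sonnet 4.5": [70.0, 84.0, 94.0, 79.0, 84.0, 88.0],
    "Qwen 3.6 Plus":     [66.0, 79.0, 91.0, 72.0, 78.0, 82.0],
    "GLM 5.1":           [60.0, 72.0, 85.0, 65.0, 73.0, None],
    "Command A":         [None, None, None, None, None, None],
}


def council_ceiling(members: list[str]) -> dict[str, Optional[float]]:
    """Return the best known score per suite across the selected members.

    Blank cells are skipped rather than treated as zero. A suite with no
    known score among the members stays None.
    """
    ceiling: dict[str, Optional[float]] = {}
    for i, suite in enumerate(SUITES):
        known = [SCORES[m][i] for m in members if SCORES[m][i] is not None]
        ceiling[suite] = max(known) if known else None
    return ceiling


council = ["Claude Opus 4.7", "GPT-5.1", "Claude Sonnet 4.5", "Qwen 3.6 Plus"]
for suite, best in council_ceiling(council).items():
    print(f"{suite}: {best}")
```

Skipping blank cells instead of scoring them as zero keeps an unscored member from dragging a suite down or silently dropping out of the comparison.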
A model that wins Intelligence Index but loses LiveCodeBench is strong at reasoning narratives but weak at producing working code under time pressure. A model that wins MATH-500 but loses MMLU-Pro is strong at math-shaped problems but weaker on cross-domain knowledge. No model wins everything.
Every lab publishes the suites where it wins, and labs train extensively to win specific benchmarks. The way to spot this is to look at live benchmarks released after the model's training cutoff (LiveCodeBench, AIME 2025) and to cross-reference multiple suites. A model dominating only one suite is suspicious; a model in the top quartile of three suites is real.
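The cross-referencing check can be sketched as a count of how many suites place a model in the top quartile. This is an illustrative sketch under stated assumptions: "top quartile" is taken here to mean the top `ceil(n / 4)` of the `n` models with a score on that suite (ties included), and the names `top_quartile_counts` and `is_broadly_strong` are hypothetical. The page does not specify its own definition.

```python
import math

# Scores copied from the scored rows of the table above.
# A missing key means the cell is blank.
SCORES = {
    "Claude Opus 4.7": {"Intelligence Index": 76.0, "GPQA Diamond": 89.0, "MATH-500": 96.0,
                        "LiveCodeBench": 84.0, "MMLU-Pro": 87.5, "AIME 2025": 91.0},
    "GPT-5.1": {"Intelligence Index": 73.0, "GPQA Diamond": 87.0, "MATH-500": 96.5,
                "LiveCodeBench": 81.0, "MMLU-Pro": 85.0, "AIME 2025": 92.0},
    "Claude Sonnet 4.5": {"Intelligence Index": 70.0, "GPQA Diamond": 84.0, "MATH-500": 94.0,
                          "LiveCodeBench": 79.0, "MMLU-Pro": 84.0, "AIME 2025": 88.0},
    "Qwen 3.6 Plus": {"Intelligence Index": 66.0, "GPQA Diamond": 79.0, "MATH-500": 91.0,
                      "LiveCodeBench": 72.0, "MMLU-Pro": 78.0, "AIME 2025": 82.0},
    "Kimi K2.5": {"Intelligence Index": 64.0, "GPQA Diamond": 76.0, "MATH-500": 90.0,
                  "LiveCodeBench": 70.0, "MMLU-Pro": 76.0, "AIME 2025": 80.0},
    "GLM 5.1": {"Intelligence Index": 60.0, "GPQA Diamond": 72.0, "MATH-500": 85.0,
                "LiveCodeBench": 65.0, "MMLU-Pro": 73.0},
}


def top_quartile_counts(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Count, per model, the suites where it lands in the top quartile."""
    suites = {suite for per_model in scores.values() for suite in per_model}
    counts = {model: 0 for model in scores}
    for suite in suites:
        ranked = sorted((m[suite] for m in scores.values() if suite in m), reverse=True)
        k = max(1, math.ceil(len(ranked) / 4))  # assumed quartile size
        cutoff = ranked[k - 1]
        for model, per_model in scores.items():
            if suite in per_model and per_model[suite] >= cutoff:
                counts[model] += 1
    return counts


def is_broadly_strong(model: str, scores: dict[str, dict[str, float]],
                      min_suites: int = 3) -> bool:
    """Apply the 'top quartile of three suites' rule of thumb."""
    return top_quartile_counts(scores)[model] >= min_suites


print(top_quartile_counts(SCORES))
```

With only six scored models the quartile is coarse (the top two per suite), so the result is best read as a rough screen rather than a ranking.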
Scores are refreshed weekly from Artificial Analysis, LiveCodeBench, and lab-published reports. Each row shows its last-updated date.
Scores come from third-party sources where they exist (Artificial Analysis, LiveCodeBench leaderboards), and from lab-published model cards otherwise. We never average vendor charts.
If a row is blank, the lab has not released a score and no third-party benchmark has evaluated the model yet. We do not fabricate numbers.
For each suite, we take the maximum score across the models you selected. It represents the upper bound on what a synthesis chairman could weave from the council's responses.
Yes, you can run your own comparison: sign up free, build a custom council, and run your own prompts side by side.
Every lab publishes the suites where it wins. The defense is cross-referencing multiple suites and weighting newer, contamination-resistant ones higher.
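One way to weight newer suites higher is a weighted mean over the scores a model actually has. The sketch below is only an illustration: the weights are hypothetical (2.0 for LiveCodeBench and AIME 2025, 1.0 for the rest) and are not values used by this page, `weighted_score` is a made-up name, and Intelligence Index is left out on the assumption that a composite index should not be averaged with individual suites.

```python
from typing import Optional

# Hypothetical weights: newer, contamination-resistant suites count double.
WEIGHTS = {
    "GPQA Diamond": 1.0,
    "MATH-500": 1.0,
    "MMLU-Pro": 1.0,
    "LiveCodeBench": 2.0,
    "AIME 2025": 2.0,
}


def weighted_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the suites that have a score.

    Blank suites (None or missing) are dropped from both the numerator and
    the denominator, so a missing score is not treated as zero.
    """
    total = 0.0
    weight_sum = 0.0
    for suite, weight in WEIGHTS.items():
        value = scores.get(suite)
        if value is None:
            continue
        total += weight * value
        weight_sum += weight
    return total / weight_sum if weight_sum else None


# Rows copied from the table above.
opus = {"GPQA Diamond": 89.0, "MATH-500": 96.0, "LiveCodeBench": 84.0,
        "MMLU-Pro": 87.5, "AIME 2025": 91.0}
glm = {"GPQA Diamond": 72.0, "MATH-500": 85.0, "LiveCodeBench": 65.0,
       "MMLU-Pro": 73.0, "AIME 2025": None}

print(weighted_score(opus))
print(weighted_score(glm))
```

Because blank suites are dropped, two models scored on different sets of suites are not strictly comparable; a stricter variant could refuse to score any model with a gap.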