Frontier model leaderboard
Frontier-class models ranked by their strongest tracked benchmark, alongside context window, API price, and how each one can be accessed.
This is a reading aid, not a single-number ranking. Frontier labs report different benchmarks, so the table shows the two evals that appear most consistently across flagships: GPQA Diamond (graduate-level science reasoning) and SWE-bench Verified (resolving real GitHub issues). Models are ordered by the higher of their two scores.
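The ordering rule can be sketched in a few lines. This is one reading of the rule, not the catalog's actual code: the tuple layout, the `strongest` and `rank` helper names, and the choice to place models with no score last are all assumptions made for illustration. The scores are copied from the table.

```python
from typing import Optional

# Illustrative rows: (model, GPQA Diamond %, SWE-bench Verified %).
# None means the catalog has no sourced claim for that benchmark.
rows = [
    ("Kimi K2 Instruct 0905", None, 69.2),
    ("Gemini 3.5 Pro", None, None),
    ("Claude Opus 4.8", 89.0, 81.5),
    ("Grok 4.3", 86.0, None),
]

def strongest(gpqa: Optional[float], swe: Optional[float]) -> Optional[float]:
    """Return the higher of the two scores, ignoring missing ones."""
    present = [s for s in (gpqa, swe) if s is not None]
    return max(present) if present else None

def rank(rows):
    """Sort by strongest score, descending; unscored models go last (assumption)."""
    def key(row):
        best = strongest(row[1], row[2])
        # False sorts before True, so scored models come first.
        return (best is None, -(best if best is not None else 0.0))
    return sorted(rows, key=key)

ranked = [name for name, _, _ in rank(rows)]
print(ranked)
```

Because a missing score is ignored rather than treated as zero, a model with a single reported benchmark can still rank on that one number.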
A model near the top is not automatically “the best” for your use case: price, context window, access model, and whether a number is verified all matter. Use this to narrow the field, then open a model page for its full record.
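Narrowing the field by context window and price might look like the sketch below. The `Model` record, the `shortlist` helper, and the thresholds are hypothetical. The sketch also assumes that "1M" means 1,000,000 tokens and that a model with an unlisted value fails the filter instead of passing it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    name: str
    license: str
    context_tokens: Optional[int]   # None = not listed in the table
    price_in: Optional[float]       # input price in the table's units; None = not listed

# A few rows transcribed from the table.
catalog = [
    Model("GLM-5.2", "MIT", 1_000_000, None),
    Model("Claude Opus 4.8", "Proprietary", 500_000, 15.0),
    Model("Grok 4.3", "Proprietary", 1_000_000, 1.25),
    Model("Qwen3.7-Max", "Proprietary", 1_000_000, 2.5),
    Model("GLM-5", "MIT", None, None),
]

def shortlist(models, min_context: int, max_price_in: float):
    """Keep models that are known to meet both requirements."""
    keep = []
    for m in models:
        if m.context_tokens is None or m.context_tokens < min_context:
            continue  # unknown or too small a context window
        if m.price_in is None or m.price_in > max_price_in:
            continue  # unknown or too high an input price
        keep.append(m.name)
    return keep

picked = shortlist(catalog, min_context=1_000_000, max_price_in=3.0)
print(picked)
```

Treating unlisted values as a failed check is the strict choice. A looser filter could keep those models and flag them for a manual look at the model page.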
Benchmark values are included only where the catalog has a sourced claim. Vendor figures remain self-reported until independently verified.
| Model | Lab | Access | Context | Price (in / out) | GPQA Diamond | SWE-bench Verified | Released | Source |
|---|---|---|---|---|---|---|---|---|
| Kimi K2.6 | Moonshot | Modified MIT | 256K | — | 90.5% | 80.2% | Mar 30, 2026 | source |
| GLM-5.2 | Z.ai | MIT | 1M | — | 91.2% | - | Jun 17, 2026 | source |
| Claude Opus 4.8 | Anthropic | Proprietary | 500K | $15 / $75 | 89% | 81.5% | May 28, 2026 | source |
| GPT-5.6 | OpenAI | Proprietary | 1.5M | — | 88.1% | 76.4% | Jun 9, 2026 | source |
| Kimi K2.5 | Moonshot | Modified MIT | 256K | — | 87.6% | 76.8% | Jan 27, 2026 | source |
| GPT-5.4 | OpenAI | Proprietary | 400K | — | 86.5% | 74.9% | Mar 5, 2026 | source |
| GLM-5 | Z.ai | MIT | — | — | 86% | 77.8% | Feb 11, 2026 | source |
| GLM-4.7 | Z.ai | MIT | — | — | 85.7% | 73.8% | Jan 8, 2026 | source |
| GLM-5.1 | Z.ai | MIT | — | — | 86.2% | - | Apr 8, 2026 | source |
| Grok 4.3 | xAI | Proprietary | 1M | $1.25 / $2.5 | 86% | - | May 6, 2026 | source |
| Kimi K2 Thinking | Moonshot | Modified MIT | 256K | — | 84.5% | 71.3% | Nov 6, 2025 | source |
| DeepSeek-V3.2 | DeepSeek | MIT | 128K | — | 82.4% | 70% | Dec 1, 2025 | source |
| DeepSeek-R1-0528 | DeepSeek | MIT | 128K | — | 81% | 57.6% | May 28, 2025 | source |
| Kimi K2 Instruct | Moonshot | Modified MIT | 128K | — | 75.1% | 65.8% | Jul 11, 2025 | source |
| DeepSeek-R1 | DeepSeek | MIT | 128K | — | 71.5% | 49.2% | Jan 20, 2025 | source |
| MiniMax-M1-80k | MiniMax | Apache-2.0 | 1M | — | 70% | 56% | Jun 16, 2025 | source |
| Kimi K2 Instruct 0905 | Moonshot | Modified MIT | 256K | — | - | 69.2% | Sep 5, 2025 | source |
| Kimi K2.7 Code | Moonshot | Modified MIT | 262K | $0.95 / $4 | - | - | Jun 18, 2026 | source |
| MiniMax-M3 | MiniMax | MiniMax Community License | 1M | — | - | - | Jun 16, 2026 | source |
| Claude Fable 5 | Anthropic | Proprietary | — | — | - | - | Jun 9, 2026 | source |
| Nemotron 3 Ultra 550B-A55B | NVIDIA | Nemotron Open Model License | 1M | — | - | - | Jun 4, 2026 | source |
| MiniMax-M2.7 | MiniMax | MiniMax Model License | — | — | - | - | May 26, 2026 | source |
| Gemini 3.5 Pro | DeepMind | Proprietary | 2M | — | - | - | May 19, 2026 | source |
| Qwen3.7-Max | Qwen | Proprietary | 1M | $2.5 / $7.5 | - | - | May 19, 2026 | source |
| ERNIE 5.1 | Baidu | Proprietary | 128K | $0.59 / $2.65 | - | - | May 8, 2026 | source |
| DeepSeek V4-Pro | DeepSeek | MIT | 1M | — | - | - | Apr 24, 2026 | source |
| Claude Mythos | Anthropic | Proprietary | — | — | - | - | Apr 7, 2026 | source |
| Nemotron 3 Super 120B-A12B | NVIDIA | Nemotron Open Model License | 1M | — | - | - | Mar 16, 2026 | source |
| Qwen3.5-397B | Qwen | Apache-2.0 | 1M | — | - | - | Feb 20, 2026 | source |
| Gemini 3.1 Pro | DeepMind | Proprietary | 2M | — | - | - | Feb 19, 2026 | source |
| Claude Opus 4.6 | Anthropic | Proprietary | 200K | $15 / $75 | - | - | Feb 5, 2026 | source |
| Mistral Large 3 | Mistral | Mistral Research / Commercial | 256K | — | - | - | Dec 2, 2025 | source |
| DeepSeek-V3.2-Speciale | DeepSeek | MIT | 128K | — | - | - | Dec 1, 2025 | source |
| GLM-4.6 | Z.ai | MIT | 200K | — | - | - | Sep 30, 2025 | source |
| GLM-4.5 | Z.ai | MIT | 128K | — | - | - | Jul 28, 2025 | source |
| Llama 4 Maverick | Meta | Llama 4 Community License | 1M | — | - | - | Apr 5, 2025 | source |
Frequently asked questions
What counts as a frontier model here?
A frontier model is one a lab positions at or near the capability ceiling of the field at release — typically its flagship. LLM Releases tags this with an explicit frontier flag rather than inferring it from benchmark scores, so a model can be frontier-class even when its numbers are not yet independently verified.
Why are some benchmark cells empty?
A cell is left empty (shown as a dash) when the catalog has no sourced claim for that model on that benchmark. We do not copy numbers across benchmarks or fill gaps with estimates, so a missing value means “not reported with a source”, not “scored zero”.
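For anyone processing the table programmatically, the distinction matters: a missing cell should become an explicit "no value", never `0`. The sketch below is a minimal illustration. The `parse_score` and `mean_reported` helpers are hypothetical, as is the set of placeholder strings treated as missing.

```python
from typing import Optional

# Placeholder strings treated as "no sourced claim" (assumption).
MISSING = {"", "-", "\u2014"}

def parse_score(cell: str) -> Optional[float]:
    """Turn a cell such as '91.2%' into 91.2, or a placeholder into None."""
    text = cell.strip()
    if text in MISSING:
        return None
    return float(text.rstrip("%"))

def mean_reported(cells) -> Optional[float]:
    """Average only the scores that were actually reported."""
    scores = [s for s in map(parse_score, cells) if s is not None]
    return sum(scores) / len(scores) if scores else None

# SWE-bench Verified cells for three rows of the table, one of them missing.
cells = ["80%", "-", "70%"]
print(mean_reported(cells))
```

Had the missing cell been read as zero, the average over these three cells would be pulled down by a score no one ever reported.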
Are these scores verified?
Most published figures are self-reported by the vendor under conditions they choose. We record them as claims and label them self-reported until an independent evaluation confirms the result. Treat the ranking as a starting point, then follow the source link.
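One way to model this claim-then-verify workflow is sketched below. The `Claim` record, the `Status` values, and the `confirm` helper are hypothetical names, and the source string is a placeholder. The catalog's real schema is not described here.

```python
from dataclasses import dataclass, replace
from enum import Enum

class Status(Enum):
    SELF_REPORTED = "self-reported"
    VERIFIED = "verified"

@dataclass(frozen=True)
class Claim:
    model: str
    benchmark: str
    value: float
    source: str                              # placeholder for the source link
    status: Status = Status.SELF_REPORTED    # every claim starts as self-reported

def confirm(claim: Claim) -> Claim:
    """Return a copy marked verified; the original record is left untouched."""
    return replace(claim, status=Status.VERIFIED)

claim = Claim("Claude Opus 4.8", "SWE-bench Verified", 81.5, "vendor announcement (placeholder)")
checked = confirm(claim)
print(claim.status.value, "->", checked.status.value)
```

Keeping records immutable means a verification produces a new record, so the original self-reported claim stays available for comparison.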