LLM Releases

Frontier model leaderboard

Frontier-class models ranked by their strongest tracked benchmark, alongside context window, API price, and how each one can be accessed.

This is a reading aid, not a single-number ranking. Frontier labs report different benchmarks, so the table shows the two evals that appear most consistently across flagships, GPQA Diamond (graduate-level science reasoning) and SWE-bench Verified (resolving real GitHub issues), and orders models by their best score on either one. Models with neither score are listed after the scored ones, newest release first.
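The ordering rule described above can be sketched in a few lines of Python. This is a minimal sketch, not the site's implementation: the `Row` record and its field names are hypothetical, and placing unscored models newest-first is inferred from the table rather than stated as policy. A missing score is `None`, never 0, so a model with one sourced benchmark is ranked on that one alone.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Row:
    # Hypothetical record shape; field names are illustrative only.
    model: str
    released: date
    gpqa: Optional[float] = None       # None = no sourced claim, not zero
    swe_bench: Optional[float] = None  # None = no sourced claim, not zero


def strongest(row: Row) -> Optional[float]:
    """Best of the tracked benchmark scores, ignoring missing values."""
    scores = [s for s in (row.gpqa, row.swe_bench) if s is not None]
    return max(scores) if scores else None


def rank(rows: list[Row]) -> list[Row]:
    # Scored models first, highest strongest score first.
    scored = [r for r in rows if strongest(r) is not None]
    scored.sort(key=strongest, reverse=True)
    # Models with no sourced score follow, newest release first.
    unscored = [r for r in rows if strongest(r) is None]
    unscored.sort(key=lambda r: r.released, reverse=True)
    return scored + unscored


rows = [
    Row("Kimi K2.6", date(2026, 3, 30), gpqa=90.5, swe_bench=80.2),
    Row("GLM-5.2", date(2026, 6, 17), gpqa=91.2),
    Row("Kimi K2 Instruct 0905", date(2025, 9, 5), swe_bench=69.2),
    Row("Gemini 3.5 Pro", date(2026, 5, 19)),
    Row("Kimi K2.7 Code", date(2026, 6, 18)),
]
print([r.model for r in rank(rows)])
# ['GLM-5.2', 'Kimi K2.6', 'Kimi K2 Instruct 0905', 'Kimi K2.7 Code', 'Gemini 3.5 Pro']
```

Python's sort is stable even with `reverse=True`, so models with equal scores keep their catalog order. Note that the rule compares raw percentages across two different benchmarks, which is one reason the ranking is only a starting point.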

A model near the top is not automatically “the best” for your use case: price, context window, access model, and whether a number is verified all matter. Use this to narrow the field, then open a model page for its full record.

Benchmark values are included only where the catalog has a sourced claim. Vendor figures remain self-reported until independently verified.

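| Model | Lab | Access | Context | Price (in / out) | GPQA Diamond | SWE-bench Verified | Released | Source |
|---|---|---|---|---|---|---|---|---|
| GLM-5.2 | Z.ai | MIT | 1M | - | 91.2% | - | Jun 17, 2026 | source |
| Kimi K2.6 | Moonshot | Modified MIT | 256K | - | 90.5% | 80.2% | Mar 30, 2026 | source |
| Claude Opus 4.8 | Anthropic | Proprietary | 500K | $15 / $75 | 89% | 81.5% | May 28, 2026 | source |
| GPT-5.6 | OpenAI | Proprietary | 1.5M | - | 88.1% | 76.4% | Jun 9, 2026 | source |
| Kimi K2.5 | Moonshot | Modified MIT | 256K | - | 87.6% | 76.8% | Jan 27, 2026 | source |
| GPT-5.4 | OpenAI | Proprietary | 400K | - | 86.5% | 74.9% | Mar 5, 2026 | source |
| GLM-5.1 | Z.ai | MIT | - | - | 86.2% | - | Apr 8, 2026 | source |
| GLM-5 | Z.ai | MIT | - | - | 86% | 77.8% | Feb 11, 2026 | source |
| Grok 4.3 | xAI | Proprietary | 1M | $1.25 / $2.5 | 86% | - | May 6, 2026 | source |
| GLM-4.7 | Z.ai | MIT | - | - | 85.7% | 73.8% | Jan 8, 2026 | source |
| Kimi K2 Thinking | Moonshot | Modified MIT | 256K | - | 84.5% | 71.3% | Nov 6, 2025 | source |
| DeepSeek-V3.2 | DeepSeek | MIT | 128K | - | 82.4% | 70% | Dec 1, 2025 | source |
| DeepSeek-R1-0528 | DeepSeek | MIT | 128K | - | 81% | 57.6% | May 28, 2025 | source |
| Kimi K2 Instruct | Moonshot | Modified MIT | 128K | - | 75.1% | 65.8% | Jul 11, 2025 | source |
| DeepSeek-R1 | DeepSeek | MIT | 128K | - | 71.5% | 49.2% | Jan 20, 2025 | source |
| MiniMax-M1-80k | MiniMax | Apache-2.0 | 1M | - | 70% | 56% | Jun 16, 2025 | source |
| Kimi K2 Instruct 0905 | Moonshot | Modified MIT | 256K | - | - | 69.2% | Sep 5, 2025 | source |
| Kimi K2.7 Code | Moonshot | Modified MIT | 262K | $0.95 / $4 | - | - | Jun 18, 2026 | source |
| MiniMax-M3 | MiniMax | MiniMax Community License | 1M | - | - | - | Jun 16, 2026 | source |
| Claude Fable 5 | Anthropic | Proprietary | - | - | - | - | Jun 9, 2026 | source |
| Nemotron 3 Ultra 550B-A55B | NVIDIA | Nemotron Open Model License | 1M | - | - | - | Jun 4, 2026 | source |
| MiniMax-M2.7 | MiniMax | MiniMax Model License | - | - | - | - | May 26, 2026 | source |
| Gemini 3.5 Pro | DeepMind | Proprietary | 2M | - | - | - | May 19, 2026 | source |
| Qwen3.7-Max | Qwen | Proprietary | 1M | $2.5 / $7.5 | - | - | May 19, 2026 | source |
| ERNIE 5.1 | Baidu | Proprietary | 128K | $0.59 / $2.65 | - | - | May 8, 2026 | source |
| DeepSeek V4-Pro | DeepSeek | MIT | 1M | - | - | - | Apr 24, 2026 | source |
| Claude Mythos | Anthropic | Proprietary | - | - | - | - | Apr 7, 2026 | source |
| Nemotron 3 Super 120B-A12B | NVIDIA | Nemotron Open Model License | 1M | - | - | - | Mar 16, 2026 | source |
| Qwen3.5-397B | Qwen | Apache-2.0 | 1M | - | - | - | Feb 20, 2026 | source |
| Gemini 3.1 Pro | DeepMind | Proprietary | 2M | - | - | - | Feb 19, 2026 | source |
| Claude Opus 4.6 | Anthropic | Proprietary | 200K | $15 / $75 | - | - | Feb 5, 2026 | source |
| Mistral Large 3 | Mistral | Mistral Research / Commercial | 256K | - | - | - | Dec 2, 2025 | source |
| DeepSeek-V3.2-Speciale | DeepSeek | MIT | 128K | - | - | - | Dec 1, 2025 | source |
| GLM-4.6 | Z.ai | MIT | 200K | - | - | - | Sep 30, 2025 | source |
| GLM-4.5 | Z.ai | MIT | 128K | - | - | - | Jul 28, 2025 | source |
| Llama 4 Maverick | Meta | Llama 4 Community License | 1M | - | - | - | Apr 5, 2025 | source |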

Frequently asked questions

What counts as a frontier model here?

A frontier model is one a lab positions at or near the capability ceiling of the field at release — typically its flagship. LLM Releases tags this with an explicit frontier flag rather than inferring it from benchmark scores, so a model can be frontier-class even when its numbers are not yet independently verified.

Why are some benchmark cells empty?

A benchmark cell shows a dash (-) when the catalog has no sourced claim for that model on that benchmark. We do not copy numbers across benchmarks or fill gaps with estimates, so a dash means 'not reported with a source', not 'scored zero'.

Are these scores verified?

Most published figures are self-reported by the vendor under conditions they choose. We record them as claims and label them self-reported until an independent evaluation confirms the result. Treat the ranking as a starting point, then follow the source link.
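The labeling rule in that answer amounts to a record whose status is derived from an optional verification field. A minimal sketch, assuming a hypothetical `BenchmarkClaim` shape; the field names, the placeholder URL, and the evaluator name are illustrative and not the catalog's actual schema:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class BenchmarkClaim:
    # Hypothetical catalog entry; field names are illustrative only.
    model: str
    benchmark: str                     # e.g. "GPQA Diamond"
    score: float                       # percentage as published
    source_url: str                    # where the figure was published
    verified_by: Optional[str] = None  # independent evaluator, if any

    @property
    def label(self) -> str:
        # A vendor figure stays "self-reported" until an independent
        # evaluation confirms it.
        if self.verified_by is None:
            return "self-reported"
        return f"verified ({self.verified_by})"


# Placeholder URL and evaluator name, for illustration only.
claim = BenchmarkClaim("GPT-5.6", "GPQA Diamond", 88.1, "https://example.com/gpt-5.6")
print(claim.label)  # self-reported

# Confirmation yields a new record instead of mutating the original claim.
confirmed = replace(claim, verified_by="example-evaluator")
print(confirmed.label)  # verified (example-evaluator)
```

Deriving the label from `verified_by` rather than storing it separately keeps the status from drifting out of sync with the evidence behind it.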