Frontier model leaderboard

Frontier-class models ranked by their strongest tracked benchmark, alongside context window, API price, and how each one can be accessed.

This is a reading aid, not a single-number ranking. Frontier labs report different benchmarks, so the table shows the two evals that appear most consistently across flagships: GPQA Diamond (graduate-level science reasoning) and SWE-bench Verified (resolving real GitHub issues). Models are ordered by the higher of their two scores.

A model near the top is not automatically “the best” for your use case: price, context window, license or access terms, and whether a number has been independently verified all matter. Use this to narrow the field, then open a model page for its full record.

Benchmark values are included only where the catalog has a sourced claim. Vendor figures remain self-reported until independently verified.
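
The ordering rule described above (rank by the higher of the two scores, with unsourced values treated as missing rather than zero) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Row` type and the `strongest` and `rank` helpers are hypothetical names, not part of the catalog, and the catalog's actual tie-breaking or secondary criteria are not documented here.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical row type; None means 'no sourced claim', not zero."""
    model: str
    gpqa: Optional[float]       # GPQA Diamond, percent
    swe_bench: Optional[float]  # SWE-bench Verified, percent


def strongest(row: Row) -> Optional[float]:
    """Return the higher of the two scores, ignoring missing ones."""
    scores = [s for s in (row.gpqa, row.swe_bench) if s is not None]
    return max(scores) if scores else None


def rank(rows: list[Row]) -> list[Row]:
    """Sort by strongest score, descending; rows with no scores go last."""
    def key(row: Row):
        best = strongest(row)
        # First element pushes score-less rows to the end;
        # second element sorts the rest from high to low.
        return (best is None, -(best or 0.0))
    return sorted(rows, key=key)


# A few values copied from the table for illustration.
rows = [
    Row("Claude Opus 4.8", 89.0, 81.5),
    Row("GLM-5.2", 91.2, None),
    Row("Kimi K2 Instruct 0905", None, 69.2),
    Row("Claude Fable 5", None, None),
]
ranked = [r.model for r in rank(rows)]
```

Because missing scores are kept as `None` instead of being coerced to zero, a model with one reported benchmark is ranked on that benchmark alone, and a model with none simply sorts after every scored model.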

Model	Lab	Access	Context	Price (in / out)	GPQA Diamond	SWE-bench Verified	Released	Source
Kimi K2.6	Moonshot	Modified MIT	256K	—	90.5%	80.2%	Mar 30, 2026	source
GLM-5.2	Z.ai	MIT	1M	—	91.2%	-	Jun 17, 2026	source
Claude Opus 4.8	Anthropic	Proprietary	500K	$15 / $75	89%	81.5%	May 28, 2026	source
GPT-5.6	OpenAI	Proprietary	1.5M	—	88.1%	76.4%	Jun 9, 2026	source
Kimi K2.5	Moonshot	Modified MIT	256K	—	87.6%	76.8%	Jan 27, 2026	source
GPT-5.4	OpenAI	Proprietary	400K	—	86.5%	74.9%	Mar 5, 2026	source
GLM-5	Z.ai	MIT	—	—	86%	77.8%	Feb 11, 2026	source
GLM-4.7	Z.ai	MIT	—	—	85.7%	73.8%	Jan 8, 2026	source
GLM-5.1	Z.ai	MIT	—	—	86.2%	-	Apr 8, 2026	source
Grok 4.3	xAI	Proprietary	1M	$1.25 / $2.5	86%	-	May 6, 2026	source
Kimi K2 Thinking	Moonshot	Modified MIT	256K	—	84.5%	71.3%	Nov 6, 2025	source
DeepSeek-V3.2	DeepSeek	MIT	128K	—	82.4%	70%	Dec 1, 2025	source
DeepSeek-R1-0528	DeepSeek	MIT	128K	—	81%	57.6%	May 28, 2025	source
Kimi K2 Instruct	Moonshot	Modified MIT	128K	—	75.1%	65.8%	Jul 11, 2025	source
DeepSeek-R1	DeepSeek	MIT	128K	—	71.5%	49.2%	Jan 20, 2025	source
MiniMax-M1-80k	MiniMax	Apache-2.0	1M	—	70%	56%	Jun 16, 2025	source
Kimi K2 Instruct 0905	Moonshot	Modified MIT	256K	—	-	69.2%	Sep 5, 2025	source
Kimi K2.7 Code	Moonshot	Modified MIT	262K	$0.95 / $4	-	-	Jun 18, 2026	source
MiniMax-M3	MiniMax	MiniMax Community License	1M	—	-	-	Jun 16, 2026	source
Claude Fable 5	Anthropic	Proprietary	—	—	-	-	Jun 9, 2026	source
Nemotron 3 Ultra 550B-A55B	NVIDIA	Nemotron Open Model License	1M	—	-	-	Jun 4, 2026	source
MiniMax-M2.7	MiniMax	MiniMax Model License	—	—	-	-	May 26, 2026	source
Gemini 3.5 Pro	DeepMind	Proprietary	2M	—	-	-	May 19, 2026	source
Qwen3.7-Max	Qwen	Proprietary	1M	$2.5 / $7.5	-	-	May 19, 2026	source
ERNIE 5.1	Baidu	Proprietary	128K	$0.59 / $2.65	-	-	May 8, 2026	source
DeepSeek V4-Pro	DeepSeek	MIT	1M	—	-	-	Apr 24, 2026	source
Claude Mythos	Anthropic	Proprietary	—	—	-	-	Apr 7, 2026	source
Nemotron 3 Super 120B-A12B	NVIDIA	Nemotron Open Model License	1M	—	-	-	Mar 16, 2026	source
Qwen3.5-397B	Qwen	Apache-2.0	1M	—	-	-	Feb 20, 2026	source
Gemini 3.1 Pro	DeepMind	Proprietary	2M	—	-	-	Feb 19, 2026	source
Claude Opus 4.6	Anthropic	Proprietary	200K	$15 / $75	-	-	Feb 5, 2026	source
Mistral Large 3	Mistral	Mistral Research / Commercial	256K	—	-	-	Dec 2, 2025	source
DeepSeek-V3.2-Speciale	DeepSeek	MIT	128K	—	-	-	Dec 1, 2025	source
GLM-4.6	Z.ai	MIT	200K	—	-	-	Sep 30, 2025	source
GLM-4.5	Z.ai	MIT	128K	—	-	-	Jul 28, 2025	source
Llama 4 Maverick	Meta	Llama 4 Community License	1M	—	-	-	Apr 5, 2025	source

Frequently asked questions

What counts as a frontier model here?

A frontier model is one a lab positions at or near the capability ceiling of the field at release — typically its flagship. LLM Releases tags this with an explicit frontier flag rather than inferring it from benchmark scores, so a model can be frontier-class even when its numbers are not yet independently verified.

Why are some benchmark cells empty?

A cell shows a dash when the catalog has no sourced claim for that model on that benchmark. We do not copy numbers across benchmarks or fill gaps with estimates, so a missing value means “not reported with a source”, not “scored zero”.
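
If you process the table programmatically, the same distinction is worth preserving. The sketch below is one possible way to do it, assuming cells arrive as strings such as `89%` or a dash; `parse_score` and `mean_reported` are hypothetical helper names, and the set of markers treated as missing is an assumption based on how the table is rendered.

```python
from typing import Iterable, Optional

# Markers treated as "no sourced claim" (assumed from the rendered table).
MISSING = {"-", "\u2014", ""}


def parse_score(cell: str) -> Optional[float]:
    """Convert a cell like '89%' to 89.0; return None for a missing value."""
    text = cell.strip()
    if text in MISSING:
        return None
    return float(text.rstrip("%"))


def mean_reported(cells: Iterable[str]) -> Optional[float]:
    """Average only the scores that were actually reported."""
    scores = [s for s in map(parse_score, cells) if s is not None]
    return sum(scores) / len(scores) if scores else None
```

Averaging `["80%", "-", "70%"]` this way uses two values rather than three, so an unreported score does not drag the result down as a zero would.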

Are these scores verified?

Most published figures are self-reported by the vendor under conditions they choose. We record them as claims and label them self-reported until an independent evaluation confirms the result. Treat the ranking as a starting point, then follow the source link.