LLM benchmarks explained

Quick answer

A benchmark is a fixed set of tasks used to compare models on one skill — knowledge, reasoning, math, coding, multimodal understanding, or agentic tool use.

No single benchmark captures “how good” a model is. Read several, check what each one measures, and always note whether a score is vendor-reported or independently verified.

The benchmarks we track

These are the evals that appear most often on the model pages in this catalog:

Benchmark	Category	What it measures
MMLU	Knowledge	Multiple-choice questions across 57 subjects, from history to law. A breadth-of-knowledge check; largely saturated at the frontier.
GPQA Diamond	Reasoning	Graduate-level, “Google-proof” science questions written by domain experts. A harder reasoning signal than MMLU.
AIME	Math	Competition mathematics (American Invitational Mathematics Examination). Tests multi-step quantitative reasoning.
USAMO	Math	Olympiad-level, proof-based mathematics — graded on the written argument, not just the final answer.
SWE-bench Verified	Coding	Resolving real GitHub issues in real repositories, on a human-validated subset. The most cited agentic-coding benchmark.
SWE-bench Pro	Coding	A harder, contamination-resistant SWE-bench variant over longer-horizon, production-grade software tasks.
Terminal-Bench	Agentic	Completing tasks from a command line — installing, configuring, and operating tools in a real shell.
MCP-Atlas	Agentic	Tool use across Model Context Protocol servers and multi-step agent workflows.
OSWorld	Agentic	Computer-use desktop tasks: navigating real applications with mouse and keyboard.
MMMU	Multimodal	College-level questions that require understanding images, diagrams, and charts alongside text.
LMArena Elo	Preference	A crowd-sourced rating from blind, head-to-head human votes — a proxy for perceived quality rather than a fixed task.
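
To give a feel for what a rating gap means in a head-to-head vote, here is a minimal sketch of the textbook Elo expected-score formula. It assumes the conventional 400-point scale; the function name and the example ratings are illustrative, and nothing here claims that LMArena computes its published ratings in exactly this way.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of votes for A against B under the classic Elo model.

    Assumption: the conventional 400-point logistic scale. The arena's own
    rating procedure may differ in its details.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Hypothetical ratings, chosen only for illustration.
even_match = elo_expected_score(1400.0, 1400.0)
fifty_point_lead = elo_expected_score(1450.0, 1400.0)
```

Under this model, equal ratings imply an even split, and a 50-point lead corresponds to the higher-rated model being preferred in roughly 57% of matchups. That is a noticeable but far from decisive edge.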

Self-reported vs verified

Most figures a lab publishes at launch are self-reported: run by the vendor, under conditions it chose. That does not make them wrong, but each one is a claim that has not yet been independently reproduced. We keep self-reported and independently verified numbers separate and never merge one into the other. See the methodology for exactly how this is handled.
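
One way to picture "separate and never merged" is a record that carries its provenance with it. The sketch below is purely illustrative: the `ScoreRecord` fields, the `split_by_provenance` helper, and the sample values are all hypothetical, and they do not describe the catalog's actual schema or methodology.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical record shape; field names are assumptions."""
    model: str
    benchmark: str
    value: float                  # score in percent
    reported_by: str              # who ran the evaluation
    independently_verified: bool  # False means self-reported


def split_by_provenance(
    records: List[ScoreRecord],
) -> Tuple[List[ScoreRecord], List[ScoreRecord]]:
    """Return (self_reported, verified) without averaging or merging them."""
    self_reported = [r for r in records if not r.independently_verified]
    verified = [r for r in records if r.independently_verified]
    return self_reported, verified


# Made-up example data, not real scores.
records = [
    ScoreRecord("example-model", "SWE-bench Verified", 70.0, "ExampleLab", False),
    ScoreRecord("example-model", "SWE-bench Verified", 68.5, "Independent run", True),
]
self_reported, verified = split_by_provenance(records)
```

Keeping the two lists apart means a reader can always see which number came from the vendor and which came from an outside run, rather than a single blended figure.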

Reading a leaderboard safely

A few habits keep benchmark numbers honest:

Compare like with like — the same benchmark, ideally the same harness and effort setting.
Watch for contamination: if a benchmark was public before a model's training data was collected, its test items may have leaked into the training set.
Prefer task-relevant evals: SWE-bench for coding agents, GPQA/AIME for reasoning, MMMU for vision.
Treat a one- or two-point gap as noise; look for consistent leads across several benchmarks.
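
The "one- or two-point gap is noise" habit can be checked with simple arithmetic. The sketch below uses a normal-approximation binomial interval and assumes tasks are independent and that sampling error is the only source of variation. The 500-task size is an assumption, chosen to match the size commonly cited for SWE-bench Verified, and the function name is illustrative.

```python
import math


def score_margin_95(accuracy: float, n_tasks: int) -> float:
    """Half-width of an approximate 95% interval, in percentage points.

    Assumption: each task is an independent pass/fail trial, so the
    standard error is sqrt(p * (1 - p) / n).
    """
    if not 0.0 <= accuracy <= 1.0 or n_tasks <= 0:
        raise ValueError("accuracy must be in [0, 1] and n_tasks positive")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)
    return 1.96 * standard_error * 100.0


# An 80% score on a hypothetical 500-task benchmark.
margin = score_margin_95(0.80, 500)
```

Here the margin comes out at about 3.5 percentage points either side of the score. On a benchmark of that size, two models separated by one or two points are statistically hard to tell apart from a single run.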

Current SWE-bench Verified leaders

Highest sourced SWE-bench Verified claims in the catalog right now:

Model	Lab	SWE-bench Verified	Released
Claude Opus 4.8	Anthropic	81.5%	May 28, 2026
Kimi K2.6	Moonshot	80.2%	Mar 30, 2026
GLM-5	Z.ai	77.8%	Feb 11, 2026
Mistral Medium 3.5	Mistral	77.6%	Mar 18, 2026
Qwen3.6-27B	Qwen	77.2%	May 12, 2026
Kimi K2.5	Moonshot	76.8%	Jan 27, 2026
GPT-5.6	OpenAI	76.4%	Jun 9, 2026
GPT-5.4	OpenAI	74.9%	Mar 5, 2026

For the full picture across reasoning and coding, see the frontier model leaderboard.

Frequently asked questions

What is SWE-bench?

SWE-bench measures whether a model can resolve real GitHub issues by editing a real codebase and passing the project's tests. SWE-bench Verified is a human-validated subset and the most cited benchmark for agentic coding; SWE-bench Pro is a harder, contamination-resistant version.

What does GPQA Diamond test?

GPQA Diamond is a set of graduate-level science questions designed to be hard to answer even with web search. It is used as a reasoning benchmark because it rewards genuine domain understanding over recall.

Why do the same model's scores differ between sites?

Scores vary with prompt format, the number of attempts allowed, whether tools or chain-of-thought are used, and the exact eval harness. That is why LLM Releases records who reported a score and links the source, and labels figures self-reported until an independent run confirms them.
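
To see how much "the number of attempts allowed" can matter, consider the pass@k estimator that is widely used for code-generation evals: given n sampled attempts of which c pass, it estimates the chance that at least one of k attempts succeeds. This is a sketch under that assumption only. It does not claim that any particular benchmark listed here, or this site, reports scores this way, and the sample counts are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from n samples with c passes.

    Uses the combinatorial form 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 sampled attempts, 3 of them pass.
single_attempt = pass_at_k(10, 3, 1)
five_attempts = pass_at_k(10, 3, 5)
```

With these made-up counts the same model scores about 30% when given one attempt and about 92% when given five, which is why two sites can publish very different numbers for the same model without either being wrong.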

Where to go next

Compare open coding models, read about reasoning models, or browse the full model catalog.