Quick answer
A benchmark is a fixed set of tasks used to compare models on one skill — knowledge, reasoning, math, coding, multimodal understanding, or agentic tool use.
No single benchmark captures “how good” a model is. Read several, check what each one measures, and always note whether a score is vendor-reported or independently verified.
The benchmarks we track
These are the evals that appear most often on the model pages in this catalog:
| Benchmark | Category | What it measures |
|---|---|---|
| MMLU | Knowledge | Multiple-choice questions across 57 subjects, from history to law. A breadth-of-knowledge check; largely saturated at the frontier. |
| GPQA Diamond | Reasoning | Graduate-level, “Google-proof” science questions written by domain experts. A harder reasoning signal than MMLU. |
| AIME | Math | Competition mathematics (American Invitational Mathematics Examination). Tests multi-step quantitative reasoning. |
| USAMO | Math | Olympiad-level, proof-based mathematics — graded on the written argument, not just the final answer. |
| SWE-bench Verified | Coding | Resolving real GitHub issues in real repositories, on a human-validated subset. The most cited agentic-coding benchmark. |
| SWE-bench Pro | Coding | A harder, contamination-resistant SWE-bench variant over longer-horizon, production-grade software tasks. |
| Terminal-Bench | Agentic | Completing tasks from a command line — installing, configuring, and operating tools in a real shell. |
| MCP-Atlas | Agentic | Tool use across Model Context Protocol servers and multi-step agent workflows. |
| OSWorld | Agentic | Computer-use desktop tasks: navigating real applications with mouse and keyboard. |
| MMMU | Multimodal | College-level questions that require understanding images, diagrams, and charts alongside text. |
| LMArena Elo | Preference | A crowd-sourced rating from blind, head-to-head human votes — a proxy for perceived quality rather than a fixed task. |
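
To make an Elo-style rating gap easier to interpret, here is a minimal sketch of the classic Elo expected-score formula. It assumes the standard 400-point logistic scale; the arena's actual rating fit may differ in detail, and the ratings used below are hypothetical, not taken from any leaderboard.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head wins for A over B (classic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    base = 1300.0  # hypothetical rating, for illustration only
    for gap in (0, 20, 100):
        p = elo_expected_score(base + gap, base)
        print(f"rating gap {gap:>3}: A expected to win {p:.1%} of votes")
```

Under this formula a 20-point lead corresponds to only a slight edge in blind votes, which is one reason small rating differences should not be over-read.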
Self-reported vs verified
Most figures a lab publishes at launch are self-reported: run by the vendor, under conditions it chose. That is not necessarily wrong, but it remains a claim until someone else reproduces it. We keep self-reported and independently verified numbers separate and never merge one into the other. See the methodology for exactly how this is handled.
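One way to picture that separation is a record that carries its own provenance. The sketch below is illustrative only: `ScoreClaim`, its fields, and `split_claims` are hypothetical names, not the catalog's actual schema, and returning the highest claim of each kind is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class ScoreClaim:
    """One reported benchmark result (hypothetical schema, for illustration)."""
    model: str
    benchmark: str
    score: float        # percentage, 0-100
    reported_by: str    # who ran the eval
    source_url: str     # where the figure was published
    independent: bool   # True only if the run was not done by the vendor


def split_claims(
    claims: Iterable[ScoreClaim], model: str, benchmark: str
) -> Tuple[Optional[float], Optional[float]]:
    """Return (self_reported, verified) for one model and benchmark.

    The two figures live in separate slots and are never averaged or merged.
    If several claims of one kind exist, the highest is kept (an assumption).
    """
    self_reported: Optional[float] = None
    verified: Optional[float] = None
    for c in claims:
        if c.model != model or c.benchmark != benchmark:
            continue
        if c.independent:
            verified = c.score if verified is None else max(verified, c.score)
        else:
            self_reported = (
                c.score if self_reported is None else max(self_reported, c.score)
            )
    return self_reported, verified
```

A model with only a vendor figure comes back as `(score, None)`, so the missing verification stays visible instead of being filled in.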
Reading a leaderboard safely
A few habits keep benchmark numbers honest:
- Compare like with like — the same benchmark, ideally the same harness and effort setting.
- Watch for contamination: if a benchmark was public before a model's training data was collected, its test items may have leaked into the training set.
- Prefer task-relevant evals: SWE-bench for coding agents, GPQA/AIME for reasoning, MMMU for vision.
- Treat a one- or two-point gap as noise; look for consistent leads across several benchmarks.
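To see why a one- or two-point gap is usually noise, here is a rough sketch that treats each task as an independent pass/fail trial and computes the binomial standard error of a score. The 500-task size matches SWE-bench Verified; the function names, the 1.96 threshold, and the example accuracies are illustrative assumptions. Because both models are normally scored on the same tasks, a paired test would be tighter, so this version errs toward calling gaps noise.

```python
import math


def score_standard_error(accuracy: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate, expressed as a fraction."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


def gap_is_distinguishable(
    acc_a: float, acc_b: float, n_tasks: int, z: float = 1.96
) -> bool:
    """Rough check: does the gap exceed z combined standard errors?

    Assumes the two runs are independent samples, which is a simplification.
    """
    se_gap = math.sqrt(
        score_standard_error(acc_a, n_tasks) ** 2
        + score_standard_error(acc_b, n_tasks) ** 2
    )
    return abs(acc_a - acc_b) > z * se_gap


if __name__ == "__main__":
    n = 500  # number of tasks in SWE-bench Verified
    # About 1.8 percentage points for a score near 80%.
    print(f"standard error at 80%: {100 * score_standard_error(0.80, n):.1f} points")
    print("80% vs 78% distinguishable:", gap_is_distinguishable(0.80, 0.78, n))
    print("80% vs 70% distinguishable:", gap_is_distinguishable(0.80, 0.70, n))
```

With a standard error of nearly two points on a single score, a two-point lead on one benchmark says little on its own; a lead that repeats across several benchmarks is stronger evidence.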
Current SWE-bench Verified leaders
Highest sourced SWE-bench Verified claims in the catalog right now:
| Model | Lab | SWE-bench Verified | Released |
|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 81.5% | May 28, 2026 |
| Kimi K2.6 | Moonshot | 80.2% | Mar 30, 2026 |
| GLM-5 | Z.ai | 77.8% | Feb 11, 2026 |
| Mistral Medium 3.5 | Mistral | 77.6% | Mar 18, 2026 |
| Qwen3.6-27B | Qwen | 77.2% | May 12, 2026 |
| Kimi K2.5 | Moonshot | 76.8% | Jan 27, 2026 |
| GPT-5.6 | OpenAI | 76.4% | Jun 9, 2026 |
| GPT-5.4 | OpenAI | 74.9% | Mar 5, 2026 |
For the full picture across reasoning and coding, see the frontier model leaderboard.
Frequently asked questions
What is SWE-bench?
SWE-bench measures whether a model can resolve real GitHub issues by editing a real codebase and passing the project's tests. SWE-bench Verified is a human-validated subset and the most cited benchmark for agentic coding; SWE-bench Pro is a harder, contamination-resistant variant.
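As a rough illustration of what "passing the project's tests" can mean in practice, the sketch below counts a task as resolved only when the tests that the fix should turn green pass and the previously passing tests still pass. The `fail_to_pass` and `pass_to_pass` parameters mirror the field names in the SWE-bench dataset; `is_resolved`, `resolved_rate`, and the test names are hypothetical helpers for this example, not the official harness.

```python
from typing import Dict, Iterable, List


def is_resolved(
    test_passed: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """True if every required test passed after the model's patch was applied.

    A test missing from `test_passed` is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_passed.get(name, False) for name in required)


def resolved_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty list."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    # Hypothetical results for one task after applying a patch.
    results = {"test_issue_fixed": True, "test_existing_behaviour": True}
    print(is_resolved(results, ["test_issue_fixed"], ["test_existing_behaviour"]))
```

The headline percentage is then just the share of tasks for which this check succeeds, so a patch that fixes the issue but breaks an existing test does not count.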
What does GPQA Diamond test?
GPQA Diamond is a set of graduate-level science questions designed to be hard to answer even with web search. It is used as a reasoning benchmark because it rewards genuine domain understanding over recall.
Why do the same model's scores differ between sites?
Scores vary with prompt format, the number of attempts allowed, whether tools or chain-of-thought are used, and the exact eval harness. That is why LLM Releases records who reported a score and links the source, and labels figures self-reported until an independent run confirms them.
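The attempt budget alone can move a score a long way. A minimal sketch, assuming a site reports pass@k computed with the unbiased estimator commonly used for code-generation evals (whether a given leaderboard does so is an assumption, and the sample counts below are made up for illustration):

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single task.

    n: samples generated, c: samples that were correct, k: attempts allowed.
    Computes 1 - C(n - c, k) / C(n, k), the chance that at least one of
    k samples drawn without replacement is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # math.comb returns 0 when k > n - c, which yields 1.0 as required.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    n, c = 10, 3  # hypothetical: 3 of 10 samples solved the task
    for k in (1, 5):
        print(f"pass@{k}: {pass_at_k(n, c, k):.1%}")
```

The same underlying samples give roughly 30% at one attempt and over 90% at five, so two sites can both be right while publishing very different numbers.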
Where to go next
Compare open coding models, read about reasoning models, or browse the full model catalog.