LLM benchmarks explained

The scores in a model release tell you something — but only if you know what each benchmark measures and how much to trust a single number. This guide covers the evals we track and how to read them.

Quick answer

A benchmark is a fixed set of tasks used to compare models on one skill — knowledge, reasoning, math, coding, multimodal understanding, or agentic tool use.

No single benchmark captures “how good” a model is. Read several, check what each one measures, and always note whether a score is vendor-reported or independently verified.

The benchmarks we track

These are the evals that appear most often on the model pages in this catalog:

Benchmark | Category | What it measures
MMLU | Knowledge | Multiple-choice questions across 57 subjects, from history to law. A broad breadth-of-knowledge check; largely saturated at the frontier.
GPQA Diamond | Reasoning | Graduate-level, “Google-proof” science questions written by domain experts. A harder reasoning signal than MMLU.
AIME | Math | Competition mathematics (American Invitational Mathematics Examination). Tests multi-step quantitative reasoning.
USAMO | Math | Olympiad-level, proof-based mathematics, graded on the written argument, not just the final answer.
SWE-bench Verified | Coding | Resolving real GitHub issues in real repositories, on a human-validated subset. The most cited agentic-coding benchmark.
SWE-bench Pro | Coding | A harder, contamination-resistant SWE-bench variant over longer-horizon, production-grade software tasks.
Terminal-Bench | Agentic | Completing tasks from a command line: installing, configuring, and operating tools in a real shell.
MCP-Atlas | Agentic | Tool use across Model Context Protocol servers and multi-step agent workflows.
OSWorld | Agentic | Computer-use desktop tasks: navigating real applications with mouse and keyboard.
MMMU | Multimodal | College-level questions that require understanding images, diagrams, and charts alongside text.
LMArena Elo | Preference | A crowd-sourced rating from blind, head-to-head human votes; a proxy for perceived quality rather than a fixed task.
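
Unlike the other rows, an Elo rating is not a percentage of tasks solved, so the gap between two ratings needs translating before it means anything. The sketch below uses the textbook Elo expected-score formula on a 400-point logistic scale; that the arena's published ratings follow exactly this formula is an assumption, and the ratings used are hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score for A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, for illustration only.
baseline = 1300
for gap in (0, 20, 100):
    share = expected_win_rate(baseline + gap, baseline)
    print(f"rating gap {gap:>3}: A preferred in about {share:.1%} of votes")
```

Under this formula a 20-point lead corresponds to being preferred in roughly 53% of head-to-head votes, which is why small rating gaps say little about which model a given user would pick.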

Self-reported vs verified

Most figures a lab publishes at launch are self-reported: run by the vendor, under conditions it chose. That is not necessarily wrong, but it is a claim, not a fact. We keep self-reported and independently verified numbers separate and never merge one into the other. See the methodology for exactly how this is handled.
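
One way to keep the two kinds of figure apart is to make provenance a required field on every score and never aggregate across it. The sketch below is illustrative only: the `Score` record, its field names, and the sample values are hypothetical and are not a description of how this catalog stores its data.

```python
from dataclasses import dataclass
from typing import Literal

Provenance = Literal["self_reported", "verified"]


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    value: float            # percentage points
    provenance: Provenance  # who ran the eval, never left implicit
    source: str             # hypothetical field: where the number was published


def split_by_provenance(scores: list) -> dict:
    """Group scores by provenance so the two kinds are never averaged together."""
    groups = {"self_reported": [], "verified": []}
    for score in scores:
        groups[score.provenance].append(score)
    return groups


# Hypothetical entries, for illustration only.
scores = [
    Score("example-model", "SWE-bench Verified", 70.0, "self_reported", "vendor launch post"),
    Score("example-model", "SWE-bench Verified", 68.5, "verified", "independent rerun"),
]
groups = split_by_provenance(scores)
print(len(groups["self_reported"]), len(groups["verified"]))
```

Keeping both records side by side, rather than overwriting the vendor figure once a verified one arrives, preserves the gap between them as information in its own right.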

Reading a leaderboard safely

A few habits keep benchmark numbers honest:

  • Compare like with like — the same benchmark, ideally the same harness and effort setting.
  • Watch for contamination: if a benchmark was public before a model's training data was collected, its test items may have leaked into the training set.
  • Prefer task-relevant evals: SWE-bench for coding agents, GPQA/AIME for reasoning, MMMU for vision.
  • Treat a one- or two-point gap as noise; look for consistent leads across several benchmarks.
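
The last habit has a simple statistical basis: a pass rate measured on a few hundred tasks carries sampling error of its own. The sketch below treats each task as an independent pass/fail trial and uses the binomial standard error. The 500-task benchmark size and the scores are assumptions chosen for illustration, and treating the two runs as independent is a simplification, since models scored on the same tasks are really paired.

```python
import math


def standard_error(score: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate (score given as a 0-1 fraction)."""
    return math.sqrt(score * (1.0 - score) / n_tasks)


def gap_is_noise(score_a: float, score_b: float, n_tasks: int, z: float = 1.96) -> bool:
    """True if the gap lies within z combined standard errors.

    Simplification: the two runs are treated as independent samples.
    """
    combined = math.sqrt(
        standard_error(score_a, n_tasks) ** 2 + standard_error(score_b, n_tasks) ** 2
    )
    return abs(score_a - score_b) <= z * combined


# Hypothetical scores on an assumed 500-task benchmark.
print(round(100 * standard_error(0.80, 500), 2))  # standard error in percentage points
print(gap_is_noise(0.80, 0.78, 500))              # two-point gap
print(gap_is_noise(0.80, 0.60, 500))              # twenty-point gap
```

With these assumptions a score of 80% has a standard error of about 1.8 points, so a two-point gap is well inside the noise while a twenty-point gap is not.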

Current SWE-bench Verified leaders

Highest sourced SWE-bench Verified claims in the catalog right now:

Model | Lab | SWE-bench Verified | Released
Claude Opus 4.8 | Anthropic | 81.5% | May 28, 2026
Kimi K2.6 | Moonshot | 80.2% | Mar 30, 2026
GLM-5 | Z.ai | 77.8% | Feb 11, 2026
Mistral Medium 3.5 | Mistral | 77.6% | Mar 18, 2026
Qwen3.6-27B | Qwen | 77.2% | May 12, 2026
Kimi K2.5 | Moonshot | 76.8% | Jan 27, 2026
GPT-5.6 | OpenAI | 76.4% | Jun 9, 2026
GPT-5.4 | OpenAI | 74.9% | Mar 5, 2026

For the full picture across reasoning and coding, see the frontier model leaderboard.

Frequently asked questions

What is SWE-bench?

SWE-bench measures whether a model can resolve real GitHub issues by editing a real codebase and passing the project's tests. SWE-bench Verified is a human-validated subset; SWE-bench Pro is a harder, contamination-resistant version. Of these, SWE-bench Verified is the most cited benchmark for agentic coding.
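
A rough sketch of how a resolve rate can be tallied, assuming each task reports two groups of test outcomes: tests the fix should turn from failing to passing, and existing tests that must keep passing. The function names, the data layout, and the sample outcomes are hypothetical; this is not the official harness.

```python
def is_resolved(fail_to_pass: dict, pass_to_pass: dict) -> bool:
    """A task counts as resolved only if every test in both groups passes."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())


def resolve_rate(task_results: list) -> float:
    """Percentage of tasks resolved.

    task_results is a list of (fail_to_pass, pass_to_pass) pairs, each a
    mapping from test name to whether it passed after the model's patch.
    """
    if not task_results:
        raise ValueError("no tasks to score")
    resolved = sum(is_resolved(f2p, p2p) for f2p, p2p in task_results)
    return 100.0 * resolved / len(task_results)


# Hypothetical outcomes for four tasks.
tasks = [
    ({"test_fix": True}, {"test_existing": True}),    # resolved
    ({"test_fix": True}, {"test_existing": False}),   # fixes the issue, breaks old behavior
    ({"test_fix": False}, {"test_existing": True}),   # issue not fixed
    ({"test_a": True, "test_b": True}, {}),           # resolved
]
print(resolve_rate(tasks))
```

Scoring is all-or-nothing per task, so a patch that fixes the issue but breaks an existing test earns no credit.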

What does GPQA Diamond test?

GPQA Diamond is a set of graduate-level science questions designed to be hard to answer even with web search. It is used as a reasoning benchmark because it rewards genuine domain understanding over recall.

Why do the same model's scores differ between sites?

Scores vary with prompt format, the number of attempts allowed, whether tools or chain-of-thought are used, and the exact eval harness. That is why LLM Releases records who reported a score and links the source, and labels figures self-reported until an independent run confirms them.
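
The effect of the number of attempts can be made concrete with the pass@k estimator commonly used for code benchmarks: given n sampled attempts of which c passed, it estimates the chance that at least one of k attempts succeeds. Whether a given vendor reports pass@1, pass@k, or something else is an assumption to check against the source; the counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 sampled attempts, 3 of them passed.
for k in (1, 5):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

The same underlying attempts give 30% at one try and over 90% at five, so two sites can both be reporting honestly and still disagree by a wide margin.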

Where to go next

Compare open coding models, read about reasoning models, or browse the full model catalog.