MMLU — A procurement-grade audit of the most-cited LLM benchmark.
Composite 83 / 100. Gold tier — not Platinum. World-class format and license clarity. Crowdsourced labels with no disclosed QC, estimated 25–30% label noise. Heavy contamination across the open training-data ecosystem — the score every model card reports is partially memorized, not earned. Open methodology, signed cert, recourse documented.
evaluation profile: 81 · classification profile: 82 · RAG profile: 85
What we audited
| Dataset | cais/mmlu |
|---|---|
| Full name | Massive Multitask Language Understanding |
| Size | 15,908 multiple-choice questions across 57 subjects (test + dev + validation splits) |
| Modality | Text (English), multiple-choice question answering |
| License | MIT — commercial-friendly, unambiguous |
| Source | Crowdsourced from online practice exams (GRE, AP, USMLE, MBE, etc.) |
| Maintainer | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt (UC Berkeley) |
| Paper | arXiv:2009.03300 |
| Year released | 2020 |
| Distribution format | Parquet via HuggingFace; CSV original release |
| Citations (Google Scholar) | 4,200+ — the most-cited LLM benchmark in existence |
The headline finding
MMLU is the benchmark that taught the LLM industry how to compare models. Almost every modern frontier model release — GPT-4, Claude, Gemini, Llama, Mistral — reports MMLU as one of the first two or three numbers in the announcement. It earned that position. The benchmark is genuinely well-constructed: 57 academic subjects, drawn from real exams, with the kind of breadth that surfaces capability gaps a narrow benchmark would miss.
That said, the way the field uses MMLU today is not the way the benchmark was designed to be used. Two procurement-relevant gaps separate the headline number from the signal a model-risk team needs.
- Contamination is widespread and quantifiable. MMLU questions appear verbatim in Common Crawl, in derivative datasets, in scraped study-aid sites, in cached question banks. Any modern web-trained model has plausibly seen a non-trivial fraction of MMLU during pretraining. The community has known this for years — it's why MMLU-Pro exists. But MMLU itself still gets reported in every model card without a contamination caveat. We score contamination cleanliness at 35 — the lowest dimension on this audit, and the dimension that matters most for using MMLU as a true capability signal.
- The labels are noisier than the way the score is reported implies. MMLU was crowdsourced from public practice-exam sites without a disclosed QC protocol. Multiple independent analyses (Gema et al. 2024, Wang et al. 2024) have found 25–30%+ label noise: questions with no correct answer among the four options, questions with multiple defensible answers, and factually incorrect "correct" answers. When a model card claims "87% on MMLU", that 87% is measured against an answer key carrying that much noise, so part of the score reflects agreement with flawed labels rather than capability. A comparison such as "GPT-X scored 87 vs Claude-Y scored 88" carries far less signal than its apparent precision suggests.
Why this audit exists. Procurement teams attaching MMLU scores to model risk paperwork are inheriting two assumptions that no model card states: (1) the score is contamination-clean, (2) the labels are well-validated. Neither is true in 2026. The LQS framework standardizes these dimensions so that "MMLU = 83 Gold" carries the same meaning across buyers — and so the conditions under which the score is and isn't trustworthy are explicit, not folkloric.
Dimension-by-dimension reasoning
Format compliance — 95
95 / 100
Clean parquet via the HuggingFace mirror, drop-in compatible with the datasets library. Original CSV release still available. Schema is dense: question, four options, single-letter answer key, subject tag. Common multiple-choice evaluation harnesses (EleutherAI's lm-evaluation-harness, OpenAI's evals) consume MMLU without custom code. Deduction is purely for the original CSV's lack of a published schema file at release time.
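The schema described above is small enough to check mechanically before a harness run. The following is a minimal sketch, assuming the field names `question`, `subject`, `choices` and `answer` used on the HuggingFace mirror; those names, and the choice to accept either a 0-3 index or an A-D letter as the answer key, are assumptions for illustration rather than a published schema.

```python
from typing import Any

LETTERS = ("A", "B", "C", "D")


def validate_record(rec: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record is usable."""
    problems: list[str] = []

    # Free-text fields must be present and non-blank.
    for field in ("question", "subject"):
        value = rec.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} missing or empty")

    # Exactly four answer options, all strings.
    choices = rec.get("choices")
    if (not isinstance(choices, list) or len(choices) != 4
            or not all(isinstance(c, str) for c in choices)):
        problems.append("choices must be a list of four strings")

    # Answer key: assumed to be either an index 0-3 or a letter A-D.
    answer = rec.get("answer")
    is_index = (isinstance(answer, int) and not isinstance(answer, bool)
                and 0 <= answer <= 3)
    is_letter = answer in LETTERS
    if not (is_index or is_letter):
        problems.append("answer must be 0-3 or A-D")

    return problems


example = {
    "question": "What is 2 + 2?",
    "subject": "elementary_mathematics",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
}
print(validate_record(example))
```

Records loaded from either the parquet or the CSV release can be passed through the same function once they are converted to dictionaries.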
License clarity — 95
95 / 100
MIT license. Permissive, commercial-friendly, unambiguous. No attribution chain to negotiate, no inherited license from upstream sources: questions are paraphrased or sourced from public-domain or freely-redistributable exam material. This is the model other benchmark releases should follow. The 5-point deduction is for the absence of an explicit Terms-of-Use addendum addressing the contamination question, i.e., whether maintainers expect downstream users to disclose if their training corpus may have included MMLU questions.
Maintainer reputation — 92
92 / 100
The Hendrycks-led group at UC Berkeley has an exceptional publication record: MMLU itself (also distributed as Hendrycks-Test), MATH, HumanEval contributions, HARM, the Pile contributions. Sustained engagement with the safety / evaluation community. Responsive to errata. Follow-up work (MMLU-Pro from TIGER-Lab, MMLU-Redux from Gema et al.) documents the original's limitations openly. Deduction is small and reflects the absence of an active maintenance cadence on the original MMLU repo: it's stable, not maintained.
Completeness (documentation) — 92
92 / 100
Original paper documents the construction methodology, subject distribution, and intended use. HF dataset card lists splits, subjects, and example questions. Subject-level results from the paper are reproducible. Missing: a published label-construction protocol (how were correct answers verified?) and a published QC log. These absences are why later contamination and label-noise audits had to be done externally.
Reproducibility — 85
85 / 100
The benchmark itself is fully reproducible: the questions and answers are public, the evaluation methodology is described in the paper, multiple open-source harnesses produce consistent scores. The deduction is for the upstream sources. MMLU was assembled from "freely-available online sources", but the exact source URLs are not catalogued in the release. Anyone wanting to verify question provenance or check whether a specific question is paraphrased from a copyrighted exam has to reconstruct that mapping themselves.
Subject class balance — 75
75 / 100
57 subjects, organized into four meta-categories (STEM, humanities, social sciences, other). Per-subject question counts vary from ~100 to ~600. Skews modestly toward STEM (formal logic, college math, abstract algebra, etc.). Some subjects are conspicuously thin: machine learning has 112 questions; college medicine has 173. For per-subject capability comparison, the smaller subjects produce ±5% confidence intervals at typical model accuracy. Acceptable for a breadth benchmark; not adequate as a per-subject diagnostic.
Deduplication / uniqueness — 68
68 / 100
No published deduplication pass. Independent re-scrapes of online practice-exam sites against the MMLU corpus find a non-trivial fraction of near-duplicates: paraphrased or alternate-numbered versions of the same underlying question. Within MMLU itself, internal duplication is rare. The concern is between MMLU and the rest of the open web: the same question often exists in multiple variants, each scrapable, which feeds the contamination story below.
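The "alternate-numbered" variants mentioned above are the easiest kind of duplicate to catch. A minimal sketch, assuming a simple normalization (lowercasing, stripping a leading item number, collapsing punctuation) followed by hashing; the normalization rules and function names are illustrative assumptions, not a deduplication pass used by the maintainers or by this audit. Paraphrases would need fuzzy matching on top of this.

```python
import hashlib
import re
from collections import defaultdict


def normalize(question: str) -> str:
    """Lowercase, drop a leading item number such as '12.' or 'Q3)', collapse punctuation."""
    text = question.lower().strip()
    # Remove prefixes like "12.", "q3)", "question 7:" at the start of the item.
    text = re.sub(r"^(q(uestion)?\s*)?\d+\s*[.):]\s*", "", text)
    # Replace every run of non-alphanumeric characters with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def duplicate_groups(questions: list[str]) -> list[list[int]]:
    """Group indices of questions that are identical after normalization."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for idx, question in enumerate(questions):
        key = hashlib.sha1(normalize(question).encode("utf-8")).hexdigest()
        buckets[key].append(idx)
    return [idxs for idxs in buckets.values() if len(idxs) > 1]


sample = [
    "12. What is the capital of France?",
    "Q3) what is the  capital of France",
    "Name the capital of Spain.",
]
print(duplicate_groups(sample))
```

The prefix rule is deliberately narrow, but it will still misfire on questions that legitimately begin with a number followed by a period, so flagged groups are candidates for review rather than confirmed duplicates.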
Validation / QC process — 68
68 / 100
The paper describes a manual cleaning pass but does not publish the QC protocol, inter-annotator agreement, or rejection rate. There is no published double-annotation pass, no adjudication record, no version history of correction edits. Compare: ImageNet has a multi-rater consensus protocol with published κ; SuperGLUE publishes a multi-pass adjudication record. MMLU's labels are best-effort but not procurement-grade auditable. This is the second-largest concern after contamination.
Size adequacy — 65
65 / 100
15,908 questions total is small for a benchmark spanning 57 subjects. The aggregate score has tight confidence intervals (±0.4% at typical accuracies), but per-subject scores have noticeably wider ones: 95% intervals of ±4–6% on smaller subjects. For aggregate comparisons the sample size is adequate once the gap clears that interval, roughly a point or more; a claim like "Model A beat Model B by 0.3 points on MMLU" sits inside the interval and cannot be resolved on sample size alone. For "Model A has stronger college medicine capability than Model B", the sample size is borderline. Procurement teams reading per-subject MMLU breakdowns should treat them as directional, not definitive.
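The interval widths quoted above follow from binomial sampling error. A minimal sketch of that arithmetic, assuming a normal-approximation (Wald) interval and an illustrative accuracy of 87%; the exact figures depend on the accuracy assumed and the interval method, so they will not match the quoted numbers to the decimal.

```python
import math


def ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval, in percentage points."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 100.0 * z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Question counts taken from this report; 0.87 is an assumed accuracy.
for label, n in [
    ("all questions", 15908),
    ("college medicine", 173),
    ("machine learning", 112),
]:
    print(f"{label} (n={n}): +/- {ci_half_width(0.87, n):.1f} points")
```

The half-width shrinks with the square root of the question count, which is why a subject with roughly a hundred items is an order of magnitude less precise than the aggregate.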
Label noise — 58
58 / 100
Two independent re-annotation studies in 2024 found that 25–30%+ of MMLU questions have at least one of: (a) no correct answer among the four options, (b) multiple defensible correct answers, (c) a factually incorrect "correct" answer, (d) ambiguous wording that defeats single-answer scoring. MMLU-Redux (Gema et al. 2024) re-annotates a subset of MMLU specifically to document these errors. Importantly, this means a model scoring 87% on MMLU is, in expectation, scoring 87% on a question pool where ~28% of items have label issues. Only about 70–75% of items carry a clean label, so that share, not 100%, is the ceiling on accuracy that can be verified against noise-free labels. Score gaps below ~3–5 percentage points between models are smaller than the label-noise floor.
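One way to see why small score gaps fall under the noise floor is to split the reported score into a clean-label part and a flawed-label part. A minimal sketch under a simplifying assumption: flawed items are scored purely by how often the model happens to match the published key. The 0.28 noise rate comes from the text above; the match rates are hypothetical.

```python
def observed_accuracy(clean_accuracy: float, noise_rate: float,
                      noisy_match_rate: float) -> float:
    """Expected reported score when a share of items has a flawed answer key.

    clean_accuracy:   accuracy on correctly labeled items
    noise_rate:       share of items with label issues
    noisy_match_rate: how often the model matches the key on flawed items
    """
    for name, value in (("clean_accuracy", clean_accuracy),
                        ("noise_rate", noise_rate),
                        ("noisy_match_rate", noisy_match_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return (1.0 - noise_rate) * clean_accuracy + noise_rate * noisy_match_rate


def noise_driven_gap(noise_rate: float, match_rate_a: float,
                     match_rate_b: float) -> float:
    """Score gap, in points, between two models with identical clean accuracy."""
    return 100.0 * noise_rate * abs(match_rate_a - match_rate_b)


# Hypothetical: both models are equally capable on clean items, but one
# matches the flawed keys 60% of the time and the other 45%.
print(noise_driven_gap(0.28, 0.60, 0.45))
```

Under these assumed inputs, two models of equal capability on clean items can differ by several reported points purely through their behavior on flawed items.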
Label density / rationale — 52
52 / 100
One label per question (the letter A/B/C/D). No accompanying rationale, no chain-of-thought trace, no per-option explanation. This is the standard for multiple-choice benchmarks but it under-supports a procurement audit: when a model gets a question wrong, there is no labeled material to localize the failure mode. For comparison, MATH includes step-by-step solutions; HumanEval includes test-driven correctness; ARC includes rationales for the AI2 Reasoning Challenge subset. MMLU's density is procurement-relevant because it limits the kind of error analysis a buyer's evaluation team can perform.
Contamination cleanliness — 35
35 / 100
This is the single dimension that makes the difference between Platinum and Gold. MMLU questions are scraped repeatedly across the open web: study-aid sites, AP/GRE practice forums, Quizlet decks, Anki shared collections. Common Crawl indexes most of these. Any model trained on a substantial fraction of post-2020 web text has plausibly encountered MMLU questions during pretraining.
What we can measure directly: in the LabelSets Contamination Report 001 (80 popular post-training datasets scanned against 40+ public eval benchmark fingerprints), MMLU appeared in the top matches of 23 of 80 datasets, and TIGER-Lab/MMLU-Pro was flagged as the worst MMLU contamination case at Jaccard similarity 0.0517. Even the dataset built to replace contaminated MMLU still inherits residual MMLU n-gram overlap by construction. Contamination Report 002 (in progress) will scan the upstream pretraining corpora directly.
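For readers unfamiliar with the metric, Jaccard similarity over n-gram sets is the size of the overlap divided by the size of the union. A minimal sketch, assuming word trigrams and a lowercase alphanumeric tokenizer; the source does not describe the fingerprinting method used in the contamination report, so the n-gram size and tokenization here are illustrative assumptions.

```python
import re


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams from lowercased alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap divided by union; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


benchmark_item = "Which of the following is a property of all prime numbers?"
dataset_row = "Quiz 4: which of the following is a property of even numbers"
print(jaccard(ngrams(benchmark_item), ngrams(dataset_row)))
```

Because the union grows with the size of the larger dataset, even a small Jaccard value can correspond to a meaningful number of shared benchmark items.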
The community fix — MMLU-Pro and MMLU-Redux — exists, but the original MMLU score is still the headline on most model cards. For procurement: an MMLU score on a model whose training corpus is undisclosed is not a contamination-clean capability signal. Substitute MMLU-Pro or a held-out subset if a clean signal is required.
Procurement profile — what this means for buyers
- For "model card claims X% on MMLU" as a screening signal: 81 (evaluation profile). Adequate as a coarse signal that a model isn't broken. Inadequate as a fine signal for comparing two models within ±5 percentage points — both label noise and contamination push the meaningful signal threshold up to roughly 5-point gaps.
- For per-subject capability claims ("strong in college medicine, weak in formal logic"): Marginal. Per-subject sample sizes are too small for confident statements without aggregating across multiple benchmarks.
- For SR 11-7 / EU AI Act Art. 10 model-risk documentation: Cite MMLU as one signal among several. Pair with at least one decontamination-aware benchmark (MMLU-Pro, GPQA, or BIG-Bench-Hard) and at least one capability benchmark that doesn't pre-date 2023 (the contamination boundary is roughly the 2020–2022 web crawl window).
- For research / academic comparison of new methods: Excellent fit. Open, citable, well-known, and the contamination story is itself a research surface (decontamination, robust evaluation, etc.).
Methodology
This audit was scored under LQS v3.1 with the multi-subject QA-benchmark adapter. Every dimension above maps to a documented rubric in the methodology preprint (DOI 10.5281/zenodo.20278981). The procurement profiles (evaluation, classification, RAG) are computed by re-weighting the same dimensions; the weights are public in the calibration corpus.
```bash
# Reproduce this audit locally:
git clone https://github.com/labelsets/lqs-public
cd lqs-public
node scorer.js signals/mmlu.json
# → { composite: 83, tier: "gold", dims: { ... } }
```
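The procurement profiles are described as re-weightings of the same dimension scores. A minimal sketch of that idea, using the twelve dimension scores from this report; the key names and every weight value are hypothetical, chosen only for illustration. They are not the published LQS weights, and this is not the logic of scorer.js.

```python
# Dimension scores as reported in this audit (key names are illustrative).
DIMENSIONS = {
    "format_compliance": 95, "license_clarity": 95,
    "maintainer_reputation": 92, "completeness": 92,
    "reproducibility": 85, "class_balance": 75,
    "deduplication": 68, "validation_qc": 68,
    "size_adequacy": 65, "label_noise": 58,
    "label_density": 52, "contamination_cleanliness": 35,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of dimension scores; weights need not sum to 1."""
    missing = set(scores) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total


equal = {name: 1.0 for name in DIMENSIONS}
# Hypothetical profile that cares twice as much about contamination and noise.
strict = dict(equal, contamination_cleanliness=2.0, label_noise=2.0)

print(round(weighted_score(DIMENSIONS, equal), 1))
print(round(weighted_score(DIMENSIONS, strict), 1))
```

An equal-weight mean of the twelve scores lands well below the published composite, which suggests the real weights favor the high-scoring dimensions; the calibration corpus is the authoritative source for the actual values.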
The 7-oracle consensus pass was not run for this report — MMLU has no canonical "training set" against which to fit multiple oracles. The audit is metadata- and signal-based, the same lens used for the FineWeb-Edu report. For any benchmark maintainer who wants the full v3.1 cert with oracle consensus run against a snapshot, contact us.
Recourse. If you are an MMLU maintainer or domain expert and believe a score here is wrong, the recourse process is documented in the methodology preprint §7. File an issue at the public-audit repo with a counter-citation. We will publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable hash; corrections are issued as new versions, not silent edits.
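A content hash is what lets a reader confirm that a published score has not been edited silently. A minimal sketch, assuming SHA-256 over a canonical JSON serialization; the source does not state which hashing scheme LQS uses, so both the scheme and the record fields below are assumptions for illustration.

```python
import hashlib
import json


def score_hash(record: dict) -> str:
    """SHA-256 of a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical record layout for a published score.
published = {"dataset": "cais/mmlu", "composite": 83, "tier": "gold", "version": "1.0"}
corrected = dict(published, composite=84, version="1.1")

print(score_hash(published))
print(score_hash(corrected))
```

Any change to a field produces a different digest, so a correction has to ship as a new version with its own hash rather than overwriting the old one.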
What this audit doesn't claim
- It does not claim MMLU is a bad benchmark. Gold tier (83) is "fit for procurement with documented caveats." Above the field mean. The benchmark earned its position; the field's usage of it has outgrown its construction.
- It does not claim the Hendrycks group did anything wrong. Every limitation flagged above is either acknowledged in the original paper or documented in published follow-up work (MMLU-Pro, MMLU-Redux). The audit translates existing knowledge into a procurement-shaped artifact.
- It does not predict downstream model performance. A model with a high MMLU score may still fall short on the things you actually need. A model with a low MMLU score may still be excellent at them. LQS scores benchmark fitness for procurement evidence, not model capability.
- It does not invalidate the benchmark's use in academic research. For comparing new methods under controlled conditions, MMLU remains the standard. The procurement audit is a different evaluation lens.
What's next
This is Report 002 in a public-audit series. Reports planned over the next 90 days:
- Report 003 — RedPajama-V2. 30T tokens. Three-way provenance comparison with FineWeb-Edu and The Pile.
- Report 004 — The Pile (EleutherAI). Foundational. Known Books3 / copyright surface. The most-litigated open pretraining corpus.
- Report 005 — HumanEval (OpenAI). Code benchmark. Different contamination profile, different label-construction protocol.
- Report 006 — A medical imaging corpus to be selected. First audit under the FDA 21 CFR 11 procurement lens.