Public LQS Audit · Report 002

MMLU — A procurement-grade audit of the most-cited LLM benchmark.

Composite 83 / 100. Gold tier — not Platinum. World-class format and license clarity. Crowdsourced labels with no disclosed QC, estimated 25–30% label noise. Heavy contamination across the open training-data ecosystem — the score every model card reports is partially memorized, not earned. Open methodology, signed cert, recourse documented.

Published May 19, 2026 · LabelSets Research · 11 min read · Author: Alex Adrion
Composite: 83 / 100 (Gold)
LQS v3.1 composite · default profile
evaluation profile: 81 · classification profile: 82 · RAG profile: 85

Format compliance: 95
License clarity: 95
Maintainer reputation: 92
Completeness (docs): 92
Reproducibility: 85
Subject class balance: 75
Deduplication / uniqueness: 68
Validation / QC process: 68
Size adequacy (per subject): 65
Label noise: 58
Label density / rationale: 52
Contamination cleanliness: 35

What we audited

Dataset: cais/mmlu
Full name: Massive Multitask Language Understanding
Size: 15,908 multiple-choice questions across 57 subjects (test + dev + validation splits)
Modality: Text (English), multiple-choice question answering
License: MIT (commercial-friendly, unambiguous)
Source: Crowdsourced from online practice exams (GRE, AP, USMLE, MBE, etc.)
Maintainer: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt (UC Berkeley)
Paper: arXiv:2009.03300
Year released: 2020
Distribution format: Parquet via HuggingFace; CSV original release
Citations (Google Scholar): 4,200+, the most-cited LLM benchmark in existence

The headline finding

MMLU is the benchmark that taught the LLM industry how to compare models. Almost every modern frontier model release — GPT-4, Claude, Gemini, Llama, Mistral — reports MMLU as one of the first two or three numbers in the announcement. It earned that position. The benchmark is genuinely well-constructed: 57 academic subjects, drawn from real exams, with the kind of breadth that surfaces capability gaps a narrow benchmark would miss.

That said, the way the field uses MMLU today is not the way the benchmark was designed to be used. Two procurement-relevant gaps separate the headline number from the signal a model-risk team needs.

  1. Contamination is widespread and quantifiable. MMLU questions appear verbatim in Common Crawl, in derivative datasets, in scraped study-aid sites, in cached question banks. Any modern web-trained model has plausibly seen a non-trivial fraction of MMLU during pretraining. The community has known this for years — it's why MMLU-Pro exists. But MMLU itself still gets reported in every model card without a contamination caveat. We score contamination cleanliness at 35 — the lowest dimension on this audit, and the dimension that matters most for using MMLU as a true capability signal.
  2. The labels are noisier than the way the score is reported implies. MMLU was crowdsourced from public practice-exam sites without a disclosed QC protocol. Multiple independent analyses (Gema et al. 2024, Wang et al. 2024) have found 25–30%+ label noise: questions with no correct answer among the four options, questions with multiple defensible answers, and factually incorrect "correct" answers. When a model card claims "87% on MMLU", that 87% is measured against an answer key that is itself unreliable on a sizable share of items. A one-point gap such as "GPT-X scored 87 vs Claude-Y scored 88" therefore carries far less signal than the reported precision implies.

Why this audit exists. Procurement teams attaching MMLU scores to model risk paperwork are inheriting two assumptions that no model card states: (1) the score is contamination-clean, (2) the labels are well-validated. Neither is true in 2026. The LQS framework standardizes these dimensions so that "MMLU = 83 Gold" carries the same meaning across buyers — and so the conditions under which the score is and isn't trustworthy are explicit, not folkloric.

Dimension-by-dimension reasoning

Format compliance — 95

95 / 100

Clean parquet via the HuggingFace mirror, drop-in compatible with the datasets library. Original CSV release still available. Schema is dense: question, four options, single-letter answer key, subject tag. Common multiple-choice evaluation harnesses (EleutherAI's lm-evaluation-harness, OpenAI's evals) consume MMLU without custom code. Deduction is purely for the original CSV's lack of a published schema file at release time.
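The schema described above is small enough to check mechanically before an evaluation run. The following is a minimal sketch, not part of the LQS tooling: the field names (`question`, `subject`, `choices`, `answer`) are an assumption about how a loaded record looks, and the answer key is accepted either as a letter or as a 0..3 index because distributions may differ on this point. Verify both against the copy you actually download.

```python
# Hypothetical record validator for an MMLU-style row.
# Field names are assumptions; check them against your local copy.
LETTERS = "ABCD"


def answer_index(answer):
    """Map an answer key to an index 0..3; accepts a letter or an integer."""
    if isinstance(answer, str) and len(answer) == 1 and answer.upper() in LETTERS:
        return LETTERS.index(answer.upper())
    if isinstance(answer, int) and not isinstance(answer, bool) and 0 <= answer <= 3:
        return answer
    raise ValueError(f"unrecognized answer key: {answer!r}")


def validate_row(row):
    """Return a list of schema problems for one record (empty list = OK)."""
    problems = []
    for field in ("question", "subject"):
        if not isinstance(row.get(field), str) or not row[field].strip():
            problems.append(f"{field}: expected a non-empty string")
    choices = row.get("choices")
    if not isinstance(choices, list) or len(choices) != 4:
        problems.append("choices: expected a list of four options")
    try:
        answer_index(row.get("answer"))
    except ValueError as err:
        problems.append(f"answer: {err}")
    return problems


example = {
    "question": "What is 2 + 2?",
    "subject": "elementary_mathematics",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,
}
print(validate_row(example))
```

Running a check like this over every row is cheap and catches silent format drift between mirrors before it shows up as a scoring discrepancy.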

Sources: HF dataset card · lm-eval-harness MMLU loader · original CSV release

License clarity — 95

95 / 100

MIT license. Permissive, commercial-friendly, unambiguous. No attribution chain to negotiate, no inherited license from upstream sources — questions are paraphrased or sourced from public-domain or freely-redistributable exam material. This is the model other benchmark releases should follow. The 5-point deduction is for the absence of an explicit Terms-of-Use addendum addressing the contamination question — i.e., whether maintainers expect downstream users to disclose if their training corpus may have included MMLU questions.

Sources: MMLU GitHub repo LICENSE file · paper Section 7 (release statement)

Maintainer reputation — 92

92 / 100

The Hendrycks-led group at UC Berkeley has an exceptional publication record: MMLU itself (originally released as the Hendrycks Test), MATH, APPS, ETHICS, and HarmBench. Sustained engagement with the safety / evaluation community. Responsive to errata. Follow-up work from other groups (MMLU-Pro from TIGER-Lab, MMLU-Redux from Gema et al.) documents the benchmark's limitations openly. Deduction is small and reflects the absence of an active maintenance cadence on the original MMLU repo: it's stable, not maintained.

Sources: Hendrycks lab publication history · MMLU-Pro paper · GitHub repo activity

Completeness (documentation) — 92

92 / 100

Original paper documents the construction methodology, subject distribution, and intended use. HF dataset card lists splits, subjects, and example questions. Subject-level results from the paper are reproducible. Missing: a published label-construction protocol (how were correct answers verified?) and a published QC log. These absences are why later contamination and label-noise audits had to be done externally.

Sources: arXiv:2009.03300 · HF dataset card · paper Sections 3-4

Reproducibility — 85

85 / 100

The benchmark itself is fully reproducible: the questions and answers are public, the evaluation methodology is described in the paper, multiple open-source harnesses produce consistent scores. The deduction is for the upstream sources. MMLU was assembled from "freely-available online sources" — but the exact source URLs are not catalogued in the release. Anyone wanting to verify question provenance or check whether a specific question is paraphrased from a copyrighted exam has to reconstruct that mapping themselves.

Sources: paper Section 3 (construction methodology) · lm-evaluation-harness MMLU implementation

Subject class balance — 75

75 / 100

57 subjects, organized into four meta-categories (STEM, humanities, social sciences, other). Per-subject question counts vary from ~100 to more than 1,500 (professional law). Skews modestly toward STEM (college math, abstract algebra, etc.). Some subjects are conspicuously thin: machine learning has 112 questions; college medicine has 173. For per-subject capability comparison, the smaller subjects produce 95% confidence intervals of roughly ±5–6 points at typical model accuracy. Acceptable for a breadth benchmark; not adequate as a per-subject diagnostic.

Sources: paper Section 3.1 (subject distribution) · HF dataset card subject counts

Deduplication / uniqueness — 68

68 / 100

No published deduplication pass. Independent re-scrapes of online practice-exam sites against the MMLU corpus find a non-trivial fraction of near-duplicates — paraphrased or alternate-numbered versions of the same underlying question. Within MMLU itself, internal duplication is rare. The concern is between MMLU and the rest of the open web: the same question often exists in multiple variants, each scrapable, which feeds the contamination story below.

Sources: MMLU-Pro decontamination analysis · LQS dedup scanner output

Validation / QC process — 68

68 / 100

The paper describes a manual cleaning pass but does not publish the QC protocol, inter-annotator agreement, or rejection rate. There is no published double-annotation pass, no adjudication record, no version history of correction edits. Compare: ImageNet has a multi-rater consensus protocol with published κ; SuperGLUE publishes a multi-pass adjudication record. MMLU's labels are best-effort but not procurement-grade auditable. This is the second-largest concern after contamination.

Sources: paper Section 3 (construction) · absence of public QC log · comparison to ImageNet / SuperGLUE QC processes

Size adequacy — 65

65 / 100

15,908 questions total is small for a benchmark spanning 57 subjects. The aggregate score has tight confidence intervals (roughly ±0.5 points at 95% for typical accuracies), but per-subject scores have noticeably wider ones: 95% intervals of ±4–6 points on smaller subjects. For a comparison like "Model A beat Model B by 2 points on MMLU", the sample size is adequate; a 0.3-point gap sits inside the sampling interval. For "Model A has stronger college medicine capability than Model B", the sample size is borderline. Procurement teams reading per-subject MMLU breakdowns should treat them as directional, not definitive.
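The interval widths discussed above can be approximated with a normal (Wald) binomial interval. This is a minimal sketch under stated assumptions: questions are treated as independent draws, the 0.87 accuracy is an illustrative value rather than a measured one, and the function name is ours. The question counts come from this report.

```python
import math


def wald_half_width(accuracy, n, z=1.96):
    """Half-width of a normal-approximation confidence interval
    for an accuracy measured on n independent questions."""
    if not 0.0 <= accuracy <= 1.0 or n <= 0:
        raise ValueError("accuracy must be in [0, 1] and n must be positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Question counts taken from the audit text; accuracy is illustrative.
cases = [("all questions", 15908), ("college medicine", 173), ("machine learning", 112)]
for label, n in cases:
    half_width = wald_half_width(0.87, n)
    print(f"{label:>18}: n={n:>6}  +/- {100 * half_width:.1f} points")
```

The Wald interval is the simplest choice and becomes unreliable near 0% or 100% accuracy or for very small n; a Wilson interval is a common substitute in those cases.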

Sources: paper subject-level statistics · LQS size-adequacy rubric for QA benchmarks

Label noise — 58

58 / 100

Two independent re-annotation studies in 2024 found that 25–30%+ of MMLU questions have at least one of: (a) no correct answer among the four options, (b) multiple defensible correct answers, (c) a factually incorrect "correct" answer, (d) ambiguous wording that defeats single-answer scoring. MMLU-Redux (Gema et al.) exists specifically to re-annotate such items. Importantly, this means a model scoring 87% on MMLU is scoring 87% against a question pool where ~28% of items have label issues. Only around 70–75% of the pool can be scored unambiguously, so agreement with the answer key beyond that level partly measures agreement with flawed labels rather than capability. Score gaps below ~3–5 percentage points between models are within the label-noise floor.
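One way to reason about the effect of label issues on a reported score is a two-bucket mixture: clean items, where the key is right, and flawed items, where matching the key is only loosely related to being correct. The sketch below is our simplification, not a model taken from the cited studies. The 0.28 noisy fraction follows this report's estimate, and `key_match_on_noisy` (how often a model happens to match a flawed key) is a free assumption.

```python
def observed_accuracy(true_accuracy, noisy_fraction, key_match_on_noisy):
    """Expected score against the published key under a two-bucket model.

    true_accuracy      -- accuracy on items whose key is sound
    noisy_fraction     -- share of items with a label issue
    key_match_on_noisy -- chance of matching the key on a flawed item
    """
    for value in (true_accuracy, noisy_fraction, key_match_on_noisy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be in [0, 1]")
    clean_part = (1.0 - noisy_fraction) * true_accuracy
    noisy_part = noisy_fraction * key_match_on_noisy
    return clean_part + noisy_part


# A hypothetical model that is perfect on clean items, under different
# assumptions about how often it matches a flawed key.
for match_rate in (0.25, 0.50, 0.75):
    score = observed_accuracy(1.0, 0.28, match_rate)
    print(f"key match on flawed items = {match_rate:.2f} -> observed {100 * score:.1f}%")
```

The point of the exercise is that the reported score depends on an unobserved quantity, the match rate on flawed items, which memorization of the published key can inflate.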

Sources: Gema et al. 2024 (MMLU-Redux) · Wang et al. 2024 (MMLU-Pro motivation)

Label density / rationale — 52

52 / 100

One label per question (the letter A/B/C/D). No accompanying rationale, no chain-of-thought trace, no per-option explanation. This is the standard for multiple-choice benchmarks but it under-supports a procurement audit: when a model gets a question wrong, there is no labeled material to localize the failure mode. For comparison, MATH includes step-by-step solutions; HumanEval includes test-driven correctness; ARC includes rationales for the AI2 Reasoning Challenge subset. MMLU's density is procurement-relevant because it limits the kind of error analysis a buyer's evaluation team can perform.

Sources: MMLU dataset structure · comparison to MATH / ARC / HumanEval label density

Contamination cleanliness — 35

35 / 100

This is the single dimension that makes the difference between Platinum and Gold. MMLU questions are scraped repeatedly across the open web: study-aid sites, AP/GRE practice forums, Quizlet decks, Anki shared collections. Common Crawl indexes most of these. Any model trained on a substantial fraction of post-2020 web text has plausibly encountered MMLU questions during pretraining.

What we can measure directly: in the LabelSets Contamination Report 001 (80 popular post-training datasets scanned against 40+ public eval benchmark fingerprints), MMLU appeared among the top matches of 23 of 80 datasets. TIGER-Lab/MMLU-Pro was flagged as the worst MMLU contamination case at Jaccard similarity 0.0517: the dataset built to replace contaminated MMLU still inherits residual MMLU n-gram overlap by construction. Contamination Report 002 (in progress) will scan the upstream pretraining corpora directly.
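For readers unfamiliar with the metric, Jaccard similarity over n-gram sets can be sketched in a few lines. This is an illustration of the general technique only: the tokenization, the 5-gram window, and the function names are our assumptions, and the LQS contamination scanner's actual fingerprinting procedure is not reproduced here.

```python
import re


def word_ngrams(text, n=5):
    """Set of lowercase word n-grams; punctuation is discarded."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """|A intersect B| / |A union B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Illustrative strings, not real benchmark items.
question = "Which of the following is a property of a binary search tree"
scraped = ("Study guide. Which of the following is a property of a "
           "binary search tree? Answer below.")
print(round(jaccard(word_ngrams(question), word_ngrams(scraped)), 3))
```

Because the union grows with the larger side, Jaccard values stay small in absolute terms when a benchmark is compared against a much larger dataset, so a low-looking value can still reflect substantial verbatim overlap.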

The community fix — MMLU-Pro and MMLU-Redux — exists, but the original MMLU score is still the headline on most model cards. For procurement: an MMLU score on a model whose training corpus is undisclosed is not a contamination-clean capability signal. Substitute MMLU-Pro or a held-out subset if a clean signal is required.

Sources: MMLU-Pro paper Section 2 (contamination rationale) · LabelSets Contamination Report 001 · LQS contamination scanner (40+ benchmark fingerprints)

Procurement profile — what this means for buyers

Methodology

This audit was scored under LQS v3.1 with the multi-subject QA-benchmark adapter. Every dimension above maps to a documented rubric in the methodology preprint (DOI 10.5281/zenodo.20278981). The procurement profiles (evaluation, classification, RAG) are computed by re-weighting the same dimensions; the weights are public in the calibration corpus.

# Reproduce this audit locally:
git clone https://github.com/labelsets/lqs-public
cd lqs-public
node scorer.js signals/mmlu.json
# → { composite: 83, tier: "gold", dims: { ... } }
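The profile re-weighting described above amounts to a weighted mean over the same twelve dimension scores. The sketch below uses the scores published in this report, but every weight in it is a placeholder of ours: the real LQS weights live in the calibration corpus and are not reproduced here, so this does not recreate the published composite or the output of scorer.js.

```python
# Dimension scores as published in this audit.
SCORES = {
    "format_compliance": 95, "license_clarity": 95,
    "maintainer_reputation": 92, "completeness_docs": 92,
    "reproducibility": 85, "subject_class_balance": 75,
    "deduplication": 68, "validation_qc": 68,
    "size_adequacy": 65, "label_noise": 58,
    "label_density": 52, "contamination_cleanliness": 35,
}


def composite(scores, weights):
    """Weighted mean of dimension scores; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Placeholder profiles (hypothetical, NOT the LQS calibration weights).
equal_weights = {name: 1.0 for name in SCORES}
contamination_heavy = dict(equal_weights, contamination_cleanliness=3.0)

print(round(composite(SCORES, equal_weights), 1))
print(round(composite(SCORES, contamination_heavy), 1))
```

An unweighted mean of the twelve scores comes out lower than the published composite, which indicates that the default profile weights the dimensions non-uniformly. Buyers who care mostly about contamination can see how quickly the number moves when that single dimension is up-weighted.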

The 7-oracle consensus pass was not run for this report — MMLU has no canonical "training set" against which to fit multiple oracles. The audit is metadata- and signal-based, the same lens used for the FineWeb-Edu report. For any benchmark maintainer who wants the full v3.1 cert with oracle consensus run against a snapshot, contact us.

Recourse. If you are an MMLU maintainer or domain expert and believe a score here is wrong, the recourse process is documented in the methodology preprint §7. File an issue at the public-audit repo with a counter-citation. We will publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable hash; corrections are issued as new versions, not silent edits.

What this audit doesn't claim

What's next

This is Report 002 in a public-audit series. Reports planned over the next 90 days:
