💬 Curated Catalog · NLP / Text

MMLU — Massive Multitask Language Understanding

Multiple-choice benchmark across 57 academic and professional subjects — the standard LLM knowledge eval.

LQS 74 · silver · ✓ Commercial OK · 15.9K multiple-choice questions · 20 MB · CSV · JSON · Released 2020
Source: github.com · maintained by Dan Hendrycks et al. (UC Berkeley)

About this dataset

MMLU is the dominant zero/few-shot knowledge benchmark for large language models. It covers 57 subjects spanning STEM, the humanities, social sciences, law, medicine, and more, all in a single four-option multiple-choice format. Released by Hendrycks et al. in 2020, it has been reported in nearly every frontier-model system card since and is commonly cited alongside HellaSwag and ARC as part of the LLM leaderboard 'big three.'

License
MIT

Formats
CSV · JSON

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

74
out of 100
silver tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 92
No public completeness metric; using prior for 'research_release' datasets.
Uniqueness 68
Minimal deduplication disclosed.
Validation 68
Crowdsourced labels without disclosed QC protocol.
Size adequacy 65
15,908 items — below 100,000 target for NLP / Text, but usable.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 52
Average 1.0 labels per item (sparse).
Class balance 75
Moderate class skew — realistic production distribution.

What it's used for

Common tasks and benchmarks where MMLU is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

57 subjects, 15,908 MCQs; 5 dev + ≥100 test questions per subject; 5-shot standard.
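Those stats map directly onto the raw files. A minimal parsing sketch, assuming the per-subject CSV layout from the GitHub release (question, four choices, answer letter, no header row); verify against the actual files before relying on it:

```python
import csv
import io

# Hypothetical sample row mirroring the assumed per-subject CSV layout:
# question, four choices, answer letter, no header row.
sample = io.StringIO('What is 2 + 2?,3,4,5,6,B\n')

def format_question(row):
    """Render one CSV row as a four-option multiple-choice prompt."""
    question, *choices, answer = row
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines), answer

for row in csv.reader(sample):
    prompt, gold = format_question(row)
    print(prompt)  # question, lettered choices, trailing "Answer:"
    print(gold)    # gold answer letter, here "B"
```

In the standard 5-shot setup, the subject's five dev questions (each followed by its gold answer) are prepended to the test question formatted this way.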

License

MMLU — Massive Multitask Language Understanding is distributed under MIT. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with guarantees that public datasets often can't provide.


Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Can I use MMLU commercially?
MMLU is distributed under MIT, which generally permits commercial use. Always verify the current license terms with the maintainer (Dan Hendrycks et al., UC Berkeley) before using it in a commercial product.

How large is MMLU?
MMLU contains 15,908 multiple-choice questions across 57 subjects, with 5 dev questions and at least 100 test questions per subject; 5-shot evaluation is standard.

Where is MMLU hosted?
MMLU is maintained by Dan Hendrycks et al. (UC Berkeley) and is available at https://github.com/hendrycks/test. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.