80 post-training datasets scanned. 2 flagged contaminated.
Public n-gram overlap scan of widely-used instruction, RLHF, preference, and reasoning datasets on HuggingFace against 40+ public evaluation benchmark fingerprints (MMLU, HellaSwag, GSM8K, ARC, HumanEval, TruthfulQA, BBH, and 33 more). Open methodology, raw similarity metrics, every dataset listed below.
The headline finding
Of the 80 popular HuggingFace post-training datasets we scanned, 2 are measurably contaminated with public eval benchmark content (Jaccard similarity above the LQS v3.1 contamination threshold). One more sits at moderate and 12 at minor: measurable n-gram overlap that doesn't yet trip the threshold but is worth disclosing to downstream model trainers.
The two flagged datasets are revealing.
- TIGER-Lab/MMLU-Pro: worst benchmark match is MMLU at similarity 0.0517. The dataset is explicitly built as a harder MMLU successor and shares stems by construction; this is expected. Worth flagging anyway because downstream model evaluators using MMLU-Pro as a "clean" MMLU alternative inherit residual MMLU n-gram overlap.
- garage-bAInd/Open-Platypus: worst benchmark match is MATH at similarity 0.1167. A widely-cited LLM fine-tuning dataset assembled from STEM benchmarks. Models fine-tuned on Open-Platypus and then evaluated on MATH are double-counting their training and eval signal.
Why this matters. Procurement teams citing a model's MATH or MMLU score under SR 11-7, EU AI Act Art. 10, FDA 21 CFR 11, or §1557 paperwork inherit the contamination surface of the model's training corpus. This scan publishes the surface for 80 popular datasets so that disclosure is mechanical, not narrative.
Benchmark appearance — how often does each public eval show up in scan results?
For each of the 40+ public evaluation benchmark fingerprints indexed by the LQS contamination scanner, we counted how many of the 80 scanned datasets had any measurable n-gram overlap (similarity > 0). The most frequently matched benchmarks are listed below.
| Benchmark | Datasets with any overlap | Share of scanned (80) |
|---|---|---|
| HumanEval | 27 | 33.8% |
| MATH | 23 | 28.8% |
| ARC Challenge | 23 | 28.8% |
| TruthfulQA | 20 | 25.0% |
| GSM8K | 19 | 23.8% |
| WinoGrande | 18 | 22.5% |
| SQuAD v2.0 | 14 | 17.5% |
| CodeContests | 13 | 16.3% |
| SQuAD v1.1 | 11 | 13.8% |
| MMLU | 7 | 8.8% |
| HellaSwag | 3 | 3.8% |
Interpretation: "Any overlap" means the dataset had non-zero similarity with the benchmark, not that it's contaminated. Most overlaps are below the threshold. The signal worth tracking is the worst-benchmark column in the full table below.
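The counting step behind the appearance table can be sketched as follows. This is a minimal illustration, not the scanner's own code: the function name `benchmark_appearance`, the input shape (dataset name mapped to per-benchmark similarities), and the toy data are all assumptions made for the example.

```python
from collections import Counter


def benchmark_appearance(scan_results, total_datasets):
    """Count datasets with non-zero similarity for each benchmark.

    scan_results maps dataset name -> {benchmark name: similarity}.
    Returns (benchmark, count, share_percent) tuples, highest count first.
    """
    counts = Counter()
    for similarities in scan_results.values():
        for benchmark, sim in similarities.items():
            if sim > 0:  # "any overlap" means strictly positive similarity
                counts[benchmark] += 1
    return [
        (benchmark, n, round(100 * n / total_datasets, 1))
        for benchmark, n in counts.most_common()
    ]


# Toy data, invented for illustration only (not taken from the scan).
toy = {
    "dataset-a": {"HumanEval": 0.0156, "MATH": 0.0},
    "dataset-b": {"HumanEval": 0.0078, "MATH": 0.0234},
    "dataset-c": {"HumanEval": 0.0, "MATH": 0.0},
    "dataset-d": {"HumanEval": 0.0391},
}
rows = benchmark_appearance(toy, total_datasets=len(toy))
for benchmark, count, share in rows:
    print(f"{benchmark}: {count} datasets ({share}%)")
```

The share is taken over all scanned datasets rather than over datasets with a usable result, matching the "Share of scanned (80)" column header.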
Full results — all 80 scanned datasets
Sorted: contaminated → moderate → minor → clean → unknown, then by max similarity descending within each tier.
| Dataset | Tier | Score | Worst benchmark | Max sim | Sample |
|---|---|---|---|---|---|
| garage-bAInd/Open-Platypus | contaminated | 25 | MATH | 0.1167 | 600 |
| TIGER-Lab/MMLU-Pro | contaminated | 44 | MMLU | 0.0517 | 600 |
| jondurbin/py-dpo-v0.1 | moderate | 68 | HumanEval | 0.0391 | 600 |
| nickrosh/Evol-Instruct-Code-80k-v1 | minor | 75 | HumanEval | 0.0391 | 600 |
| AI-MO/NuminaMath-CoT | minor | 72 | CodeContests | 0.0313 | 600 |
| meta-math/MetaMathQA | minor | 70 | MATH | 0.0283 | 600 |
| nvidia/OpenMathInstruct-1 | minor | 86 | MATH | 0.0234 | 600 |
| m-a-p/Code-Feedback | minor | 82 | HumanEval | 0.0234 | 600 |
| mlabonne/guanaco-llama2-1k | minor | 85 | TruthfulQA | 0.0156 | 300 |
| jondurbin/airoboros-2.2.1 | minor | 87 | TruthfulQA | 0.0156 | 600 |
| TIGER-Lab/MathInstruct | minor | 87 | ARC Challenge | 0.0156 | 600 |
| ise-uiuc/Magicoder-Evol-Instruct-110K | minor | 87 | CodeContests | 0.0156 | 600 |
| iamtarun/python_code_instructions_18k_alpaca | minor | 82 | HumanEval | 0.0100 | 600 |
| teknium/OpenHermes-2.5 | minor | 85 | GSM8K | 0.0078 | 600 |
| argilla/OpenHermes2.5-dpo-binarized-alpha | minor | 87 | GSM8K | 0.0078 | 600 |
| microsoft/orca-math-word-problems-200k | clean | 92 | MATH | 0.0313 | 600 |
| sahil2801/CodeAlpaca-20k | clean | 90 | HumanEval | 0.0313 | 600 |
| HuggingFaceH4/no_robots | clean | 94 | WinoGrande | 0.0234 | 600 |
| Open-Orca/FLAN | clean | 99 | CodeContests | 0.0234 | 600 |
| camel-ai/math | clean | 97 | MATH | 0.0234 | 600 |
| camel-ai/chemistry | clean | 99 | MATH | 0.0234 | 600 |
| TokenBender/code_instructions_122k_alpaca_style | clean | 92 | HumanEval | 0.0234 | 600 |
| tatsu-lab/alpaca | clean | 100 | WinoGrande | 0.0156 | 600 |
| yahma/alpaca-cleaned | clean | 100 | SQuAD v2.0 | 0.0156 | 600 |
| vicgalle/alpaca-gpt4 | clean | 100 | HumanEval | 0.0156 | 600 |
| databricks/databricks-dolly-15k | clean | 92 | WinoGrande | 0.0156 | 600 |
| WizardLMTeam/WizardLM_evol_instruct_70k | clean | 100 | HumanEval | 0.0156 | 597 |
| jondurbin/airoboros-3.2 | clean | 100 | TruthfulQA | 0.0156 | 600 |
| migtissera/Synthia-v1.3 | clean | 97 | SQuAD v2.0 | 0.0156 | 600 |
| Open-Orca/OpenOrca | clean | 97 | ARC Challenge | 0.0156 | 600 |
| Open-Orca/SlimOrca | clean | 92 | HumanEval | 0.0156 | 600 |
| Open-Orca/SlimOrca-Dedup | clean | 92 | HumanEval | 0.0156 | 600 |
| teknium/openhermes | clean | 97 | SQuAD v2.0 | 0.0156 | 556 |
| HuggingFaceH4/ultrachat_200k | clean | 100 | HumanEval | 0.0156 | 600 |
| Muennighoff/flan | clean | 100 | ARC Challenge | 0.0156 | 499 |
| camel-ai/physics | clean | 100 | MATH | 0.0156 | 600 |
| glaiveai/glaive-code-assistant | clean | 92 | HumanEval | 0.0156 | 600 |
| Anthropic/hh-rlhf | clean | 100 | TruthfulQA | 0.0156 | 600 |
| nvidia/HelpSteer | clean | 100 | SQuAD v2.0 | 0.0156 | 562 |
| Intel/orca_dpo_pairs | clean | 97 | ARC Challenge | 0.0156 | 600 |
| argilla/distilabel-intel-orca-dpo-pairs | clean | 95 | CodeContests | 0.0156 | 600 |
| argilla/dpo-mix-7k | clean | 100 | GSM8K | 0.0156 | 600 |
| HuggingFaceH4/ultrafeedback_binarized | clean | 90 | MATH | 0.0156 | 600 |
| berkeley-nest/Nectar | clean | 92 | HumanEval | 0.0156 | 600 |
| allenai/WildChat-1M | clean | 100 | ARC Challenge | 0.0156 | 600 |
| 0-hero/Matter-0.1 | clean | 92 | TruthfulQA | 0.0156 | 596 |
| open-web-math/open-web-math | clean | 100 | ARC Challenge | 0.0156 | 600 |
| timdettmers/openassistant-guanaco | clean | 100 | WinoGrande | 0.0078 | 300 |
| WizardLMTeam/WizardLM_evol_instruct_V2_196k | clean | 92 | HumanEval | 0.0078 | 597 |
| jondurbin/truthy-dpo-v0.1 | clean | 100 | TruthfulQA | 0.0078 | 600 |
| Norquinal/claude_multiround_chat_30k | clean | 100 | GSM8K | 0.0078 | 600 |
| LDJnr/Pure-Dove | clean | 100 | WinoGrande | 0.0078 | 600 |
| LDJnr/Verified-Camel | clean | 100 | MMLU | 0.0078 | 254 |
| allenai/tulu-v2-sft-mixture | clean | 100 | ARC Challenge | 0.0078 | 494 |
| LDJnr/Capybara | clean | 100 | MMLU | 0.0078 | 600 |
| ise-uiuc/Magicoder-OSS-Instruct-75K | clean | 92 | HumanEval | 0.0078 | 600 |
| glaiveai/glaive-function-calling-v2 | clean | 97 | SQuAD v1.1 | 0.0078 | 600 |
| nvidia/HelpSteer2 | clean | 100 | TruthfulQA | 0.0078 | 588 |
| OpenAssistant/oasst1 | clean | 100 | HumanEval | 0.0078 | 600 |
| OpenAssistant/oasst2 | clean | 100 | TruthfulQA | 0.0078 | 600 |
| allenai/WildChat | clean | 100 | HellaSwag | 0.0078 | 600 |
| migtissera/Tess-v1.5 | clean | 100 | WinoGrande | 0.0078 | 600 |
| Salesforce/wikitext | clean | 100 | WinoGrande | 0.0078 | 209 |
| abisee/cnn_dailymail | clean | 100 | WinoGrande | 0.0078 | 600 |
| JeanKaddour/minipile | clean | 90 | WinoGrande | 0.0078 | 300 |
| HuggingFaceTB/cosmopedia-100k | clean | 100 | ARC Challenge | 0.0078 | 600 |
| camel-ai/biology | clean | 100 | — | 0.0000 | 600 |
| Helsinki-NLP/europarl | clean | 100 | — | 0.0000 | 452 |
| mlabonne/orca-dpo-pairs | unknown | — | — | — | — |
| cognitivecomputations/dolphin | unknown | — | — | — | — |
| stingning/ultrachat | unknown | — | — | — | — |
| conceptofmind/flan2021_submix_original | unknown | — | — | — | — |
| conceptofmind/cot_submix_original | unknown | — | — | — | — |
| conceptofmind/niv2_submix_original | unknown | — | — | — | — |
| conceptofmind/dialog_submix_original | unknown | — | — | — | — |
| camel-ai/ai_society | unknown | — | — | — | — |
| allenai/strategyqa | unknown | — | — | — | — |
| princeton-nlp/SimPO-data | unknown | — | — | — | — |
| HuggingFaceFW/fineweb | unknown | — | — | — | — |
| togethercomputer/RedPajama-Data-Instruct | unknown | — | — | — | — |
Methodology
Each dataset was sampled at the listed sample size (default 600 records per split). Sampled records were tokenized into n-grams (n=8 by default for text, n=4 for code). N-gram sets were compared against the LQS benchmark fingerprint registry — a deduplicated index of n-grams drawn from the train + validation + test splits of 40+ widely-cited public evaluation suites. Maximum Jaccard similarity across all (split × benchmark) pairs is reported as max_similarity; the worst-matching benchmark is recorded.
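The comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not the LQS scanner itself: whitespace tokenization with lowercasing is assumed (the actual tokenizer is not specified here), and the helper names `word_ngrams`, `jaccard`, and `worst_match` are hypothetical. The toy example uses n=3 so the strings stay short; the scan uses n=8 for text and n=4 for code.

```python
def word_ngrams(text, n=8):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: intersection size over union size (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def worst_match(sample_ngrams, fingerprints):
    """Return (benchmark, similarity) for the fingerprint with the highest similarity.

    Returns (None, 0.0) when no fingerprint overlaps the sample at all.
    """
    best_name, best_sim = None, 0.0
    for name, fingerprint in fingerprints.items():
        sim = jaccard(sample_ngrams, fingerprint)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim


# Toy strings, invented for illustration only.
sample = word_ngrams("the quick brown fox jumps over the lazy dog", n=3)
fingerprints = {
    "bench-a": word_ngrams("a quick brown fox jumps high", n=3),
    "bench-b": word_ngrams("completely unrelated text about chemistry", n=3),
}
name, sim = worst_match(sample, fingerprints)
print(name, round(sim, 4))
```

Here the sample has 7 trigrams, bench-a has 4, and they share 2, so the similarity is 2 / 9. Because Jaccard divides by the union, a small sample compared against a large fingerprint yields small values even when every shared n-gram is a genuine match.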
Tier thresholds:
- clean: max similarity < 0.005, no contaminated phrase hits
- minor: max similarity 0.005–0.015 OR phrase hits below threshold
- moderate: max similarity 0.015–0.03 OR more than one minor flag
- contaminated: max similarity ≥ 0.03 OR direct phrase-match contamination on a high-stakes benchmark
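The similarity bands alone can be written as a small lookup. This sketch makes two assumptions: each band includes its lower bound (the ranges above do not say which side the boundary values fall on), and the function name `tier_from_similarity` is hypothetical. It deliberately ignores the phrase-hit and flag-count conditions. Several rows in the results table carry a tier that the similarity band alone would not give, so this sketch will not reproduce the published Tier column.

```python
def tier_from_similarity(max_sim):
    """Map a max Jaccard similarity to a tier using only the similarity bands.

    Assumes lower-inclusive bands; phrase-hit and flag-count rules are not modeled.
    """
    if max_sim >= 0.03:
        return "contaminated"
    if max_sim >= 0.015:
        return "moderate"
    if max_sim >= 0.005:
        return "minor"
    return "clean"


for value in (0.1167, 0.02, 0.01, 0.0):
    print(value, tier_from_similarity(value))
```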
The thresholds were tuned against an empirical labelled set of known-contaminated and known-clean datasets — see the empirical-threshold-validation report at validation/tools/benchmark-audit/REPORT.md.
Sample size caveat: 600 records is a representative sample, not a full enumeration. A dataset scored "clean" here may have hidden contamination in unsampled records. Re-scanning at a higher sample size or against a private fingerprint set is available via the Enterprise tier.
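To put a rough number on that caveat: if a fraction p of a dataset's records is contaminated and records are drawn independently, the chance that a sample of n records contains none of them is (1 - p)^n. The sketch below is a back-of-the-envelope model, not part of the scan; it assumes uniform random sampling from a dataset much larger than the sample, and the function name `miss_probability` is hypothetical.

```python
def miss_probability(contaminated_fraction, sample_size):
    """Probability that a random sample contains no contaminated record.

    Assumes independent, uniform draws (a reasonable approximation when the
    dataset is much larger than the sample).
    """
    return (1.0 - contaminated_fraction) ** sample_size


# Contamination in 1 of every 1,000 records is missed more often than not at n=600.
p_rare = miss_probability(0.001, 600)
# Contamination in 1 of every 100 records is almost always sampled at least once.
p_common = miss_probability(0.01, 600)
print(f"p=0.001: {p_rare:.3f}   p=0.01: {p_common:.4f}")
```

Under these assumptions a 600-record sample misses 0.1% contamination about 55% of the time, and misses 1% contamination well under 1% of the time. Sampling a contaminated record is also not the same as tripping a tier threshold, since the reported metric is set-level Jaccard similarity.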
Recourse. If you maintain a dataset listed here and believe a tier or score is wrong, the recourse process is the same as for LQS audits — documented in the methodology preprint §7. File an issue with a counter-citation (e.g. "the n-gram overlap is from a public-domain template, not from MMLU itself") and we'll publish a v1.1 with the correction. We do not modify results under non-public pressure. Every result carries an immutable cert hash; corrections are issued as new versions.
What's next
Contamination Report 002 will expand scope from post-training to pretraining corpora — Common Crawl, The Pile, RedPajama-V2, OSCAR, Dolma, C4, FineWeb. That's the layer where benchmark contamination most affects the score every model card reports, and where the upstream surface determines what every downstream fine-tune inherits.
If you're a maintainer interested in being scanned before the next report lands, reach out via /contact.