Contamination Report 001 — 80 post-training datasets scanned against 40+ public benchmarks

The headline finding

Of the 80 popular HuggingFace post-training datasets we scanned, 2 are measurably contaminated with public eval benchmark content (Jaccard similarity above the LQS v3.1 contamination threshold). Another 1 sits at moderate and 12 at minor: measurable n-gram overlap that doesn't yet trip the threshold but is worth disclosing to downstream model trainers.

The two flagged datasets are revealing.

TIGER-Lab/MMLU-Pro: worst benchmark match is MMLU at similarity 0.0517. The dataset is explicitly built as a harder MMLU successor and shares question stems by construction, so this overlap is expected. It is worth flagging anyway, because downstream evaluators who use MMLU-Pro as a "clean" MMLU alternative inherit residual MMLU n-gram overlap.
garage-bAInd/Open-Platypus: worst benchmark match is MATH at similarity 0.1167. This is a widely cited LLM fine-tuning dataset assembled from STEM benchmarks. Models fine-tuned on Open-Platypus and then evaluated on MATH are scored partly on material they trained on, so the eval signal is not independent of the training signal.

Why this matters. Procurement teams citing a model's MATH or MMLU score under SR 11-7, EU AI Act Art. 10, FDA 21 CFR 11, or §1557 paperwork inherit the contamination surface of the model's training corpus. This scan publishes the surface for 80 popular datasets so that disclosure is mechanical, not narrative.

Benchmark appearance — how often does each public eval show up in scan results?

For each of the 40+ public evaluation benchmark fingerprints indexed by the LQS contamination scanner, we counted how many of the 80 scanned datasets had any measurable n-gram overlap (similarity > 0). The most frequently matched benchmarks are listed below.

Benchmark	Datasets with any overlap	Share of scanned (80)
HumanEval	27	33.8%
MATH	23	28.7%
ARC Challenge	23	28.7%
TruthfulQA	20	25.0%
GSM8K	19	23.8%
WinoGrande	18	22.5%
SQuAD v2.0	14	17.5%
CodeContests	13	16.3%
SQuAD v1.1	11	13.8%
MMLU	7	8.8%
HellaSwag	3	3.8%

Interpretation: "Any overlap" means the dataset had non-zero similarity with the benchmark, not that it's contaminated. Most overlaps are below the threshold. The signal worth tracking is the worst-benchmark column in the full table below.
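The counting behind the appearance table can be sketched as follows. This is a minimal illustration, not the scanner's code: the `scan_results` structure, the dataset names, and the similarity values are all hypothetical toy data, and `benchmark_appearance` is a name introduced here for illustration.

```python
from collections import Counter

# Hypothetical scan output: dataset -> {benchmark: similarity}.
# Toy values for illustration only, not the report's data.
scan_results = {
    "org/dataset-a": {"HumanEval": 0.0156, "MATH": 0.0},
    "org/dataset-b": {"HumanEval": 0.0078, "MATH": 0.0234},
    "org/dataset-c": {"HumanEval": 0.0, "MATH": 0.0},
    "org/dataset-d": {"GSM8K": 0.0078},
}


def benchmark_appearance(results):
    """Count, per benchmark, the datasets with any overlap (similarity > 0)."""
    counts = Counter()
    for sims in results.values():
        for benchmark, sim in sims.items():
            if sim > 0:
                counts[benchmark] += 1
    total = len(results)
    # One (benchmark, count, share of scanned) row each, most frequent first.
    return [(b, c, c / total) for b, c in counts.most_common()]


for benchmark, count, share in benchmark_appearance(scan_results):
    print(f"{benchmark}\t{count}\t{share:.1%}")
```

Note that a dataset counts toward a benchmark as soon as its similarity is non-zero, which is why these counts are much larger than the number of flagged datasets.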

Full results — all 80 scanned datasets

Sorted: contaminated → moderate → minor → clean → unknown, then by max similarity descending within each tier.

Dataset	Tier	Score	Worst benchmark	Max sim	Sample
`garage-bAInd/Open-Platypus`	contaminated	25	MATH	0.1167	600
`TIGER-Lab/MMLU-Pro`	contaminated	44	MMLU	0.0517	600
`jondurbin/py-dpo-v0.1`	moderate	68	HumanEval	0.0391	600
`nickrosh/Evol-Instruct-Code-80k-v1`	minor	75	HumanEval	0.0391	600
`AI-MO/NuminaMath-CoT`	minor	72	CodeContests	0.0313	600
`meta-math/MetaMathQA`	minor	70	MATH	0.0283	600
`nvidia/OpenMathInstruct-1`	minor	86	MATH	0.0234	600
`m-a-p/Code-Feedback`	minor	82	HumanEval	0.0234	600
`mlabonne/guanaco-llama2-1k`	minor	85	TruthfulQA	0.0156	300
`jondurbin/airoboros-2.2.1`	minor	87	TruthfulQA	0.0156	600
`TIGER-Lab/MathInstruct`	minor	87	ARC Challenge	0.0156	600
`ise-uiuc/Magicoder-Evol-Instruct-110K`	minor	87	CodeContests	0.0156	600
`iamtarun/python_code_instructions_18k_alpaca`	minor	82	HumanEval	0.0100	600
`teknium/OpenHermes-2.5`	minor	85	GSM8K	0.0078	600
`argilla/OpenHermes2.5-dpo-binarized-alpha`	minor	87	GSM8K	0.0078	600
`microsoft/orca-math-word-problems-200k`	clean	92	MATH	0.0313	600
`sahil2801/CodeAlpaca-20k`	clean	90	HumanEval	0.0313	600
`HuggingFaceH4/no_robots`	clean	94	WinoGrande	0.0234	600
`Open-Orca/FLAN`	clean	99	CodeContests	0.0234	600
`camel-ai/math`	clean	97	MATH	0.0234	600
`camel-ai/chemistry`	clean	99	MATH	0.0234	600
`TokenBender/code_instructions_122k_alpaca_style`	clean	92	HumanEval	0.0234	600
`tatsu-lab/alpaca`	clean	100	WinoGrande	0.0156	600
`yahma/alpaca-cleaned`	clean	100	SQuAD v2.0	0.0156	600
`vicgalle/alpaca-gpt4`	clean	100	HumanEval	0.0156	600
`databricks/databricks-dolly-15k`	clean	92	WinoGrande	0.0156	600
`WizardLMTeam/WizardLM_evol_instruct_70k`	clean	100	HumanEval	0.0156	597
`jondurbin/airoboros-3.2`	clean	100	TruthfulQA	0.0156	600
`migtissera/Synthia-v1.3`	clean	97	SQuAD v2.0	0.0156	600
`Open-Orca/OpenOrca`	clean	97	ARC Challenge	0.0156	600
`Open-Orca/SlimOrca`	clean	92	HumanEval	0.0156	600
`Open-Orca/SlimOrca-Dedup`	clean	92	HumanEval	0.0156	600
`teknium/openhermes`	clean	97	SQuAD v2.0	0.0156	556
`HuggingFaceH4/ultrachat_200k`	clean	100	HumanEval	0.0156	600
`Muennighoff/flan`	clean	100	ARC Challenge	0.0156	499
`camel-ai/physics`	clean	100	MATH	0.0156	600
`glaiveai/glaive-code-assistant`	clean	92	HumanEval	0.0156	600
`Anthropic/hh-rlhf`	clean	100	TruthfulQA	0.0156	600
`nvidia/HelpSteer`	clean	100	SQuAD v2.0	0.0156	562
`Intel/orca_dpo_pairs`	clean	97	ARC Challenge	0.0156	600
`argilla/distilabel-intel-orca-dpo-pairs`	clean	95	CodeContests	0.0156	600
`argilla/dpo-mix-7k`	clean	100	GSM8K	0.0156	600
`HuggingFaceH4/ultrafeedback_binarized`	clean	90	MATH	0.0156	600
`berkeley-nest/Nectar`	clean	92	HumanEval	0.0156	600
`allenai/WildChat-1M`	clean	100	ARC Challenge	0.0156	600
`0-hero/Matter-0.1`	clean	92	TruthfulQA	0.0156	596
`open-web-math/open-web-math`	clean	100	ARC Challenge	0.0156	600
`timdettmers/openassistant-guanaco`	clean	100	WinoGrande	0.0078	300
`WizardLMTeam/WizardLM_evol_instruct_V2_196k`	clean	92	HumanEval	0.0078	597
`jondurbin/truthy-dpo-v0.1`	clean	100	TruthfulQA	0.0078	600
`Norquinal/claude_multiround_chat_30k`	clean	100	GSM8K	0.0078	600
`LDJnr/Pure-Dove`	clean	100	WinoGrande	0.0078	600
`LDJnr/Verified-Camel`	clean	100	MMLU	0.0078	254
`allenai/tulu-v2-sft-mixture`	clean	100	ARC Challenge	0.0078	494
`LDJnr/Capybara`	clean	100	MMLU	0.0078	600
`ise-uiuc/Magicoder-OSS-Instruct-75K`	clean	92	HumanEval	0.0078	600
`glaiveai/glaive-function-calling-v2`	clean	97	SQuAD v1.1	0.0078	600
`nvidia/HelpSteer2`	clean	100	TruthfulQA	0.0078	588
`OpenAssistant/oasst1`	clean	100	HumanEval	0.0078	600
`OpenAssistant/oasst2`	clean	100	TruthfulQA	0.0078	600
`allenai/WildChat`	clean	100	HellaSwag	0.0078	600
`migtissera/Tess-v1.5`	clean	100	WinoGrande	0.0078	600
`Salesforce/wikitext`	clean	100	WinoGrande	0.0078	209
`abisee/cnn_dailymail`	clean	100	WinoGrande	0.0078	600
`JeanKaddour/minipile`	clean	90	WinoGrande	0.0078	300
`HuggingFaceTB/cosmopedia-100k`	clean	100	ARC Challenge	0.0078	600
`camel-ai/biology`	clean	100	—	0.0000	600
`Helsinki-NLP/europarl`	clean	100	—	0.0000	452
`mlabonne/orca-dpo-pairs`	unknown	—	—	—	—
`cognitivecomputations/dolphin`	unknown	—	—	—	—
`stingning/ultrachat`	unknown	—	—	—	—
`conceptofmind/flan2021_submix_original`	unknown	—	—	—	—
`conceptofmind/cot_submix_original`	unknown	—	—	—	—
`conceptofmind/niv2_submix_original`	unknown	—	—	—	—
`conceptofmind/dialog_submix_original`	unknown	—	—	—	—
`camel-ai/ai_society`	unknown	—	—	—	—
`allenai/strategyqa`	unknown	—	—	—	—
`princeton-nlp/SimPO-data`	unknown	—	—	—	—
`HuggingFaceFW/fineweb`	unknown	—	—	—	—
`togethercomputer/RedPajama-Data-Instruct`	unknown	—	—	—	—

Methodology

Each dataset was sampled at the listed sample size (default 600 records per split). Sampled records were tokenized into n-grams (n=8 by default for text, n=4 for code). N-gram sets were compared against the LQS benchmark fingerprint registry — a deduplicated index of n-grams drawn from the train + validation + test splits of 40+ widely-cited public evaluation suites. Maximum Jaccard similarity across all (split × benchmark) pairs is reported as max_similarity; the worst-matching benchmark is recorded.
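The comparison described above can be sketched roughly as follows. Several details are assumptions made for illustration: tokenization is plain lowercased whitespace splitting (the actual tokenizer is not specified here), all sampled records are pooled into one n-gram set rather than handled per split, and the function names and the shape of `fingerprints` are hypothetical.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in text.

    Assumption: lowercased, whitespace-split tokens.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(records, fingerprints, n=8):
    """Return (worst_benchmark, similarity) across all benchmark fingerprints.

    fingerprints: hypothetical mapping of benchmark name -> set of n-grams.
    """
    sample = set()
    for record in records:
        sample |= ngrams(record, n)
    worst_name, worst_sim = None, 0.0
    for name, fingerprint in fingerprints.items():
        sim = jaccard(sample, fingerprint)
        if sim > worst_sim:
            worst_name, worst_sim = name, sim
    return worst_name, worst_sim


# Toy usage with n=3 so the short strings produce n-grams.
toy_fingerprints = {
    "toy-bench": ngrams("what is the capital of france", n=3),
    "other-bench": ngrams("solve for x in the equation", n=3),
}
toy_records = ["Question: what is the capital of france"]
print(max_similarity(toy_records, toy_fingerprints, n=3))
```

Because Jaccard similarity divides by the size of the union, a small absolute number of shared n-grams yields a small score when either set is large, which is consistent with the low values in the results table.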

Tier thresholds:

clean — max similarity < 0.005, no contaminated phrase hits
minor — max similarity 0.005–0.015 OR phrase hits below threshold
moderate — max similarity 0.015–0.03 OR more than one minor flag
contaminated — max similarity ≥ 0.03 OR direct phrase-match contamination on a high-stakes benchmark
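The similarity part of these thresholds can be written as a small lookup. This sketch covers only the max-similarity ranges: the phrase-hit and multiple-flag conditions are omitted because their inputs are not defined here, and treating each boundary as lower-inclusive is an assumption. The function name is hypothetical.

```python
def tier_from_similarity(max_sim):
    """Map a max Jaccard similarity to a tier using the listed ranges only.

    Assumption: boundaries are lower-inclusive (0.015 counts as moderate).
    The phrase-hit and multi-flag rules are not modelled.
    """
    if max_sim >= 0.03:
        return "contaminated"
    if max_sim >= 0.015:
        return "moderate"
    if max_sim >= 0.005:
        return "minor"
    return "clean"


for value in (0.1167, 0.02, 0.0078, 0.0):
    print(value, tier_from_similarity(value))
```

Since the listed rules also involve phrase hits and flag counts, this function alone should not be expected to reproduce every tier in the full results table.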

The thresholds were tuned against an empirical labelled set of known-contaminated and known-clean datasets — see the empirical-threshold-validation report at validation/tools/benchmark-audit/REPORT.md.

Sample size caveat: 600 records is a representative sample, not a full enumeration. A dataset scored "clean" here may have hidden contamination in unsampled records. Re-scanning at a higher sample size or against a private fingerprint set is available via the Enterprise tier.
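To give a rough sense of how much a 600-record sample can miss, the sketch below computes the chance that a sample contains no contaminated records at all. It assumes records are drawn uniformly and independently from a dataset much larger than the sample; the sampling procedure actually used is not specified here, so treat this as an illustrative approximation. The function name is hypothetical.

```python
def miss_probability(contaminated_fraction, sample_size=600):
    """Probability that a sample contains zero contaminated records.

    Assumption: independent uniform draws, so each record is contaminated
    with probability `contaminated_fraction`.
    """
    return (1.0 - contaminated_fraction) ** sample_size


for fraction in (0.0001, 0.001, 0.01):
    p = miss_probability(fraction)
    print(f"{fraction:.2%} contaminated -> miss probability {p:.3f}")
```

Under these assumptions, contamination affecting around one record in a thousand is missed entirely more often than not, while contamination at the one-percent level is very unlikely to go unsampled.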

Recourse. If you maintain a dataset listed here and believe a tier or score is wrong, the recourse process is the same as for LQS audits — documented in the methodology preprint §7. File an issue with a counter-citation (e.g. "the n-gram overlap is from a public-domain template, not from MMLU itself") and we'll publish a v1.1 with the correction. We do not modify results under non-public pressure. Every result carries an immutable cert hash; corrections are issued as new versions.

What's next

Contamination Report 002 will expand scope from post-training to pretraining corpora — Common Crawl, The Pile, RedPajama-V2, OSCAR, Dolma, C4, FineWeb. That's the layer where benchmark contamination most affects the score every model card reports, and where the upstream surface determines what every downstream fine-tune inherits.

If you're a maintainer interested in being scanned before the next report lands, reach out via /contact.
