Contamination Report 001

80 post-training datasets scanned. 2 flagged contaminated.

Public n-gram overlap scan of widely-used instruction, RLHF, preference, and reasoning datasets on HuggingFace against 40+ public evaluation benchmark fingerprints (MMLU, HellaSwag, GSM8K, ARC, HumanEval, TruthfulQA, BBH, and 33 more). Open methodology, raw similarity metrics, every dataset listed below.

80 datasets scanned
53 clean
13 minor / moderate
2 contaminated

The headline finding

Of the 80 popular HuggingFace post-training datasets we scanned, 2 are measurably contaminated with public eval benchmark content (Jaccard similarity above the LQS v3.1 contamination threshold). Another 1 sits at moderate and 12 at minor: measurable n-gram overlap that doesn't yet trip the threshold but is worth disclosing to downstream model trainers. Of the rest, 53 are clean and 12 have no reported result and are listed as unknown.

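The two flagged datasets are garage-bAInd/Open-Platypus (max similarity 0.1167, against MATH) and TIGER-Lab/MMLU-Pro (max similarity 0.0517, against MMLU).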

Why this matters. Procurement teams citing a model's MATH or MMLU score in SR 11-7, EU AI Act Art. 10, FDA 21 CFR Part 11, or ACA §1557 paperwork inherit the contamination surface of the model's training corpus. This scan publishes that surface for 80 popular datasets so that disclosure is mechanical, not narrative.

Benchmark appearance — how often does each public eval show up in scan results?

For each of the 40+ public evaluation benchmark fingerprints indexed by the LQS contamination scanner, we counted how many of the 80 scanned datasets had any measurable n-gram overlap (similarity > 0). The most frequently matched benchmarks are listed below.

| Benchmark | Datasets with any overlap | Share of scanned (80) |
|---|---|---|
| HumanEval | 27 | 33.8% |
| MATH | 23 | 28.8% |
| ARC Challenge | 23 | 28.8% |
| TruthfulQA | 20 | 25.0% |
| GSM8K | 19 | 23.8% |
| WinoGrande | 18 | 22.5% |
| SQuAD v2.0 | 14 | 17.5% |
| CodeContests | 13 | 16.3% |
| SQuAD v1.1 | 11 | 13.8% |
| MMLU | 7 | 8.8% |
| HellaSwag | 3 | 3.8% |

Interpretation: "Any overlap" means the dataset had non-zero similarity with the benchmark, not that it's contaminated. Most overlaps are below the threshold. The signal worth tracking is the worst-benchmark column in the full table below.
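The counting rule described above can be sketched in a few lines of Python. This is an illustration only: the function name `benchmark_appearance`, the shape of `scan_results`, and the toy similarity values are assumptions made for the example, not the LQS scanner's actual interface or data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def benchmark_appearance(
    scan_results: Dict[str, Dict[str, float]],
) -> List[Tuple[str, int, float]]:
    """Count, per benchmark, how many datasets show non-zero similarity.

    scan_results maps dataset name -> {benchmark name: similarity}
    (a hypothetical layout chosen for this sketch).
    Returns (benchmark, dataset_count, share_percent) rows, most frequent first.
    """
    total = len(scan_results)
    counts: Counter = Counter()
    for similarities in scan_results.values():
        for benchmark, similarity in similarities.items():
            if similarity > 0:  # "any overlap", not "contaminated"
                counts[benchmark] += 1
    return [
        (benchmark, count, round(100 * count / total, 1))
        for benchmark, count in counts.most_common()
    ]


# Toy input with made-up values, only to show the shape of the output.
toy_results = {
    "dataset-a": {"HumanEval": 0.02, "MATH": 0.0},
    "dataset-b": {"HumanEval": 0.01, "MATH": 0.03},
    "dataset-c": {"HumanEval": 0.0},
    "dataset-d": {},
}
for row in benchmark_appearance(toy_results):
    print(row)
```

Note that a dataset counts toward a benchmark as soon as its similarity is above zero, which is why these counts run far higher than the number of datasets actually flagged.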

Full results — all 80 scanned datasets

Sorted: contaminated → moderate → minor → clean → unknown, then by max similarity descending within each tier.

| Dataset | Tier | Score | Worst benchmark | Max sim | Sample |
|---|---|---|---|---|---|
| garage-bAInd/Open-Platypus | contaminated | 25 | MATH | 0.1167 | 600 |
| TIGER-Lab/MMLU-Pro | contaminated | 44 | MMLU | 0.0517 | 600 |
| jondurbin/py-dpo-v0.1 | moderate | 68 | HumanEval | 0.0391 | 600 |
| nickrosh/Evol-Instruct-Code-80k-v1 | minor | 75 | HumanEval | 0.0391 | 600 |
| AI-MO/NuminaMath-CoT | minor | 72 | CodeContests | 0.0313 | 600 |
| meta-math/MetaMathQA | minor | 70 | MATH | 0.0283 | 600 |
| nvidia/OpenMathInstruct-1 | minor | 86 | MATH | 0.0234 | 600 |
| m-a-p/Code-Feedback | minor | 82 | HumanEval | 0.0234 | 600 |
| mlabonne/guanaco-llama2-1k | minor | 85 | TruthfulQA | 0.0156 | 300 |
| jondurbin/airoboros-2.2.1 | minor | 87 | TruthfulQA | 0.0156 | 600 |
| TIGER-Lab/MathInstruct | minor | 87 | ARC Challenge | 0.0156 | 600 |
| ise-uiuc/Magicoder-Evol-Instruct-110K | minor | 87 | CodeContests | 0.0156 | 600 |
| iamtarun/python_code_instructions_18k_alpaca | minor | 82 | HumanEval | 0.0100 | 600 |
| teknium/OpenHermes-2.5 | minor | 85 | GSM8K | 0.0078 | 600 |
| argilla/OpenHermes2.5-dpo-binarized-alpha | minor | 87 | GSM8K | 0.0078 | 600 |
| microsoft/orca-math-word-problems-200k | clean | 92 | MATH | 0.0313 | 600 |
| sahil2801/CodeAlpaca-20k | clean | 90 | HumanEval | 0.0313 | 600 |
| HuggingFaceH4/no_robots | clean | 94 | WinoGrande | 0.0234 | 600 |
| Open-Orca/FLAN | clean | 99 | CodeContests | 0.0234 | 600 |
| camel-ai/math | clean | 97 | MATH | 0.0234 | 600 |
| camel-ai/chemistry | clean | 99 | MATH | 0.0234 | 600 |
| TokenBender/code_instructions_122k_alpaca_style | clean | 92 | HumanEval | 0.0234 | 600 |
| tatsu-lab/alpaca | clean | 100 | WinoGrande | 0.0156 | 600 |
| yahma/alpaca-cleaned | clean | 100 | SQuAD v2.0 | 0.0156 | 600 |
| vicgalle/alpaca-gpt4 | clean | 100 | HumanEval | 0.0156 | 600 |
| databricks/databricks-dolly-15k | clean | 92 | WinoGrande | 0.0156 | 600 |
| WizardLMTeam/WizardLM_evol_instruct_70k | clean | 100 | HumanEval | 0.0156 | 597 |
| jondurbin/airoboros-3.2 | clean | 100 | TruthfulQA | 0.0156 | 600 |
| migtissera/Synthia-v1.3 | clean | 97 | SQuAD v2.0 | 0.0156 | 600 |
| Open-Orca/OpenOrca | clean | 97 | ARC Challenge | 0.0156 | 600 |
| Open-Orca/SlimOrca | clean | 92 | HumanEval | 0.0156 | 600 |
| Open-Orca/SlimOrca-Dedup | clean | 92 | HumanEval | 0.0156 | 600 |
| teknium/openhermes | clean | 97 | SQuAD v2.0 | 0.0156 | 556 |
| HuggingFaceH4/ultrachat_200k | clean | 100 | HumanEval | 0.0156 | 600 |
| Muennighoff/flan | clean | 100 | ARC Challenge | 0.0156 | 499 |
| camel-ai/physics | clean | 100 | MATH | 0.0156 | 600 |
| glaiveai/glaive-code-assistant | clean | 92 | HumanEval | 0.0156 | 600 |
| Anthropic/hh-rlhf | clean | 100 | TruthfulQA | 0.0156 | 600 |
| nvidia/HelpSteer | clean | 100 | SQuAD v2.0 | 0.0156 | 562 |
| Intel/orca_dpo_pairs | clean | 97 | ARC Challenge | 0.0156 | 600 |
| argilla/distilabel-intel-orca-dpo-pairs | clean | 95 | CodeContests | 0.0156 | 600 |
| argilla/dpo-mix-7k | clean | 100 | GSM8K | 0.0156 | 600 |
| HuggingFaceH4/ultrafeedback_binarized | clean | 90 | MATH | 0.0156 | 600 |
| berkeley-nest/Nectar | clean | 92 | HumanEval | 0.0156 | 600 |
| allenai/WildChat-1M | clean | 100 | ARC Challenge | 0.0156 | 600 |
| 0-hero/Matter-0.1 | clean | 92 | TruthfulQA | 0.0156 | 596 |
| open-web-math/open-web-math | clean | 100 | ARC Challenge | 0.0156 | 600 |
| timdettmers/openassistant-guanaco | clean | 100 | WinoGrande | 0.0078 | 300 |
| WizardLMTeam/WizardLM_evol_instruct_V2_196k | clean | 92 | HumanEval | 0.0078 | 597 |
| jondurbin/truthy-dpo-v0.1 | clean | 100 | TruthfulQA | 0.0078 | 600 |
| Norquinal/claude_multiround_chat_30k | clean | 100 | GSM8K | 0.0078 | 600 |
| LDJnr/Pure-Dove | clean | 100 | WinoGrande | 0.0078 | 600 |
| LDJnr/Verified-Camel | clean | 100 | MMLU | 0.0078 | 254 |
| allenai/tulu-v2-sft-mixture | clean | 100 | ARC Challenge | 0.0078 | 494 |
| LDJnr/Capybara | clean | 100 | MMLU | 0.0078 | 600 |
| ise-uiuc/Magicoder-OSS-Instruct-75K | clean | 92 | HumanEval | 0.0078 | 600 |
| glaiveai/glaive-function-calling-v2 | clean | 97 | SQuAD v1.1 | 0.0078 | 600 |
| nvidia/HelpSteer2 | clean | 100 | TruthfulQA | 0.0078 | 588 |
| OpenAssistant/oasst1 | clean | 100 | HumanEval | 0.0078 | 600 |
| OpenAssistant/oasst2 | clean | 100 | TruthfulQA | 0.0078 | 600 |
| allenai/WildChat | clean | 100 | HellaSwag | 0.0078 | 600 |
| migtissera/Tess-v1.5 | clean | 100 | WinoGrande | 0.0078 | 600 |
| Salesforce/wikitext | clean | 100 | WinoGrande | 0.0078 | 209 |
| abisee/cnn_dailymail | clean | 100 | WinoGrande | 0.0078 | 600 |
| JeanKaddour/minipile | clean | 90 | WinoGrande | 0.0078 | 300 |
| HuggingFaceTB/cosmopedia-100k | clean | 100 | ARC Challenge | 0.0078 | 600 |
| camel-ai/biology | clean | 100 | | 0.0000 | 600 |
| Helsinki-NLP/europarl | clean | 100 | | 0.0000 | 452 |
| mlabonne/orca-dpo-pairs | unknown | | | | |
| cognitivecomputations/dolphin | unknown | | | | |
| stingning/ultrachat | unknown | | | | |
| conceptofmind/flan2021_submix_original | unknown | | | | |
| conceptofmind/cot_submix_original | unknown | | | | |
| conceptofmind/niv2_submix_original | unknown | | | | |
| conceptofmind/dialog_submix_original | unknown | | | | |
| camel-ai/ai_society | unknown | | | | |
| allenai/strategyqa | unknown | | | | |
| princeton-nlp/SimPO-data | unknown | | | | |
| HuggingFaceFW/fineweb | unknown | | | | |
| togethercomputer/RedPajama-Data-Instruct | unknown | | | | |
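For readers who want to reproduce the ordering from raw results, here is a minimal sketch of a two-level sort that mirrors the row order of the table above: tier first, then max similarity descending. The `Row` type, `TIER_ORDER` mapping, and `sort_rows` function are hypothetical names introduced for this example; the five rows are copied from the table.

```python
from typing import List, NamedTuple, Optional

# Tier rank as the rows appear in the table; lower sorts first.
TIER_ORDER = {"contaminated": 0, "moderate": 1, "minor": 2, "clean": 3, "unknown": 4}


class Row(NamedTuple):
    dataset: str
    tier: str
    max_sim: Optional[float]  # None for rows with no reported result


def sort_rows(rows: List[Row]) -> List[Row]:
    """Order by tier rank, then by max similarity descending within a tier."""
    return sorted(rows, key=lambda r: (TIER_ORDER[r.tier], -(r.max_sim or 0.0)))


rows = [
    Row("teknium/OpenHermes-2.5", "minor", 0.0078),
    Row("TIGER-Lab/MMLU-Pro", "contaminated", 0.0517),
    Row("mlabonne/orca-dpo-pairs", "unknown", None),
    Row("garage-bAInd/Open-Platypus", "contaminated", 0.1167),
    Row("tatsu-lab/alpaca", "clean", 0.0156),
]
for row in sort_rows(rows):
    print(row.dataset, row.tier, row.max_sim)
```

Because Python's `sorted` is stable, rows that tie on both tier and max similarity keep their input order.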

Methodology

Each dataset was sampled at the listed sample size (default 600 records per split). Sampled records were tokenized into n-grams (n=8 by default for text, n=4 for code). N-gram sets were compared against the LQS benchmark fingerprint registry — a deduplicated index of n-grams drawn from the train + validation + test splits of 40+ widely-cited public evaluation suites. Maximum Jaccard similarity across all (split × benchmark) pairs is reported as max_similarity; the worst-matching benchmark is recorded.
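As a rough illustration of the pipeline just described, the sketch below builds n-gram sets, computes Jaccard similarity |A ∩ B| / |A ∪ B|, and takes the maximum across benchmarks. Several things here are assumptions rather than the scanner's documented behaviour: lowercased whitespace tokenization, the function names, and representing the fingerprint registry as a plain dictionary of sets. The toy call uses n=3 only to keep the strings short; the report's defaults are n=8 for text and n=4 for code.

```python
from typing import Dict, Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Set of word-level n-grams (assumed: lowercase, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dataset_ngrams(records: Iterable[str], n: int = 8) -> Set[NGram]:
    """Union of n-gram sets over all sampled records."""
    grams: Set[NGram] = set()
    for record in records:
        grams |= ngrams(record, n)
    return grams


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard similarity; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(
    sample: Set[NGram], fingerprints: Dict[str, Set[NGram]]
) -> Tuple[str, float]:
    """Return (worst_benchmark, max_similarity); ("", 0.0) if nothing overlaps."""
    worst, best = "", 0.0
    for name, fingerprint in fingerprints.items():
        similarity = jaccard(sample, fingerprint)
        if similarity > best:
            worst, best = name, similarity
    return worst, best


# Toy data, not real benchmark content.
sample = dataset_ngrams(["the quick brown fox jumps"], n=3)
fingerprints = {
    "bench_a": ngrams("quick brown fox jumps over", 3),
    "bench_b": ngrams("an unrelated sentence here", 3),
}
print(max_similarity(sample, fingerprints))
```

In this sketch a dataset with no overlap at all gets an empty worst-benchmark name and a similarity of 0.0, which is how the two 0.0000 rows in the results table read.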

Tier thresholds:

The thresholds were tuned against an empirical labelled set of known-contaminated and known-clean datasets — see the empirical-threshold-validation report at validation/tools/benchmark-audit/REPORT.md.

Sample size caveat: 600 records is a sample, not a full enumeration, and some datasets were scanned at a smaller sample size (see the Sample column). A dataset scored "clean" here may still contain contamination in unsampled records. Re-scanning at a higher sample size or against a private fingerprint set is available via the Enterprise tier.
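To put a number on that caveat, the sketch below estimates how likely a sample is to contain no contaminated record at all. It assumes records are drawn uniformly and independently (reasonable when the dataset is much larger than the sample), which may not match how the scanner actually samples; `miss_probability` is a name made up for this example.

```python
def miss_probability(contaminated_fraction: float, sample_size: int = 600) -> float:
    """Chance that a uniform random sample draws zero contaminated records.

    Assumes independent draws, i.e. the dataset is much larger than the sample.
    """
    if not 0.0 <= contaminated_fraction <= 1.0:
        raise ValueError("contaminated_fraction must be between 0 and 1")
    return (1.0 - contaminated_fraction) ** sample_size


for fraction in (0.0001, 0.001, 0.01):
    print(f"{fraction:.2%} contaminated -> "
          f"{miss_probability(fraction):.1%} chance the sample misses it")
```

Under these assumptions, a dataset in which 0.1% of records are contaminated would yield a 600-record sample with no contaminated record a little over half the time. Drawing a contaminated record is also only a precondition for a flag: whether the resulting overlap crosses a tier threshold is a separate question.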

Recourse. If you maintain a dataset listed here and believe a tier or score is wrong, the recourse process is the same as for LQS audits — documented in the methodology preprint §7. File an issue with a counter-citation (e.g. "the n-gram overlap is from a public-domain template, not from MMLU itself") and we'll publish a v1.1 with the correction. We do not modify results under non-public pressure. Every result carries an immutable cert hash; corrections are issued as new versions.

What's next

Contamination Report 002 will expand scope from post-training to pretraining corpora — Common Crawl, The Pile, RedPajama-V2, OSCAR, Dolma, C4, FineWeb. That's the layer where benchmark contamination most affects the score every model card reports, and where the upstream surface determines what every downstream fine-tune inherits.

If you're a maintainer interested in being scanned before the next report lands, reach out via /contact.
