Public LQS audits

Independent procurement-grade audits of the datasets the AI industry actually uses.

Open methodology. Signed result. Recourse documented. Every audit translates an existing dataset's documentation, ecosystem, and ground-truth signals into a form a procurement, model-risk, or compliance team can cite directly.

Methodology: DOI 10.5281/zenodo.20278981 · CC BY 4.0 · RSS

Report 006 · Healthcare · May 19, 2026 · 12 min

93/100 · Platinum

MIMIC-IV — first procurement-grade healthcare audit. Platinum tier under the FDA 21 CFR 11 / HIPAA / §1557 lens.

The clinical dataset MIT and Beth Israel Deaconess Medical Center built for procurement-grade research. Credentialed access, IRB-waived under HIPAA Safe Harbor, Nature-published, multi-decade maintainer commitment. Two procurement caveats: single-site Boston provenance, and de-identification that targets HIPAA rather than GDPR.

Read the audit →

Report 005 · Code · May 19, 2026 · 9 min

83/100 · Gold

HumanEval — a procurement-grade audit of the code benchmark every model card cites.

164 hand-crafted Python problems. Test-driven validation eliminates the label-noise concern that hits MMLU, but two structural ceilings remain: a 164-problem sample size that produces Wilson confidence intervals wider than the gaps between leading models, and a contamination surface that grows each month as The Stack and GitHub crawls expand.
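To make the sample-size ceiling concrete, here is a minimal sketch of the standard Wilson score interval at n = 164. The pass count of 148 is a hypothetical illustration, not a result from the audit, and `wilson_interval` is a name chosen for this example.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical model that solves 148 of the 164 problems (about 90% pass@1).
low, high = wilson_interval(148, 164)
print(f"pass rate {148 / 164:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

With these assumed numbers the interval spans roughly 85% to 94%, so two models a few points apart on a 164-problem benchmark cannot be separated with confidence from this sample alone.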

Read the audit →

Report 003 · May 19, 2026 · 10 min

81/100 · Gold

RedPajama-V2 — a procurement-grade audit of Together AI's 30T-token open pretraining corpus.

Gold tier. The strongest open pretraining corpus that most teams in industry actually run against. Two procurement-relevant gaps: a Hugging Face metadata license field that returns "unknown," and a quality classifier whose downstream effect on the data distribution is hard to audit. Five Indo-European languages, three quality buckets.

Read the audit →

Report 002 · May 19, 2026 · 11 min

83/100 · Gold

MMLU — a procurement-grade audit of the most-cited LLM benchmark.

Gold tier — not Platinum. The score every model card reports is partially memorized, not earned. Crowdsourced labels with no disclosed QC; independent re-annotation studies find 25–30%+ label noise. Heavy contamination across the open training-data ecosystem. Use as one signal among several, not as a clean capability anchor.

Read the audit →

Report 001 · May 13, 2026 · 9 min

73/100 · Silver

FineWeb-Edu — a procurement-grade audit of HuggingFace's flagship 1.3T-token corpus.

Silver tier. World-class documentation. A circular LLM-as-judge dependency we think procurement teams should understand. An ODC-By attribution gap that almost every commercial user violates. Open methodology, signed result, recourse process documented.

Read the audit →

Planned reports

next 90 days · subject to capacity

Contam. 001

Contamination Report 001: 80 post-training datasets

Real n-gram overlap scan results across 80 popular post-training / RLHF / instruction datasets against 40+ public eval benchmark fingerprints. 2 contaminated, 1 moderate, 12 minor, 53 clean.

Live · May 19

Report 004

The Pile (EleutherAI)

Foundational. Known Books3 / copyright surface. The most-litigated open pretraining corpus.

In progress

Report 007

Radiology imaging corpus (CheXpert or MIMIC-CXR)

First imaging-modality audit under the FDA SaMD lens.

Scoping
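Contamination Report 001 above describes an n-gram overlap scan against benchmark fingerprints. The exact tokenizer, n-gram length, and severity thresholds are not stated on this page, so the following is only a rough sketch under assumptions: whitespace tokenization, a configurable n, and fingerprints stored as a plain set. The names `ngrams` and `overlap_ratio` are hypothetical.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams (an assumed fingerprint format)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(sample: str, fingerprints: set[tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the sample's n-grams that also appear in the benchmark fingerprints."""
    grams = ngrams(sample, n)
    if not grams:
        return 0.0  # sample shorter than n tokens: nothing to compare
    return len(grams & fingerprints) / len(grams)

# Toy illustration with made-up strings; n=4 keeps the example short.
benchmark = ngrams("def add(a, b): return the sum of a and b as an integer value", n=4)
print(overlap_ratio("return the sum of a and b please", benchmark, n=4))
print(overlap_ratio("completely unrelated text about cooking pasta tonight", benchmark, n=4))
```

A per-dataset score could then aggregate these ratios across samples; how that maps to the contaminated / moderate / minor / clean buckets is defined by the report, not by this sketch.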

How these audits work

Every audit scores the dataset under LQS v3.1: a 19-dimension procurement-grade quality standard with multi-oracle consensus, conformal-prediction intervals on downstream macro-F1, contamination checks across 40+ public evaluation suites, and Ed25519-signed certificates that auditors can verify offline. The methodology is published with a permanent DOI. Scores carry immutable cryptographic hashes; corrections are issued as new versions, never as silent edits.
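One way to picture why a hashed score cannot be edited silently: any change to the record changes its digest, so a correction has to ship as a new version with a new hash. The sketch below assumes SHA-256 over canonical JSON and a made-up record layout; the actual hash function and certificate format are defined in the methodology, not here.

```python
import hashlib
import json

def score_digest(record: dict) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical record layout; the score and tier are taken from the MMLU card above.
v1_0 = {"dataset": "MMLU", "score": 83, "tier": "Gold", "version": "1.0"}
# Hypothetical corrected release, invented purely to show that the digest changes.
v1_1 = {**v1_0, "score": 84, "version": "1.1"}

print(score_digest(v1_0))
print(score_digest(v1_1))
print(score_digest(v1_0) == score_digest(v1_1))
```

Key order does not affect the digest because the keys are sorted before hashing, which lets an auditor recompute it offline from the published fields.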

If you maintain a dataset audited here and believe a score is wrong, recourse is documented in the methodology preprint §7. File an issue with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure.

Methodology: DOI 10.5281/zenodo.20278981 · License: CC BY 4.0 · Author: Alex Adrion (Labelsets LLC)
