Public LQS audits

Independent procurement-grade audits of the datasets the AI industry actually uses.

Open methodology. Signed result. Recourse documented. Every audit translates an existing dataset's documentation, ecosystem, and ground-truth signals into a form a procurement, model-risk, or compliance team can cite directly.

Report 006 · Healthcare · May 19, 2026 · 12 min
93/100 · Platinum

MIMIC-IV — first procurement-grade healthcare audit. Platinum tier under the FDA 21 CFR Part 11 / HIPAA / §1557 lens.

The clinical dataset MIT and Beth Israel Deaconess Medical Center built for procurement-grade research. Credentialed access, IRB-waived under HIPAA Safe Harbor, Nature-published, multi-decade maintainer commitment. Two procurement caveats: single-site Boston provenance, and de-identification performed to the HIPAA standard rather than the GDPR standard.

Read the audit →
Report 005 · Code · May 19, 2026 · 9 min
83/100 · Gold

HumanEval — a procurement-grade audit of the code benchmark every model card cites.

164 hand-crafted Python problems. Test-driven validation eliminates the label-noise concern that hits MMLU, but two structural ceilings remain: a 164-problem sample size, which produces Wilson confidence intervals wider than the gaps between leading models, and a contamination surface that grows every month as The Stack and GitHub crawls expand.
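
The sample-size ceiling can be made concrete with a small calculation. The sketch below computes a standard 95% Wilson score interval for a pass rate measured on 164 problems. The function name `wilson_interval` and the example of a model solving 148 of 164 problems are illustrative assumptions, not figures from the audit.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin


# Hypothetical model: 148 of 164 problems solved (assumed numbers for illustration).
low, high = wilson_interval(148, 164)
print(f"pass rate {148 / 164:.3f}, 95% Wilson interval [{low:.3f}, {high:.3f}]")
```

Under these assumed inputs the interval is roughly nine percentage points wide, so two models a few points apart cannot be separated on this benchmark alone.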

Read the audit →
Report 003 · May 19, 2026 · 10 min
81/100 · Gold

RedPajama-V2 — a procurement-grade audit of Together AI's 30T-token open pretraining corpus.

Gold tier. The strongest open pretraining corpus that most industry teams actually work with. Two procurement-relevant gaps: an HF-metadata license signal that returns "unknown," and a quality classifier whose downstream effect on the data distribution is hard to audit. Five Indo-European languages, three quality buckets.

Read the audit →
Report 002 · May 19, 2026 · 11 min
83/100 · Gold

MMLU — a procurement-grade audit of the most-cited LLM benchmark.

Gold tier — not Platinum. The score every model card reports is partially memorized, not earned. Crowdsourced labels with no disclosed QC; independent re-annotation studies find 25–30%+ label noise. Heavy contamination across the open training-data ecosystem. Use as one signal among several, not as a clean capability anchor.

Read the audit →
Report 001 · May 13, 2026 · 9 min
73/100 · Silver

FineWeb-Edu — a procurement-grade audit of Hugging Face's flagship 1.3T-token corpus.

Silver tier. World-class documentation. A circular LLM-as-judge dependency we think procurement teams should understand. An ODC-By attribution requirement that almost every commercial user violates. Open methodology, signed result, recourse process documented.

Read the audit →

Planned reports

next 90 days · subject to capacity
Contam. 001
Contamination Report 001 — 80 post-training datasets
Real n-gram overlap scan results across 80 popular post-training / RLHF / instruction datasets against 40+ public eval benchmark fingerprints. 2 contaminated, 1 moderate, 12 minor, 53 clean.
Live · May 19

Report 004
The Pile (EleutherAI)
Foundational. Known Books3 / copyright surface. The most-litigated open pretraining corpus.
In progress

Report 007
Radiology imaging corpus (CheXpert or MIMIC-CXR)
First imaging-modality audit under the FDA SaMD lens.
Scoping
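
For readers unfamiliar with n-gram overlap scans of the kind Contamination Report 001 describes, here is a minimal sketch of the general technique. It is not the report's actual pipeline: the function names, the whitespace tokenization, and the default window of 13 tokens are all assumptions made for illustration, and no contaminated/clean thresholds are implied.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in text (empty if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_text: str, eval_text: str, n: int = 13) -> float:
    """Fraction of the eval item's n-grams that also occur in the training text."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_text, n)) / len(eval_grams)


# Toy strings (invented for illustration) with a short window so the overlap is visible.
benchmark_item = "what is the capital of france"
training_sample = "quiz question what is the capital of france answer paris"
print(overlap_ratio(training_sample, benchmark_item, n=4))
```

A production scan would typically hash the n-grams into compact fingerprints and normalize punctuation before comparing; this sketch keeps the raw token tuples for readability.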

How these audits work

Every audit scores the dataset under LQS v3.1: a 19-dimension procurement-grade quality standard with multi-oracle consensus, conformal-prediction intervals on downstream macro-F1, contamination checks across 40+ public evaluation suites, and Ed25519-signed certificates that auditors can verify offline. The methodology is published with a permanent DOI. Scores carry immutable cryptographic hashes; corrections are issued as new versions, never silent edits.
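
To illustrate what a conformal-prediction interval on macro-F1 looks like, the sketch below implements plain split conformal prediction. It is a generic textbook version, not the LQS v3.1 procedure: the function names, the absolute-residual score, and the calibration residuals are all assumptions made up for the example.

```python
import math


def conformal_quantile(residuals: list[float], alpha: float = 0.1) -> float:
    """Split-conformal quantile: the k-th smallest residual, k = ceil((n+1)(1-alpha))."""
    n = len(residuals)
    if n == 0:
        raise ValueError("need at least one calibration residual")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]


def conformal_interval(prediction: float, residuals: list[float],
                       alpha: float = 0.1) -> tuple[float, float]:
    """Interval prediction +/- quantile, clipped to [0, 1] because macro-F1 lives there."""
    q = conformal_quantile(residuals, alpha)
    return max(0.0, prediction - q), min(1.0, prediction + q)


# Invented calibration residuals |observed F1 - predicted F1| on held-out datasets.
calibration = [i / 100 for i in range(1, 21)]
print(conformal_interval(0.80, calibration, alpha=0.1))
```

The coverage guarantee of this construction holds only if calibration and new datasets are exchangeable, which is an assumption any real audit would need to justify.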

If you maintain a dataset audited here and believe a score is wrong, recourse is documented in the methodology preprint §7. File an issue with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure.
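
The rule that corrections ship as new versions rather than silent edits can be sketched as a hash-linked record chain. This is an illustration only: the record fields, the canonical-JSON serialization, and the choice of SHA-256 are assumptions, since the page does not specify the hash construction, and the dataset name and scores are placeholders.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order never changes the hash."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def issue_correction(previous: dict, new_score: int, new_version: str,
                     changelog: str) -> dict:
    """Return a NEW record that points at the old one; the old record is left untouched."""
    corrected = dict(previous)
    corrected.update(
        score=new_score,
        version=new_version,
        changelog=changelog,
        supersedes=record_hash(previous),
    )
    return corrected


# Placeholder record, not a real audit result.
v1_0 = {"dataset": "example-dataset", "version": "1.0", "score": 70}
v1_1 = issue_correction(v1_0, 72, "1.1", "Counter-citation accepted for one dimension.")
print(record_hash(v1_0))
print(record_hash(v1_1))
```

Because the v1.1 record embeds the hash of v1.0, anyone holding the original can check that the earlier score was superseded rather than rewritten.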

Get the next audit when it lands.

One email per report. No marketing. Methodology updates included. Average cadence: every 2-3 weeks.

Score your own dataset →