Independent procurement-grade audits of the datasets the AI industry actually uses.
Open methodology. Signed result. Recourse documented. Every audit translates an existing dataset's documentation, ecosystem, and ground-truth signals into a form a procurement, model-risk, or compliance team can cite directly.
Methodology: DOI 10.5281/zenodo.20278981 · CC BY 4.0 · RSS
MIMIC-IV — first procurement-grade healthcare audit. Platinum tier under the FDA 21 CFR 11 / HIPAA / §1557 lens.
The clinical dataset MIT and Beth Israel Deaconess Medical Center built for procurement-grade research. Credentialed access, IRB-waived under HIPAA Safe Harbor, Nature-published, multi-decade maintainer commitment. Two procurement caveats: single-site Boston provenance, and de-identification that targets HIPAA rather than GDPR.
Read the audit →
HumanEval — a procurement-grade audit of the code benchmark every model card cites.
164 hand-crafted Python problems. Test-driven validation eliminates the label-noise concern that hits MMLU, but two structural ceilings remain: a 164-problem sample size that produces Wilson confidence intervals wider than the gaps between leading models, and a contamination surface that grows every month as The Stack and GitHub crawls expand.
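The sample-size ceiling can be made concrete with a short calculation. The following is a minimal sketch, not the audit's own code: it computes a 95% Wilson score interval for a pass rate measured on 164 problems. The function name `wilson_interval` and the example count of 148 solved problems are illustrative assumptions, not figures from the audit.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical model that solves 148 of the 164 problems (about 90% pass@1).
low, high = wilson_interval(148, 164)
print(f"pass@1 = {148 / 164:.3f}, 95% Wilson interval = [{low:.3f}, {high:.3f}]")
```

At this sample size the interval spans several percentage points, so two models whose scores differ by a point or two cannot be separated on this benchmark alone.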
Read the audit →
RedPajama-V2 — a procurement-grade audit of Together AI's 30T-token open pretraining corpus.
Gold tier. The strongest open pretraining corpus that most industry teams actually use. Two procurement-relevant gaps: an HF-metadata license signal that returns "unknown," and a quality classifier whose downstream effect on the data distribution isn't easy to audit. Five Indo-European languages, three quality buckets.
Read the audit →
MMLU — a procurement-grade audit of the most-cited LLM benchmark.
Gold tier — not Platinum. The score every model card reports is partially memorized, not earned. Crowdsourced labels with no disclosed QC; independent re-annotation studies find 25–30%+ label noise. Heavy contamination across the open training-data ecosystem. Use as one signal among several, not as a clean capability anchor.
Read the audit →
FineWeb-Edu — a procurement-grade audit of HuggingFace's flagship 1.3T-token corpus.
Silver tier. World-class documentation. A circular LLM-as-judge dependency we think procurement teams should understand. An ODC-By attribution gap that almost every commercial user violates. Open methodology, signed result, recourse process documented.
Read the audit →
Planned reports
next 90 days · subject to capacity
How these audits work
Every audit scores the dataset under LQS v3.1: a 19-dimension procurement-grade quality standard with multi-oracle consensus, conformal-prediction intervals on downstream macro-F1, contamination checks across 40+ public evaluation suites, and Ed25519-signed certificates that auditors can verify offline. Methodology is published with a permanent DOI. Scores carry immutable cryptographic hashes; corrections are issued as new versions, never silent edits.
If you maintain a dataset audited here and believe a score is wrong, recourse is documented in the methodology preprint §7. File an issue with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure.
Get the next audit when it lands.
One email per report. No marketing. Methodology updates included. Average cadence: every 2-3 weeks.
Score your own dataset →