Procurement of training data for AI systems in regulated industries (financial services, healthcare, legal) currently lacks an independent quality measurement that satisfies model-risk audit requirements such as SR 11-7, EU AI Act Article 10, FDA 21 CFR 11.10(e), and HHS §1557.
We introduce LQS v3.1, a 19-dimension quality standard for tabular, text, and image datasets that addresses three documented weaknesses of existing single-model quality scores: (1) reference-model bias, via a 7-oracle consensus spanning 5 algorithm families with cross-validated agreement reporting (Cohen's κ and Fleiss' κ); (2) brittleness of metadata-derived task inference, via a data-driven task-detection layer with explicit ambiguity flagging; (3) over-confidence of point estimates, via Wilson binomial intervals on rate-based dimensions, pooled-fold standard deviations on oracle-derived dimensions, and bootstrap intervals on the composite.
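As an illustration of the interval machinery on rate-based dimensions, the Wilson score interval can be sketched as below. This is a minimal sketch, not the reference implementation; the default z-level and the zero-sample fallback are assumptions here, not taken from the spec.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial rate (95% at the default z).

    Sketch of the interval attached to rate-based dimensions; the spec's
    exact z-level and edge-case handling may differ.
    """
    if n == 0:
        return (0.0, 1.0)  # assumed fallback: no observations, no information
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and remains informative near rates of 0 or 1, which is why it suits per-record pass/fail dimensions.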
We add inductive split-conformal prediction (Vovk et al. 2005; Romano et al. 2019), producing 90% prediction intervals on downstream macro-F1 with distribution-free coverage guarantees, and a graded benchmark-contamination dimension covering 40+ public evaluation suites (MMLU, HumanEval, GSM8K, SQuAD, etc.). Every score is bound to a canonical-JSON-serialized payload and signed with an Ed25519 keypair, yielding a cryptographically verifiable certificate that can be audited offline against a published public key.
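The split-conformal construction can be sketched in a few lines. This is a hedged illustration, not the LQS implementation: the absolute-residual nonconformity score and the calibration split are assumptions, while the finite-sample quantile correction is standard to the method.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.10):
    """Inductive split-conformal intervals with 1 - alpha marginal coverage.

    Sketch: absolute residuals on a held-out calibration split serve as
    nonconformity scores; the spec's exact score function may differ.
    """
    scores = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(scores)
    # Finite-sample corrected quantile level: ceil((n+1)(1-alpha)) / n
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    test_pred = np.asarray(test_pred)
    return test_pred - q, test_pred + q
```

Coverage holds marginally under exchangeability of calibration and test points, without any assumption on the underlying model, which is the sense in which the guarantee is distribution-free.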
We provide a reference implementation, a public verification API, and an SDK with no-auth verification helpers. The full LQS v3.1 specification is presented as a candidate reference methodology for ongoing standards work in IEEE P2841, NIST AI RMF, and ISO/IEC JTC 1 SC 42.
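The sign-then-verify flow behind the certificates can be sketched with the `cryptography` package. The payload fields below are placeholders, not the LQS certificate schema, and "canonical JSON" is assumed here to mean sorted keys with no insignificant whitespace; the spec's canonicalization rule may differ.

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def canonical_payload(score_report: dict) -> bytes:
    # Assumed canonical JSON: sorted keys, no whitespace, UTF-8 bytes.
    return json.dumps(score_report, sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False).encode()

# Issuer side: serialize the score report canonically, then sign it.
private_key = Ed25519PrivateKey.generate()
payload = canonical_payload({"composite": 91.2, "version": "3.1"})
signature = private_key.sign(payload)

# Verifier side: offline check against the published public key.
# verify() raises cryptography.exceptions.InvalidSignature on any tampering.
public_key = private_key.public_key()
public_key.verify(signature, payload)
```

Because signing happens over the canonical byte string rather than any particular pretty-printed rendering, two verifiers who re-serialize the same report independently will check the same bytes.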
Each axis is a quality dimension scored 0–100. The composite is a weighted aggregate of the 19 axes, with a bootstrap interval on the composite and per-dimension Wilson or pooled-fold intervals on the spokes.
A single weak spoke is exactly the kind of failure model-risk auditors look for. A composite of 91 with a sub-50 contamination axis presents a different risk profile from a composite of 91 with no weak spokes, and the radar is the only view that surfaces the difference at a glance.
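The bootstrap interval on the weighted composite can be sketched as a percentile bootstrap over records. The per-record score matrix and the normalization of weights are illustrative assumptions, not the published LQS weighting scheme.

```python
import numpy as np

def composite_bootstrap(per_record, weights, n_boot=2000, alpha=0.10, seed=0):
    """Weighted composite with a percentile-bootstrap interval.

    Assumption (not from the spec): per_record is an (n_records, n_dims)
    matrix of per-record dimension scores; resampling whole records
    propagates sampling noise into the composite.
    """
    rng = np.random.default_rng(seed)
    per_record = np.asarray(per_record, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize weights
    point = per_record.mean(axis=0) @ w               # point estimate
    n = per_record.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))        # resample records
    boots = per_record[idx].mean(axis=1) @ w          # composite per replicate
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)
```

Resampling records rather than dimensions keeps cross-dimension correlations intact, so the interval reflects the fact that a noisy subset of records degrades several spokes at once.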
Keywords: dataset quality, multi-oracle consensus, confidence intervals, contamination detection, scaling laws, adversarial robustness, fairness, cryptographic certificates, procurement-grade ML
@misc{labelsets2026lqsv31,
  title  = {LQS v3.1: A Procurement-Grade Quality Standard for AI Training
            Data with Cryptographically Verifiable Certificates},
  author = {{LabelSets Research}},
  year   = {2026},
  month  = apr,
  url    = {https://labelsets.ai/paper.pdf},
  note   = {Reference implementation: labelsets.ai. Principal author
            identity disclosed under NDA pending counsel review.}
}
The full reference implementation is deployed at labelsets.ai. The public verification API accepts any LQS certificate hash and returns the signed payload plus signature validation:
GET /api/verify-lqs-cert/:hash
If you are evaluating training-data quality for an SR 11-7, EU AI Act, FDA, or §1557 model package: we are selecting 5–10 design partners in regulated industries for six months of LQS Enterprise, free in exchange for a logo and a short case study.