Buying training data has always been an act of faith. You read the listing, look at the sample preview, hope the labels are accurate, hope it isn't repackaged from a public source, hope it doesn't accidentally include the eval set you're about to benchmark on, hope your legal team won't kill the project six months later when GDPR comes up.

We've decided to remove every one of those hopes from the buying process. Every dataset on LabelSets now ships with five pieces of automated intelligence — generated from the moment the file is uploaded, attached to the listing as proof, and visible to buyers before they spend a dollar.

Here's what's new, why each one matters, and what it looks like when you open a dataset page.

1. Eval-Clean Certificate

01
✓ Eval-Clean

Verified contamination-free against 13 ML benchmarks

SQuAD, MMLU, HumanEval, GSM8K, HellaSwag, ARC, TruthfulQA, WinoGrande, MATH, CodeContests, and more

The single biggest hidden risk in training data is benchmark contamination — when a dataset secretly contains samples from the eval set you're about to benchmark on. The result: inflated metrics, unreproducible results, and embarrassing post-launch discoveries.

Every dataset on LabelSets is now scanned at upload time against MinHash signatures of 13 known eval sets, plus characteristic-phrase matching as a second layer. Datasets that pass receive a SHA-256 contamination certificate and the Eval-Clean badge. Datasets that fail are flagged with the specific benchmark they overlap with and a similarity score.
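To make the mechanism concrete, here is a minimal sketch of the MinHash idea behind the scan. This is an illustration, not LabelSets' implementation: the shingle size, number of permutations, and example strings are all made up, and the real system compares against signatures of the 13 benchmarks rather than single strings.

```python
import hashlib

def shingles(text, n=5):
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_perm=64):
    """One minimum per seeded hash function; together they approximate the set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima estimates Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# A known benchmark question vs. a leaked near-copy and an unrelated sample:
bench = "What is the capital of France and when was it founded by the Romans"
leaked = "What is the capital of France and when was it founded by Romans"
clean = "Describe the process of photosynthesis in flowering desert plants today"

sig_bench = minhash_signature(shingles(bench))
print(estimated_jaccard(sig_bench, minhash_signature(shingles(leaked))))  # high
print(estimated_jaccard(sig_bench, minhash_signature(shingles(clean))))   # near zero
```

Because signatures are fixed-length, a new upload can be compared against all 13 benchmark signatures without re-reading the benchmarks themselves; the characteristic-phrase pass then catches paraphrased overlaps that shingling misses.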

Why this matters: training on contaminated data invalidates any benchmark result you publish. Until now, the only way to detect this was to run your own contamination scan against every dataset you bought. We're doing it for you, automatically, for every listing.

2. Originality Verification

02
◆ Original

Multi-layer plagiarism + republishing detection

MinHash LSH · statistical fingerprinting · external registry · async web sampling

Every upload is checked against four originality signals before it can be listed: MinHash LSH near-duplicate matching, statistical fingerprinting, an external registry lookup, and async web sampling.

Datasets scoring above 0.95 originality earn the Original badge and an originality certificate. Datasets that match an existing source are blocked from listing.
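The decision logic can be sketched roughly as follows. The signal weights here are illustrative guesses, not LabelSets' actual calibration; only the 0.95 badge threshold and the block-on-match rule come from the listing rules above.

```python
# Illustrative weights only; the real calibration is not published.
SIGNAL_WEIGHTS = {
    "minhash_lsh": 0.40,   # near-duplicate overlap with existing listings
    "fingerprint": 0.25,   # statistical fingerprinting
    "registry": 0.20,      # external dataset registry lookup
    "web_sample": 0.15,    # async web sampling
}

def originality_score(novelty):
    """novelty maps each signal name to a 0-1 score (1 = fully novel)."""
    return sum(SIGNAL_WEIGHTS[k] * novelty[k] for k in SIGNAL_WEIGHTS)

def listing_decision(score, match_found=False):
    if match_found:
        return "blocked"                       # matches an existing source
    if score > 0.95:
        return "Original badge + certificate"  # threshold from the listing rules
    return "listed without badge"

print(listing_decision(originality_score(
    {"minhash_lsh": 1.0, "fingerprint": 0.98, "registry": 1.0, "web_sample": 0.97})))
# → Original badge + certificate
```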

Why this matters: the value of a marketplace dataset depends entirely on it being unique. If anyone can repackage Common Crawl and resell it, the price collapses and the trust evaporates. Originality verification is what keeps the marketplace investable.

3. Gebru Datasheet

03
📋 Datasheet

Auto-generated structured documentation following Gebru et al. (2018)

Motivation · Composition · Collection · Preprocessing · Uses · Distribution · Maintenance

The "Datasheets for Datasets" framework was proposed by Timnit Gebru and co-authors in 2018 as the gold standard for ML dataset documentation. The catch: writing a good Gebru datasheet by hand takes 4-6 hours per dataset. Almost nobody does it.

LabelSets now generates a complete Gebru datasheet automatically for every published dataset, populated from validation stats, provenance, label types, collection method, and seller-provided metadata. Buyers see seven structured sections: motivation, composition, collection process, preprocessing, intended uses (and unsuitable uses), distribution, and maintenance.

The datasheet renders inline on every dataset page under the new Intelligence tab, and is also available as raw markdown via GET /api/datasets/:id/datasheet?format=markdown for buyers who want to include it in their model card.
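Pulling the markdown version into a model-card build might look like this. The endpoint path and `format` parameter are taken from the announcement above; the host and the `ds_123` ID are placeholders for your own deployment and dataset.

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # used only by fetch_datasheet below

BASE = "https://labelsets.ai"  # assumed host; substitute your own

def datasheet_url(dataset_id: str, fmt: str = "markdown") -> str:
    """Build the GET /api/datasets/:id/datasheet URL described above."""
    return f"{BASE}/api/datasets/{dataset_id}/datasheet?{urlencode({'format': fmt})}"

def fetch_datasheet(dataset_id: str) -> str:
    """Download the raw markdown, e.g. to append to a model card."""
    with urlopen(datasheet_url(dataset_id)) as resp:
        return resp.read().decode("utf-8")

print(datasheet_url("ds_123"))
# https://labelsets.ai/api/datasets/ds_123/datasheet?format=markdown
```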

Why this matters: the EU AI Act requires structured training data documentation for high-risk systems. A Gebru datasheet is the cleanest way to satisfy that requirement, and now you don't have to write it yourself.

4. Compliance Report (GDPR / CCPA / HIPAA / EU AI Act)

04
🛡 Compliance

Multi-framework compliance assessment with score and gap analysis

GDPR Art. 30 · CCPA · HIPAA · EU AI Act Annex IV · combined report

Every dataset gets a full compliance report covering the four frameworks that matter for production ML: GDPR, CCPA, HIPAA, and the EU AI Act. Each report includes a 0-100 compliance score, identified gaps, recommendations, and a list of certifications the dataset qualifies for.

The report is generated by a structured Claude prompt working from the dataset's provenance fields (collection method, consent type, license, jurisdiction) and is stored as both structured JSON and rendered HTML. Buyers can read it before purchase and download the HTML version for their compliance file after buying.
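The structured-JSON side of the report might be shaped like this. The field names and example values are assumptions for illustration; only the four frameworks, the 0-100 score, and the gaps/recommendations/certifications contents come from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class FrameworkAssessment:
    framework: str          # "GDPR", "CCPA", "HIPAA", or "EU AI Act"
    score: int              # 0-100 compliance score for this framework
    gaps: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)

@dataclass
class ComplianceReport:
    dataset_id: str
    assessments: list       # one FrameworkAssessment per framework
    certifications: list    # certifications the dataset qualifies for

    @property
    def overall_score(self) -> int:
        return round(sum(a.score for a in self.assessments) / len(self.assessments))

report = ComplianceReport(
    dataset_id="ds_123",  # placeholder ID
    assessments=[
        FrameworkAssessment("GDPR", 88, gaps=["no Art. 30 record of processing"]),
        FrameworkAssessment("CCPA", 92),
    ],
    certifications=["consent-verified"],
)
print(report.overall_score)  # → 90
```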

Why this matters: compliance review is the longest pole in any enterprise data procurement process. Datasets that arrive with the report already done close 10× faster than ones that don't. We're shipping that report on every listing.

5. AI Valuation

05
Ⓥ Valued $720

Multi-factor fair-market value estimate with confidence range

Uniqueness 35% · Demand 25% · Quality 20% · Comparables 15% · Outcomes 5%

Every dataset gets a fair-market valuation computed from five weighted signals: uniqueness (35%), demand (25%), quality (20%), comparables (15%), and reported outcomes (5%).

The result is a USD estimate with low/high confidence range, displayed both as a card on the dataset page and as a price-vs-AI-valuation delta in the seller dashboard so sellers can see when they're under- or over-priced relative to the model.
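As a rough sketch of the arithmetic: the weights below are the published ones, but `base_price`, the input signal values, and the symmetric confidence spread are placeholders, not the real pricing model.

```python
# Weights as stated above; everything else here is an illustrative placeholder.
WEIGHTS = {"uniqueness": 0.35, "demand": 0.25, "quality": 0.20,
           "comparables": 0.15, "outcomes": 0.05}

def valuation(signals, base_price=1000.0, spread=0.25):
    """Weighted-sum USD estimate with a symmetric low/high confidence range."""
    score = sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)  # each signal in [0, 1]
    estimate = round(base_price * score)
    return {"estimate": estimate,
            "low": round(estimate * (1 - spread)),
            "high": round(estimate * (1 + spread))}

print(valuation({"uniqueness": 0.9, "demand": 0.7, "quality": 0.8,
                 "comparables": 0.5, "outcomes": 0.6}))
```

Because uniqueness carries the largest weight, two datasets of equal quality can value very differently, which is exactly the signal sellers of scarce data need.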

Why this matters: data pricing is opaque to both sides of the marketplace. Sellers underprice scarce data and overprice common data. Buyers have no way to compare apples to apples. A transparent multi-factor valuation gives both sides a defensible reference point.

How It All Fits Together

The five systems are not independent. They form a single intelligence pipeline that runs the moment a dataset is uploaded:

1. Upload: file integrity, magic bytes, malware, NSFW
2. Originality engine: originality_score, certificate
3. Benchmark contamination scan: eval-clean cert OR contaminated_with[]
4. LQS computation (14 dimensions): quality_breakdown
5. Publish, which auto-triggers:
6. Gebru datasheet (Claude): 7 structured sections + markdown
7. Compliance report (Claude): GDPR + CCPA + HIPAA + EU AI Act
8. Multi-factor valuation: $ estimate + range
9. Live on marketplace with all 5 badges
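In pseudocode-as-Python, the orchestration reduces to a gated pre-publish phase followed by auto-triggered reports. Every stage here is a stub standing in for the real service; only the ordering and the gate-vs-report split come from the pipeline above.

```python
def make_stage(name):
    """Stub stage: the real services do actual work; this just records order."""
    def stage(dataset):
        dataset.setdefault("completed", []).append(name)
        return dataset
    return stage

PRE_PUBLISH = [make_stage(n) for n in
               ("integrity", "originality", "contamination", "lqs")]
POST_PUBLISH = [make_stage(n) for n in
                ("datasheet", "compliance", "valuation")]

def run_pipeline(dataset):
    for stage in PRE_PUBLISH:           # gates: any one of these can block listing
        dataset = stage(dataset)
        if dataset.get("blocked"):
            return dataset              # e.g. contaminated or unoriginal
    dataset["published"] = True
    for stage in POST_PUBLISH:          # reports: auto-triggered on publish
        dataset = stage(dataset)        # (run asynchronously in practice)
    return dataset

print(run_pipeline({"id": "ds_123"})["completed"])
# → ['integrity', 'originality', 'contamination', 'lqs', 'datasheet', 'compliance', 'valuation']
```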

From the seller's perspective: drop the file, fill out the listing, click publish. Everything else happens automatically. From the buyer's perspective: every listing now includes the answers to questions you used to have to ask the seller individually.

What This Means for Sellers

You don't have to do anything. Every dataset you publish from now on will go through the full pipeline automatically. Existing flagship datasets have already been backfilled.

You also get a new Intelligence panel in your seller dashboard covering all five reports for each of your datasets.

What This Means for Buyers

When you open a dataset page, click the new Intelligence tab to see all five reports for that dataset.

You can also report your own training results via the new Report results button on every completed purchase in your buyer dashboard. Reported outcomes flow back into the LQS calibration engine and improve future quality scoring for that category.

This is the proprietary infrastructure that makes LabelSets defensible. Anyone can build a marketplace; what makes one valuable is the trust layer underneath. We've spent the last quarter building it. Browse the marketplace → and click into any listing to see the new Intelligence tab in action.

What's Next

The intelligence stack is live as of today, with more shipping over the coming weeks.

If you sell data on LabelSets, your existing listings already have the new badges where applicable, and your next upload will auto-generate the full intelligence package. If you buy data, the next time you open any listing you'll see the Intelligence tab with everything in it.

Questions? Ping us at support@labelsets.ai.