Buying training data has always been an act of faith. You read the listing, look at the sample preview, hope the labels are accurate, hope it isn't repackaged from a public source, hope it doesn't accidentally include the eval set you're about to benchmark on, hope your legal team won't kill the project six months later when GDPR comes up.
We've decided to remove every one of those hopes from the buying process. Every dataset on LabelSets now ships with five pieces of automated intelligence — generated from the moment the file is uploaded, attached to the listing as proof, and visible to buyers before they spend a dollar.
Here's what's new, why each one matters, and what it looks like when you open a dataset page.
1. Eval-Clean Certificate
Verified contamination-free against 13 ML benchmarks
The single biggest hidden risk in training data is benchmark contamination — when a dataset secretly contains samples from the eval set you're about to benchmark on. The result: inflated metrics, unreproducible results, and embarrassing post-launch discoveries.
Every dataset on LabelSets is now scanned at upload time against MinHash signatures of 13 known eval sets, plus characteristic-phrase matching as a second layer. Datasets that pass receive a SHA-256 contamination certificate and the Eval-Clean badge. Datasets that fail are flagged with the specific benchmark they overlap with and a similarity score.
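To make the mechanism concrete, here is a toy MinHash check in plain Python. The token hashing scheme, the 0.8 flag threshold, and the example signatures are illustrative assumptions; the production scanner and its benchmark signature store are not shown.

```python
import hashlib

NUM_PERM = 128

def minhash(tokens, num_perm=NUM_PERM):
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value seen over all tokens."""
    return [
        min(int.from_bytes(hashlib.sha256(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical comparison against one benchmark's stored signature:
dataset_sig = minhash("the quick brown fox jumps over the lazy dog".split())
bench_sig = minhash("the quick brown fox leaps over a sleepy dog".split())
if jaccard_estimate(dataset_sig, bench_sig) > 0.8:
    print("flag: possible benchmark overlap")
```

The key property is that two near-identical sample sets produce near-identical signatures, so overlap can be detected without comparing raw text pairwise.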
2. Originality Verification
Multi-layer plagiarism + republishing detection
Every upload is checked against four originality signals before it can be listed:
- MinHash LSH — 128-permutation near-duplicate signatures indexed with locality-sensitive hashing across the entire LabelSets corpus; catches reformatted copies of existing datasets
- Statistical fingerprinting — column distributions, vocabulary profiles, and color histograms detect structural copies even if the content is shuffled or renamed
- External registry search — Hugging Face and Papers With Code lookups catch sellers republishing public datasets under new names
- Async web sampling — random text passages from the dataset are checked against the open web after publication; the seller is flagged if matches cluster
Datasets scoring above 0.95 originality earn the Original badge and an originality certificate. Datasets that match an existing source are blocked from listing.
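The statistical-fingerprinting idea in the second signal can be sketched with a vocabulary profile: a shuffled or renamed copy keeps the same token distribution even when no passage matches verbatim. This is a minimal sketch; the production fingerprints (column distributions, color histograms) are richer.

```python
import math
from collections import Counter

def vocab_profile(texts):
    """Token-frequency profile of a corpus, normalized to unit length."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two normalized vocabulary profiles."""
    return sum(w * q.get(tok, 0.0) for tok, w in p.items())

original = vocab_profile(["the cat sat on the mat", "dogs chase cats"])
shuffled = vocab_profile(["cats chase dogs", "on the mat the cat sat"])
unrelated = vocab_profile(["stock prices rose sharply on friday"])

print(cosine(original, shuffled))   # high: same vocabulary, reordered
print(cosine(original, unrelated))  # low: different vocabulary
```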
3. Gebru Datasheet
Auto-generated structured documentation following Gebru et al. (2018)
The "Datasheets for Datasets" framework was proposed by Timnit Gebru and co-authors in 2018 as the gold standard for ML dataset documentation. The catch: writing a good Gebru datasheet by hand takes 4-6 hours per dataset. Almost nobody does it.
LabelSets now generates a complete Gebru datasheet automatically for every published dataset, populated from validation stats, provenance, label types, collection method, and seller-provided metadata. Buyers see seven structured sections: motivation, composition, collection process, preprocessing, intended uses (and unsuitable uses), distribution, and maintenance.
The datasheet renders inline on every dataset page under the new Intelligence tab, and is also available as raw markdown via GET /api/datasets/:id/datasheet?format=markdown for buyers who want to include it in their model card.
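Fetching the markdown variant from that endpoint looks roughly like this with the standard library. The `labelsets.ai` host and the bearer-token auth header are assumptions; check the API docs for the actual base URL and auth scheme.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://labelsets.ai"  # assumed host

def datasheet_url(dataset_id, fmt="markdown"):
    """Build the datasheet endpoint URL for a dataset."""
    return f"{BASE}/api/datasets/{dataset_id}/datasheet?" + urlencode({"format": fmt})

def fetch_datasheet(dataset_id, token):
    """Fetch the raw-markdown datasheet (auth scheme is an assumption)."""
    req = Request(datasheet_url(dataset_id),
                  headers={"Authorization": f"Bearer {token}"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")

print(datasheet_url("ds_123"))
```

The returned markdown can be pasted directly into a model card's dataset section.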
4. Compliance Report (GDPR / CCPA / HIPAA / EU AI Act)
Multi-framework compliance assessment with score and gap analysis
Every dataset gets a full compliance report covering the four frameworks that matter for production ML: GDPR, CCPA, HIPAA, and the EU AI Act. Each report includes a 0-100 compliance score, identified gaps, recommendations, and a list of certifications the dataset qualifies for.
The report is generated by a structured Claude prompt working from the dataset's provenance fields (collection method, consent type, license, jurisdiction) and is stored as both structured JSON and rendered HTML. Buyers can read it before purchase and download the HTML version for their compliance file after buying.
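The stored JSON looks roughly like the sketch below. Field names and values here are illustrative assumptions based on the description above, not the actual schema.

```python
import json

report = {
    "framework_scores": {  # 0-100 per framework
        "GDPR": 88, "CCPA": 92, "HIPAA": 40, "EU_AI_ACT": 76,
    },
    "gaps": [
        {"framework": "HIPAA",
         "issue": "No evidence of de-identification for health-related fields",
         "recommendation": "Document the de-identification method used, or mark the dataset as unsuitable for HIPAA-covered uses"},
    ],
    "qualifying_certifications": ["GDPR-ready", "CCPA-ready"],
}

# Simple mean across frameworks (the production aggregation is not public).
overall = sum(report["framework_scores"].values()) / len(report["framework_scores"])
print(json.dumps(report, indent=2))
print(f"overall: {overall:.0f}/100")
```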
5. AI Valuation
Multi-factor fair-market value estimate with confidence range
Every dataset gets a fair-market valuation computed from five weighted signals:
- Uniqueness (35%) — pulled from the originality engine
- Demand (25%) — derived from search logs and zero-result query patterns in the dataset's category
- Quality (20%) — the LabelSets Quality Score (LQS) composite
- Comparables (15%) — price positioning vs similar published datasets
- Outcomes (5%) — average improvement reported by buyers who trained on it
The result is a USD estimate with a low/high confidence range, displayed both as a card on the dataset page and as a price-vs-AI-valuation delta in the seller dashboard, so sellers can see when they're under- or over-priced relative to the model.
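The blend itself is just a weighted sum of the five factor scores. The weights below are the published ones; the price anchoring and the 20% spread used for the low/high range are illustrative assumptions.

```python
WEIGHTS = {  # published factor weights
    "uniqueness": 0.35, "demand": 0.25, "quality": 0.20,
    "comparables": 0.15, "outcomes": 0.05,
}

def valuation(factors, base_price, spread=0.2):
    """Blend 0-1 factor scores into a point estimate and a low/high range.
    base_price and spread are illustrative, not the production anchoring."""
    score = sum(WEIGHTS[k] * factors[k] for k in WEIGHTS)
    estimate = base_price * score
    return estimate * (1 - spread), estimate, estimate * (1 + spread)

low, mid, high = valuation(
    {"uniqueness": 0.9, "demand": 0.7, "quality": 0.8,
     "comparables": 0.6, "outcomes": 0.5},
    base_price=5000,
)
print(f"${low:,.0f} / ${mid:,.0f} / ${high:,.0f}")
```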
How It All Fits Together
The five systems are not independent. They form a single intelligence pipeline that runs the moment a dataset is uploaded: each report draws on the same validation and provenance data, and the originality score feeds directly into the valuation.
From the seller's perspective: drop the file, fill out the listing, click publish. Everything else happens automatically. From the buyer's perspective: every listing now includes the answers to questions you used to have to ask the seller individually.
What This Means for Sellers
You don't have to do anything. Every dataset you publish from now on will go through the full pipeline automatically. Existing flagship datasets have already been backfilled.
You also get a new Intelligence panel in your seller dashboard showing:
- Demand gaps — searches in your category that returned zero results, with example queries (these are direct dataset opportunities)
- Outcome stats — real training results reported by buyers, with average improvement percentages
- Valuation deltas — your asking price vs the AI-estimated fair value for each of your datasets
What This Means for Buyers
When you open a dataset page, click the new Intelligence tab to see all five reports for that dataset. You'll find:
- The contamination certificate hash and which benchmarks were checked
- The full Gebru datasheet, expandable inline
- The compliance report with scores and per-framework status
- The valuation breakdown with all five factor scores
- Aggregated training outcomes from other buyers (if any)
You can also report your own training results via the new Report results button on every completed purchase in your buyer dashboard. Reported outcomes flow back into the LQS calibration engine and improve future quality scoring for that category.
This is the proprietary infrastructure that makes LabelSets defensible. Anyone can build a marketplace; what makes one valuable is the trust layer underneath. We've spent the last quarter building it. Browse the marketplace → and click into any listing to see the new Intelligence tab in action.
What's Next
The intelligence stack is live as of today. Over the coming weeks we're shipping:
- Outcome-driven LQS calibration — weekly Pearson correlation between buyer-reported outcomes and LQS dimension scores will tune the quality scoring per category
- Public contamination certificate pages — shareable URLs for each cert so buyers can link them in compliance reviews
- Secure compute jobs — train-without-download infrastructure for sellers who want stronger IP protection (queue is built; provider integration in progress)
If you sell data on LabelSets, your existing listings already have the new badges where applicable, and your next upload will auto-generate the full intelligence package. If you buy data, the next time you open any listing you'll see the Intelligence tab with everything in it.
Questions? Ping us at support@labelsets.ai.