Buying training data has always been an act of faith. You read the listing, look at the sample preview, hope the labels are accurate, hope it isn't repackaged from a public source, hope it doesn't accidentally include the eval set you're about to benchmark on, hope your legal team won't kill the project six months later when GDPR comes up.
We've decided to remove every one of those hopes from the buying process. Every dataset on LabelSets now ships with five pieces of automated intelligence — generated from the moment the file is uploaded, attached to the listing as proof, and visible to buyers before they spend a dollar.
Here's what's new, why each one matters, and what it looks like when you open a dataset page.
1. Eval-Clean Certificate
Verified contamination-free against 13 ML benchmarks
The single biggest hidden risk in training data is benchmark contamination — when a dataset secretly contains samples from the eval set you're about to benchmark on. The result: inflated metrics, unreproducible results, and embarrassing post-launch discoveries.
Every dataset on LabelSets is now scanned at upload time against MinHash signatures of 13 known eval sets, plus characteristic-phrase matching as a second layer. Datasets that pass receive a SHA-256 contamination certificate and the Eval-Clean badge. Datasets that fail are flagged with the specific benchmark they overlap with and a similarity score.
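To make the mechanism concrete, here is a toy MinHash check in plain Python. The token hashing scheme, the 0.8 flag threshold, and the example signatures are illustrative assumptions; the production scanner and its benchmark signature store are not shown.

```python
import hashlib

NUM_PERM = 128

def minhash(tokens, num_perm=NUM_PERM):
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value seen over all tokens."""
    return [
        min(int.from_bytes(hashlib.sha256(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical comparison against one benchmark's stored signature:
dataset_sig = minhash("the quick brown fox jumps over the lazy dog".split())
bench_sig = minhash("the quick brown fox leaps over a sleepy dog".split())
if jaccard_estimate(dataset_sig, bench_sig) > 0.8:
    print("flag: possible benchmark overlap")
```

The key property is that two near-identical sample sets produce near-identical signatures, so overlap can be detected without comparing raw text pairwise.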
2. Originality Verification
Multi-layer plagiarism + republishing detection
Every upload is checked against four originality signals before it can be listed:
- MinHash LSH — 128-permutation near-duplicate signatures indexed with locality-sensitive hashing across the entire LabelSets corpus; catches reformatted copies of existing datasets
- Statistical fingerprinting — column distributions, vocabulary profiles, and color histograms detect structural copies even if the content is shuffled or renamed
- External registry search — Hugging Face and Papers With Code lookups catch sellers republishing public datasets under new names
- Async web sampling — random text passages from the dataset are checked against the open web after publication; the seller is flagged if matches cluster
Datasets scoring above 0.95 originality earn the Original badge and an originality certificate. Datasets that match an existing source are blocked from listing.
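The statistical-fingerprinting idea in the second signal can be sketched with a vocabulary profile: a shuffled or renamed copy keeps the same token distribution even when no passage matches verbatim. This is a minimal sketch; the production fingerprints (column distributions, color histograms) are richer.

```python
import math
from collections import Counter

def vocab_profile(texts):
    """Token-frequency profile of a corpus, normalized to unit length."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two normalized vocabulary profiles."""
    return sum(w * q.get(tok, 0.0) for tok, w in p.items())

original = vocab_profile(["the cat sat on the mat", "dogs chase cats"])
shuffled = vocab_profile(["cats chase dogs", "on the mat the cat sat"])
unrelated = vocab_profile(["stock prices rose sharply on friday"])

print(cosine(original, shuffled))   # high: same vocabulary, reordered
print(cosine(original, unrelated))  # low: different vocabulary
```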
3. Gebru Datasheet
Auto-generated structured documentation following Gebru et al. (2018)
The "Datasheets for Datasets" framework was proposed by Timnit Gebru and co-authors in 2018 as the gold standard for ML dataset documentation. The catch: writing a good Gebru datasheet by hand takes 4-6 hours per dataset. Almost nobody does it.
LabelSets now generates a complete Gebru datasheet automatically for every published dataset, populated from validation stats, provenance, label types, collection method, and seller-provided metadata. Buyers see seven structured sections: motivation, composition, collection process, preprocessing, intended uses (and unsuitable uses), distribution, and maintenance.
The datasheet renders inline on every dataset page under the new Intelligence tab, and is also available as raw markdown via GET /api/datasets/:id/datasheet?format=markdown for buyers who want to include it in their model card.
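Fetching the markdown variant from that endpoint looks roughly like this with the standard library. The `labelsets.ai` host and the bearer-token auth header are assumptions; check the API docs for the actual base URL and auth scheme.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://labelsets.ai"  # assumed host

def datasheet_url(dataset_id, fmt="markdown"):
    """Build the datasheet endpoint URL for a dataset."""
    return f"{BASE}/api/datasets/{dataset_id}/datasheet?" + urlencode({"format": fmt})

def fetch_datasheet(dataset_id, token):
    """Fetch the raw-markdown datasheet (auth scheme is an assumption)."""
    req = Request(datasheet_url(dataset_id),
                  headers={"Authorization": f"Bearer {token}"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")

print(datasheet_url("ds_123"))
```

The returned markdown can be pasted directly into a model card's dataset section.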
4. Compliance Report (GDPR / CCPA / HIPAA / EU AI Act)
Multi-framework compliance assessment with score and gap analysis
Every dataset gets a full compliance report covering the four frameworks that matter for production ML: GDPR, CCPA, HIPAA, and the EU AI Act. Each report includes a 0-100 compliance score, identified gaps, recommendations, and a list of certifications the dataset qualifies for.
The report is generated by a structured Claude prompt working from the dataset's provenance fields (collection method, consent type, license, jurisdiction) and is stored as both structured JSON and rendered HTML. Buyers can read it before purchase and download the HTML version for their compliance file after buying.
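The stored JSON looks roughly like the sketch below. Field names and values here are illustrative assumptions based on the description above, not the actual schema.

```python
import json

report = {
    "framework_scores": {  # 0-100 per framework
        "GDPR": 88, "CCPA": 92, "HIPAA": 40, "EU_AI_ACT": 76,
    },
    "gaps": [
        {"framework": "HIPAA",
         "issue": "No evidence of de-identification for health-related fields",
         "recommendation": "Document the de-identification method used, or mark the dataset as unsuitable for HIPAA-covered uses"},
    ],
    "qualifying_certifications": ["GDPR-ready", "CCPA-ready"],
}

# Simple mean across frameworks (the production aggregation is not public).
overall = sum(report["framework_scores"].values()) / len(report["framework_scores"])
print(json.dumps(report, indent=2))
print(f"overall: {overall:.0f}/100")
```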
5. AI Valuation
Multi-factor fair-market value estimate with confidence range
Every dataset gets a fair-market valuation computed from five weighted signals:
- Uniqueness (35%) — pulled from the originality engine
- Demand (25%) — derived from search logs and zero-result query patterns in the dataset's category
- Quality (20%) — the LabelSets Quality Score (LQS) composite
- Comparables (15%) — price positioning vs similar published datasets
- Outcomes (5%) — average improvement reported by buyers who trained on it
The result is a USD estimate with a low/high confidence range, displayed both as a card on the dataset page and as a price-vs-AI-valuation delta in the seller dashboard, so sellers can see when they're under- or over-priced relative to the model.
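The blend itself is just a weighted sum of the five factor scores. The weights below are the published ones; the price anchoring and the 20% spread used for the low/high range are illustrative assumptions.

```python
WEIGHTS = {  # published factor weights
    "uniqueness": 0.35, "demand": 0.25, "quality": 0.20,
    "comparables": 0.15, "outcomes": 0.05,
}

def valuation(factors, base_price, spread=0.2):
    """Blend 0-1 factor scores into a point estimate and a low/high range.
    base_price and spread are illustrative, not the production anchoring."""
    score = sum(WEIGHTS[k] * factors[k] for k in WEIGHTS)
    estimate = base_price * score
    return estimate * (1 - spread), estimate, estimate * (1 + spread)

low, mid, high = valuation(
    {"uniqueness": 0.9, "demand": 0.7, "quality": 0.8,
     "comparables": 0.6, "outcomes": 0.5},
    base_price=5000,
)
print(f"${low:,.0f} / ${mid:,.0f} / ${high:,.0f}")
```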
How It All Fits Together
The five systems are not independent. They form a single intelligence pipeline that runs the moment a dataset is uploaded: each report draws on the same validation and provenance data, and the originality score feeds directly into the valuation.
From the seller's perspective: drop the file, fill out the listing, click publish. Everything else happens automatically. From the buyer's perspective: every listing now includes the answers to questions you used to have to ask the seller individually.
What This Means for Sellers
You don't have to do anything. Every dataset you publish from now on will go through the full pipeline automatically. Existing flagship datasets have already been backfilled.
You also get a new Intelligence panel in your seller dashboard showing:
- Demand gaps — searches in your category that returned zero results, with example queries (these are direct dataset opportunities)
- Outcome stats — real training results reported by buyers, with average improvement percentages
- Valuation deltas — your asking price vs the AI-estimated fair value for each of your datasets
What This Means for Buyers
When you open a dataset page, click the new Intelligence tab to see all five reports for that dataset. You'll find:
- The contamination certificate hash and which benchmarks were checked
- The full Gebru datasheet, expandable inline
- The compliance report with scores and per-framework status
- The valuation breakdown with all five factor scores
- Aggregated training outcomes from other buyers (if any)
You can also report your own training results via the new Report results button on every completed purchase in your buyer dashboard. Reported outcomes flow back into the LQS calibration engine and improve future quality scoring for that category.
This is the proprietary infrastructure that makes LabelSets defensible. Anyone can build a marketplace; what makes one valuable is the trust layer underneath. We've spent the last quarter building it. Browse the marketplace → and click into any listing to see the new Intelligence tab in action.
What's Next
The intelligence stack is live as of today. Over the coming weeks we're shipping:
- Outcome-driven LQS calibration — weekly Pearson correlation between buyer-reported outcomes and LQS dimension scores will tune the quality scoring per category
- Public contamination certificate pages — shareable URLs for each cert so buyers can link them in compliance reviews
- Secure compute jobs — train-without-download infrastructure for sellers who want stronger IP protection (queue is built; provider integration in progress)
If you sell data on LabelSets, your existing listings already have the new badges where applicable, and your next upload will auto-generate the full intelligence package. If you buy data, the next time you open any listing you'll see the Intelligence tab with everything in it.
Questions? Ping us at support@labelsets.ai.