NLP & Text

Text corpora that won't contaminate your eval.

Instruction-tuning, RLHF, sentiment, NER, Q&A. CSV, JSONL, Parquet. Every dataset is hashed against MMLU, HumanEval, HellaSwag, GSM8K and 40+ other public benchmarks so your reported numbers hold up.

Formats
5
JSONL · CSV · Parquet · Arrow · JSON
Benchmark screen
40+
MMLU · HumanEval · HellaSwag · GSM8K · SQuAD · MS MARCO
PII scan
100%
Automated pre-publication scan · seller contractually bound
Seller payout
85%
Signed sale · 15% platform fee · instant download
Contamination + licensing fields mapped into every cert
MMLU-clean
hash-screened
HumanEval-clean
code bench
EU AI Act
Art. 10 aligned
GDPR
Art. 28 + SCCs
Ed25519
signed cert
Featured datasets

NLP corpora with verifiable provenance.

Live marketplace listings filtered to NLP. Every card shows signed LQS score, contamination-clean flag against public evals, and originality signal.

Tasks covered

From classic classification to RLHF preferences.

Sentiment & aspect
Sentence- and aspect-level labels across reviews, social media, support tickets. Three-class and multi-dimensional sentiment schemas.
schema · pos/neu/neg · multi-aspect
Named entity recognition
Token-level span annotations for PER, ORG, LOC, DATE, and custom entity types. BIO + BILOU tagging schemes validated on upload.
schemes · BIO · BILOU · CoNLL
LLM instruction + RLHF
Instruction-following, chat, and preference datasets formatted for fine-tuning open-weight models. Role/content schema + tool-use support.
targets · LLaMA · Mistral · Phi · Qwen
Question answering
Extractive and abstractive Q&A pairs with context passages. SQuAD-style span answers + conversational multi-turn formats.
formats · SQuAD · NQ · conversational
Text classification
Single- and multi-label datasets for topic categorization, spam detection, intent classification. Class-balance metrics on the cert.
metrics · class balance · CI
Translation + summarization
Parallel corpora for MT; reference summaries for abstractive summarization. Language-pair + domain metadata preserved.
pairs · 40+ lang pairs
JSONL · newline-delimited CSV · headers required Parquet · columnar Arrow · zero-copy JSON · plain
What LabelSets adds

Cert fields designed for LLM labs.

LabelSets isn't another dataset dump. Every NLP corpus carries a signed cert built for the questions an eval team or procurement officer actually asks.

Benchmark contamination registry
Every NLP dataset hashed + n-gram matched against 40+ public benchmarks. Per-benchmark overlap % + contamination-clean flag on the cert.
covers · MMLU · HumanEval · GSM8K · HellaSwag · …
Automated PII scan
Every upload runs through entity-recognition + regex PII detection before publication. Sellers are contractually required to anonymize or remove PII.
field · pii_scanned · timestamp
Ed25519-signed provenance
Every cert carries a public-key signature + fingerprint. Buyers verify at /verify any time. Revocation registry handles post-facto license flags.
fingerprint · aa4c070af907e2ea
FAQ

Questions eval teams actually ask.

Sentiment analysis, NER, text classification, Q&A pairs, summarization, intent detection, dialogue, machine translation pairs, and LLM instruction-tuning / RLHF preference datasets. Filter on the browse page by task tag.
CSV, JSONL, Parquet, Arrow, and plain JSON. JSONL is the most common format for LLM fine-tuning workflows and is natively supported by Hugging Face Datasets, pandas, polars, and the major fine-tuning frameworks.
Yes. Every dataset runs through automated PII scanning before publication. Datasets that pass display a signed pii_scanned flag on the cert. Sellers are contractually required to remove or anonymize personal information before uploading.
Yes. Many sellers offer instruction-following, chat, and domain-specific datasets formatted for fine-tuning open-weight models including LLaMA 3, Mistral, Phi, and Qwen. Role/content schema and tool-use formats are validated on upload.
Every NLP dataset is hashed and n-gram matched against 40+ public evaluation benchmarks (MMLU, HumanEval, HellaSwag, GSM8K, SQuAD, MS MARCO, GLUE, and more). Per-benchmark overlap percentages and the is_contamination_clean flag are embedded in the signed cert.
Upload your CSV or JSONL file, the pipeline validates structure and scans for PII + benchmark contamination, you set a price, and buyers can purchase instantly. Sellers keep 85% of every sale. No listing fees, no subscriptions.

Browse all NLP datasets.

Live marketplace filtered by LQS score, contamination-clean flag, format, and license. Looking for public corpora? See the curated catalog (MMLU, The Pile, C4, MS MARCO, GLUE, Wikipedia) with LQS scores.

Related categories