What NLP dataset types are available on LabelSets?

LabelSets offers datasets for sentiment analysis, named entity recognition (NER), text classification, question-answering pairs, summarization, intent detection, language translation, and LLM instruction fine-tuning.

What formats are NLP datasets available in?

NLP datasets are available in CSV, JSONL (newline-delimited JSON), Parquet, Arrow, and plain JSON. JSONL is most common for LLM fine-tuning workflows.

How do I sell my NLP dataset on LabelSets?

Upload your CSV or JSONL file. Our automated pipeline validates its structure and checks for PII. You then set a price and start earning. Sellers keep 85% of every sale.

NLP Datasets for LLM Fine-Tuning — Download in Minutes

Featured datasets

NLP corpora with verifiable provenance.

Live marketplace listings filtered to NLP. Every card shows a signed LQS score, a contamination-clean flag against public evals, and an originality signal.

Tasks covered

From classic classification to RLHF preferences.

Sentiment & aspect

Sentence- and aspect-level labels across reviews, social media, support tickets. Three-class and multi-dimensional sentiment schemas.

schema · pos/neu/neg · multi-aspect

Named entity recognition

Token-level span annotations for PER, ORG, LOC, DATE, and custom entity types. BIO + BILOU tagging schemes validated on upload.

schemes · BIO · BILOU · CoNLL
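
As a rough illustration of what a BIO consistency check can look like, here is a minimal sketch. The function name `bio_violations` is hypothetical, and this is not LabelSets' actual validator: it only flags `I-` tags that do not continue a span of the same entity type, plus tags that are neither `O`, `B-`, nor `I-`.

```python
from typing import List


def bio_violations(tags: List[str]) -> List[int]:
    """Return the indices of tags that break the BIO scheme."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            etype = tag[2:]
            # Under BIO, I-X is only valid directly after B-X or I-X.
            if prev not in (f"B-{etype}", f"I-{etype}"):
                bad.append(i)
        elif tag != "O" and not tag.startswith("B-"):
            # Anything other than O, B-X, or I-X is not a BIO tag.
            bad.append(i)
        prev = tag
    return bad
```

A sequence such as `["B-PER", "I-PER", "O", "B-ORG"]` passes, while an `I-LOC` that directly follows `O` is reported by index.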

LLM instruction + RLHF

Instruction-following, chat, and preference datasets formatted for fine-tuning open-weight models. Role/content schema + tool-use support.

targets · LLaMA · Mistral · Phi · Qwen
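
For a sense of what a role/content check might involve, the sketch below validates one JSONL line. It assumes each record holds a `messages` list of `{"role", "content"}` objects and an allowed role set of system, user, assistant, and tool. Both are assumptions for illustration; the exact schema LabelSets validates is not spelled out here.

```python
import json

# Assumed role set; adjust to the schema your fine-tuning framework expects.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def check_chat_line(line: str) -> list:
    """Return a list of human-readable problems found in one JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")
    return problems
```

An empty list means the line matched the assumed schema. Running this over a file before upload is a cheap way to catch malformed records early.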

Question answering

Extractive and abstractive Q&A pairs with context passages. SQuAD-style span answers + conversational multi-turn formats.

formats · SQuAD · NQ · conversational
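
In SQuAD-style data, each answer is stored as its text plus a character offset into the context passage. A quick sanity check, sketched below with a hypothetical helper name, is to confirm that the offset really points at the answer text.

```python
def span_matches(context: str, answer_text: str, answer_start: int) -> bool:
    """Check that answer_text appears in context at character offset answer_start."""
    if answer_start < 0:
        return False
    end = answer_start + len(answer_text)
    return context[answer_start:end] == answer_text
```

Offsets that drift after whitespace normalization or re-encoding are a common source of silent errors in extractive QA sets, so a check like this is worth running on any span-annotated corpus before training.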

Text classification

Single- and multi-label datasets for topic categorization, spam detection, intent classification. Class-balance metrics on the cert.

metrics · class balance · CI
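
The exact class-balance metric on the cert is not defined on this page. As one plausible reading, the sketch below computes per-class shares and a majority-to-minority ratio from a list of labels. The function name and the choice of ratio are assumptions.

```python
from collections import Counter


def class_balance(labels):
    """Return (per-class share, majority/minority count ratio) for a label list."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    # A ratio of 1.0 means perfectly balanced classes.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio
```

For multi-label data you would first flatten each example's label set into the list passed in.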

Translation + summarization

Parallel corpora for MT; reference summaries for abstractive summarization. Language-pair + domain metadata preserved.

pairs · 40+ lang pairs

JSONL · newline-delimited
CSV · headers required
Parquet · columnar
Arrow · zero-copy
JSON · plain
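
Since JSONL is simply one JSON object per line, it can be read with the Python standard library alone. The sketch below is a minimal reader. The `text` and `label` field names in the usage note are illustrative, not a LabelSets schema.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed record per non-empty line of a text stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: {exc.msg}") from exc


sample = io.StringIO(
    '{"text": "great", "label": "pos"}\n'
    '\n'
    '{"text": "bad", "label": "neg"}\n'
)
records = list(read_jsonl(sample))
```

Pass an open file handle instead of `io.StringIO` to read a downloaded dataset. Streaming line by line keeps memory use flat even for large corpora.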

What LabelSets adds

Cert fields designed for LLM labs.

LabelSets isn't another dataset dump. Every NLP corpus carries a signed cert built for the questions an eval team or procurement officer actually asks.

Benchmark contamination registry

Every NLP dataset is hashed and n-gram matched against 40+ public benchmarks. The cert carries a per-benchmark overlap percentage and a contamination-clean flag.

covers · MMLU · HumanEval · GSM8K · HellaSwag · …
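
To make the n-gram matching idea concrete, here is a simplified sketch that reports what share of a sample's word n-grams also appear in a benchmark text. The whitespace tokenization, the default `n=8`, and the function names are assumptions for illustration. The registry's actual hashing and matching parameters are not described on this page.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_pct(sample: str, benchmark: str, n: int = 8) -> float:
    """Percentage of the sample's n-grams that also occur in the benchmark."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0  # sample is shorter than n tokens
    shared = sample_grams & ngrams(benchmark, n)
    return 100.0 * len(shared) / len(sample_grams)
```

A production check would typically normalize punctuation and hash the n-grams so that benchmark text never needs to be held in memory in full.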

Automated PII scan

Every upload runs through entity-recognition + regex PII detection before publication. Sellers are contractually required to anonymize or remove PII.

field · pii_scanned · timestamp
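
The regex half of a PII scan can be sketched in a few lines. The two patterns below (email addresses and US-style phone numbers) are illustrative only and far from exhaustive. They are not the detectors LabelSets runs, and the entity-recognition half is not shown.

```python
import re

# Illustrative patterns only; real PII detection needs many more cases.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> dict:
    """Return every match for each pattern, keyed by pattern name."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
```

Running this over each text field before upload gives sellers a first pass at the anonymization they are required to do.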

Ed25519-signed provenance

Every cert carries a public-key signature + fingerprint. Buyers verify at /verify any time. Revocation registry handles post-facto license flags.

fingerprint · aa4c070af907e2ea

FAQ

Questions eval teams actually ask.

What NLP dataset types are available on LabelSets?

Sentiment analysis, NER, text classification, Q&A pairs, summarization, intent detection, dialogue, machine translation pairs, and LLM instruction-tuning / RLHF preference datasets. Filter on the browse page by task tag.

What formats are NLP datasets available in?

CSV, JSONL, Parquet, Arrow, and plain JSON. JSONL is the most common format for LLM fine-tuning workflows and is natively supported by Hugging Face Datasets, pandas, polars, and the major fine-tuning frameworks.

Yes. Every dataset runs through automated PII scanning before publication. Datasets that pass display a signed pii_scanned flag on the cert. Sellers are contractually required to remove or anonymize personal information before uploading.

Yes. Many sellers offer instruction-following, chat, and domain-specific datasets formatted for fine-tuning open-weight models including LLaMA 3, Mistral, Phi, and Qwen. Role/content schema and tool-use formats are validated on upload.

Every NLP dataset is hashed and n-gram matched against 40+ public evaluation benchmarks (MMLU, HumanEval, HellaSwag, GSM8K, SQuAD, MS MARCO, GLUE, and more). Per-benchmark overlap percentages and the is_contamination_clean flag are embedded in the signed cert.

How do I sell my NLP dataset on LabelSets?

Upload your CSV or JSONL file. The pipeline validates its structure and scans for PII and benchmark contamination. You then set a price, and buyers can purchase instantly. Sellers keep 85% of every sale. No listing fees, no subscriptions.

Browse all NLP datasets.

Live marketplace filtered by LQS score, contamination-clean flag, format, and license. Looking for public corpora? See the curated catalog (MMLU, The Pile, C4, MS MARCO, GLUE, Wikipedia) with LQS scores.

Browse datasets → Sell your dataset

Related categories

Public Dataset Catalog · Computer Vision · Audio & Speech · Financial Data · Medical Imaging · Autonomous Vehicles

Text corpora that won't contaminate your eval.
