Fine-tuning a large language model is increasingly a standard engineering task, but the dataset question — what data, in what format, how much — remains poorly understood outside a small circle of practitioners who've done it repeatedly. The right answer depends heavily on what you're actually trying to accomplish: teaching the model to follow instructions is a different problem from adapting it to a specialized domain, which is a different problem from shaping its response preferences through RLHF.
This guide covers the best NLP datasets for each LLM fine-tuning objective, what makes them good, and where to find commercially licensed options when public datasets won't clear legal review.
The Three Types of Fine-Tuning (and Why They Need Different Data)
The most common confusion in LLM fine-tuning comes from conflating three distinct objectives that each require different data strategies:
- Instruction tuning (supervised fine-tuning / SFT) — Teaching a base model to follow instructions and behave helpfully. Requires instruction-response pairs, typically in JSONL format with {"instruction": "...", "output": "..."} or chat-style {"messages": [...]} structure. Even a few thousand high-quality examples can meaningfully shift behavior.
- Continued pretraining (domain adaptation) — Extending a model's knowledge in a specific domain (legal, medical, financial, code). Requires large volumes of domain-relevant text — tens to hundreds of millions of tokens. Quality matters, but the format is simpler: plain text or minimal JSONL.
- RLHF / preference tuning (DPO, PPO) — Shaping model preferences using human-annotated comparisons of responses. Requires paired data: a prompt, a "chosen" response, and a "rejected" response. Human judgment quality is critical; volume matters less than annotation consistency.
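The three objectives above map to three concrete record shapes. A minimal sketch of what one training record might look like for each — field names follow common open-source conventions, and your framework may expect slightly different keys:

```python
import json

# Instruction tuning (SFT): one instruction-response pair per JSONL line.
sft_record = {
    "instruction": "Summarize the following ticket in one sentence.",
    "output": "Customer reports login failures after the 2.3 update.",
}

# Continued pretraining: plain text, optionally wrapped in minimal JSONL.
pretrain_record = {"text": "Section 4.2 of the agreement governs termination."}

# Preference tuning (DPO): a prompt plus chosen/rejected responses.
dpo_record = {
    "prompt": "Explain DNS in one paragraph.",
    "chosen": "DNS maps human-readable domain names to IP addresses...",
    "rejected": "DNS is a thing computers use.",
}

# Each record becomes one line in a .jsonl file.
for record in (sft_record, pretrain_record, dpo_record):
    line = json.dumps(record)
    assert json.loads(line) == record  # round-trips cleanly
```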
Instruction Tuning Datasets
Stanford Alpaca (52K)
Easy to use · Non-commercial
Alpaca was the dataset that popularized cheap instruction tuning — 52,000 instruction-response pairs generated from GPT-3.5 using a self-instruct pipeline, released by Stanford. The quality is reasonable and it demonstrated that instruction tuning could work with relatively small datasets. The hard limitations for production use: the NC license prevents commercial use, and GPT-generated data has known issues (verbose responses, occasional hallucinations, a style that overfits to a GPT-3.5 response profile). Good for research experiments and baseline comparisons; not recommended for commercial products.
FLAN Collection
Commercial use · Multi-task
The FLAN collection (Finetuned Language Net) is Google's multi-task instruction fine-tuning dataset covering 1,800+ NLP tasks — classification, summarization, translation, QA, reasoning, and more. The key advantage over Alpaca-style datasets: FLAN tasks include chain-of-thought reasoning examples and cover a wide range of capabilities rather than just instruction-following. Apache 2.0 license means commercial use is permitted. Useful for building general-purpose instruction following on top of a base model, or as a component in a larger fine-tuning curriculum.
LabelSets Domain-Specific Instruction JSONL Datasets — Browse NLP Datasets
Commercial license · Domain-specific
For commercial LLM applications, generic instruction datasets often aren't enough. A customer support chatbot trained only on FLAN will produce responses that feel generic. A legal document analyzer needs training examples grounded in actual legal text and tasks. LabelSets hosts domain-specific JSONL instruction datasets across verticals including customer support, financial analysis, legal document handling, and technical Q&A. All commercially licensed, LQS quality-scored, and downloadable immediately. The JSONL format is directly compatible with the Hugging Face trl library's SFTTrainer and with the OpenAI fine-tuning API format.
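Moving between the two formats mentioned here is usually a one-function conversion. A sketch, assuming Alpaca-style input keys — check your dataset's actual schema before relying on these names:

```python
import json

def to_chat_format(record: dict) -> dict:
    """Convert an {"instruction", "output"} record into the
    {"messages": [...]} shape used by chat-style SFT and the
    OpenAI fine-tuning API."""
    return {
        "messages": [
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }

record = {"instruction": "Define churn rate.",
          "output": "Churn rate is the fraction of customers lost per period."}
chat_line = json.dumps(to_chat_format(record))  # one JSONL line, ready to write
```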
Classification Datasets
GLUE / SuperGLUE Benchmarks
Standard benchmark
GLUE and SuperGLUE remain the standard benchmarks for natural language understanding — covering sentiment analysis, textual entailment, question answering, and coreference resolution. They're useful for evaluating whether your fine-tuned model has retained general NLU capabilities rather than overfitting to your specific task. Most GLUE tasks are free for research; commercial use varies by the underlying dataset (each task has a separate source). For fine-tuning rather than evaluation, they're best used as part of a curriculum alongside domain-specific data.
Amazon Reviews (Polarity / Multi-class)
Large scale · Amazon ToS restricts redistribution
Amazon product reviews are widely used for sentiment classification fine-tuning, covering millions of examples across multiple categories and star ratings. The data is rich, the sentiment labels are implicit (star ratings), and the size allows for robust training. The legal complication: Amazon's terms of service restrict data redistribution and commercial use of scraped review data. Using it for internal research is generally considered low-risk; using it to train a commercial product and shipping that product has meaningful legal exposure. Worth being aware of this before committing your fine-tuning pipeline to it.
Named Entity Recognition (NER) Datasets
CoNLL-2003
Standard NER benchmark
CoNLL-2003 is the standard NER benchmark for English (and German), tagging person, organization, location, and miscellaneous entity types in Reuters newswire text. It's been used to train and evaluate virtually every NER model in the last two decades. The dataset is small by modern standards (14,000 training sentences) but clean and well-validated. The license (Reuters/CoNLL terms) is technically research-use only, though the dataset is widely used in commercial NLP pipelines without explicit enforcement. For domain-specific NER (biomedical, legal, financial), you'll need a different source — the entity types and text distribution are news-only.
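CoNLL-2003 ships in a simple column format: one token per line with its tag, blank lines separating sentences. A minimal parser sketch — this keeps only the token and NER columns, while the real files also carry POS and chunk columns between them:

```python
def parse_conll(text: str) -> list[list[tuple[str, str]]]:
    """Parse CoNLL-style column text into sentences of
    (token, ner_tag) pairs. Takes the first column as the token
    and the last as the NER tag; blank lines delimit sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences

sample = "U.N. B-ORG\nofficial O\nEkeus B-PER\nheads O\nfor O\nBaghdad B-LOC\n"
parsed = parse_conll(sample)  # one sentence of (token, tag) pairs
```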
Question Answering Datasets
SQuAD 2.0
High quality annotations · Free with attribution
SQuAD 2.0 (Stanford Question Answering Dataset) contains 150,000 question-answer pairs derived from Wikipedia articles, including 50,000 adversarial questions where the answer is not present in the context (testing model ability to say "I don't know"). Annotation quality is high — questions were written by crowdworkers who read the passage, and answers are human-validated. CC BY-SA 4.0 allows commercial use with attribution. For building extractive QA systems or for training LLMs to handle document-grounded question answering, SQuAD 2.0 is the right starting point.
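The feature that sets SQuAD 2.0 apart is the is_impossible flag on each question. When converting it to instruction pairs, unanswerable questions should map to an explicit refusal. A sketch, assuming simplified field names and a refusal string of your choosing:

```python
def squad_to_sft(example: dict,
                 refusal: str = "The answer is not in the passage.") -> dict:
    """Turn a SQuAD 2.0-style example into an instruction pair,
    mapping unanswerable questions to an explicit refusal so the
    model learns to say "I don't know"."""
    if example.get("is_impossible"):
        answer = refusal
    else:
        answer = example["answers"][0]["text"]
    return {
        "instruction": f"Context: {example['context']}\nQuestion: {example['question']}",
        "output": answer,
    }

ex = {"context": "Paris is the capital of France.",
      "question": "What is the capital of Germany?",
      "is_impossible": True,
      "answers": []}
pair = squad_to_sft(ex)  # output is the refusal string
```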
Natural Questions (Google)
Realistic queries · Free with attribution
Natural Questions is Google's dataset of real search queries paired with Wikipedia articles containing the answer. Unlike SQuAD (where questions are written to match a passage), Natural Questions starts from actual user queries — which makes it more realistic for open-domain QA applications. Includes both short answers (spans) and long answers (full paragraphs). CC BY-SA 3.0. Particularly useful for training and evaluating retrieval-augmented generation (RAG) pipelines where you want the model to handle the kinds of questions real users actually ask.
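The short-answer/long-answer distinction matters when building RAG training pairs: a short answer is a span inside its long answer. A simplified record shape — the field names here are illustrative, as the released format identifies answers by byte offsets into the HTML rather than by plain strings:

```python
# Illustrative Natural Questions-style record (simplified field names).
nq_style_record = {
    "question": "who founded the red cross",
    "long_answer": ("The Red Cross was founded in 1863 by Henry Dunant, "
                    "a Swiss businessman, together with four colleagues."),
    "short_answer": "Henry Dunant",
}

# A sanity check worth running on such data: every short answer
# should appear inside its long answer.
assert nq_style_record["short_answer"] in nq_style_record["long_answer"]
```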
Code Datasets
The Stack / StarCoder Data
Large scale · License-filtered
The Stack (from BigCode) is the largest permissively licensed code dataset available — 6.4TB of source code across 358 programming languages, filtered to include only repositories with permissive licenses (MIT, Apache, BSD, etc.). It's the training data behind StarCoder and related code models. For fine-tuning a code-specific LLM or adding programming capability to a general model, this is the primary public option. The dataset was built with opt-out mechanisms for code authors, and BigCode has been transparent about the curation methodology — which makes it more defensible for commercial use than raw GitHub scrapes.
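If you curate your own code corpus instead, the same principle applies: filter against an explicit allow-list of permissive licenses before training. A sketch — the identifiers follow SPDX convention, and the allow-list itself is a legal judgment for your team, not an authoritative set:

```python
# Example allow-list of SPDX license identifiers; adjust to your
# own legal review, this set is illustrative.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}

def keep_file(meta: dict) -> bool:
    """Keep a source file only if its repository license is on the
    permissive allow-list. Missing or unknown licenses are dropped."""
    return meta.get("license") in PERMISSIVE

files = [
    {"path": "a.py", "license": "MIT"},
    {"path": "b.py", "license": "GPL-3.0-only"},
    {"path": "c.py", "license": None},
]
kept = [f["path"] for f in files if keep_file(f)]
# → ["a.py"]
```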
Data Quantity: How Much Do You Actually Need?
The short answer: far less than you think for instruction tuning, more than you think for domain adaptation.
For supervised fine-tuning on instruction following: 1,000–10,000 high-quality instruction-response pairs is often sufficient to meaningfully change model behavior. The key word is quality — 1,000 carefully written, consistent examples outperform 100,000 noisy, repetitive ones. This is the finding from models like Alpaca, Dolly, and subsequent research: data quality dominates data quantity in the instruction tuning regime.
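One cheap way to act on "quality over quantity" is to deduplicate near-identical instructions before training. A minimal sketch using normalized exact-match dedup — real pipelines often add fuzzy matching (e.g. MinHash) on top:

```python
def dedupe_instructions(records: list[dict]) -> list[dict]:
    """Drop records whose instruction is a duplicate after
    lowercasing and whitespace normalization, keeping first occurrence."""
    seen, out = set(), []
    for r in records:
        key = " ".join(r["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

data = [
    {"instruction": "Summarize this article.", "output": "..."},
    {"instruction": "summarize  THIS article.", "output": "..."},  # near-dupe
    {"instruction": "Translate to French.", "output": "..."},
]
deduped = dedupe_instructions(data)  # 2 records survive
```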
For domain adaptation via continued pretraining: you need enough domain-specific tokens to shift the model's prior. A rule of thumb: at least 10–50M tokens of high-quality domain text to see meaningful behavior change. Less than that, and you're likely to see catastrophic forgetting of general capabilities without compensatory domain gains.
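Before committing to continued pretraining, it's worth estimating whether your corpus clears the 10–50M token bar. A rough heuristic sketch — the ~0.75 words-per-token ratio is a common approximation for English BPE tokenizers, and you should use your model's actual tokenizer for a real count:

```python
def estimate_tokens(texts: list[str], words_per_token: float = 0.75) -> int:
    """Rough token estimate: total word count divided by an assumed
    words-per-token ratio (~0.75 is a common English/BPE heuristic)."""
    words = sum(len(t.split()) for t in texts)
    return int(words / words_per_token)

# 1,000 copies of a 6-word document: 6,000 words ≈ 8,000 tokens,
# far short of the 10-50M needed for domain adaptation.
corpus = ["This is a tiny sample document."] * 1000
tokens = estimate_tokens(corpus)
```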
Looking for commercially licensed NLP datasets ready for fine-tuning? Browse JSONL instruction pairs, classification datasets, and domain-specific corpora at LabelSets NLP datasets. Every listing includes an LQS quality score and format preview before you buy.