Fine-tuning a large language model is increasingly a standard engineering task, but the dataset question — what data, in what format, how much — remains poorly understood outside a small circle of practitioners who've done it repeatedly. The right answer depends heavily on what you're actually trying to accomplish: teaching the model to follow instructions is a different problem from adapting it to a specialized domain, which is a different problem from shaping its response preferences through RLHF.

This guide covers the best NLP datasets for each LLM fine-tuning objective, what makes them good, and where to find commercially licensed options when public datasets won't clear legal review.

The Three Types of Fine-Tuning (and Why They Need Different Data)

The most common confusion in LLM fine-tuning comes from conflating three distinct objectives that each require a different data strategy: supervised instruction tuning, which needs instruction-response pairs; domain adaptation via continued pretraining, which needs large volumes of raw domain text; and preference tuning via RLHF, which needs ranked or chosen-versus-rejected responses. The dataset recommendations below are grouped by task type, starting with instruction tuning.

Instruction Tuning Datasets

Stanford Alpaca (52K)

52K instruction-response pairs · CC BY-NC-SA · GPT-3.5-generated
Easy to use · Non-commercial

Alpaca was the dataset that popularized cheap instruction tuning — 52,000 instruction-response pairs generated from GPT-3.5 using a self-instruct pipeline, released by Stanford. The quality is reasonable, and it demonstrated that instruction tuning could work with relatively small datasets. The hard limitations for production use: the NC license prohibits commercial use, and GPT-generated data has known issues (verbose responses, occasional hallucinations, a style that overfits to a GPT-3.5 response profile). Good for research experiments and baseline comparisons; not recommended for commercial products.

FLAN Collection

1,800+ NLP tasks · Apache 2.0 · Google Research
Commercial use · Multi-task

The FLAN collection (Finetuned Language Net) is Google's multi-task instruction fine-tuning dataset covering 1,800+ NLP tasks — classification, summarization, translation, QA, reasoning, and more. The key advantage over Alpaca-style datasets: FLAN tasks include chain-of-thought reasoning examples and cover a wide range of capabilities rather than just instruction-following. Apache 2.0 license means commercial use is permitted. Useful for building general-purpose instruction following on top of a base model, or as a component in a larger fine-tuning curriculum.

LabelSets Domain-Specific Instruction JSONL Datasets

Domain-specific · Commercial license · LQS quality-scored · Instant download
Commercial license · Domain-specific

For commercial LLM applications, generic instruction datasets often aren't enough. A customer support chatbot trained only on FLAN will produce responses that feel generic. A legal document analyzer needs training examples grounded in actual legal text and tasks. LabelSets hosts domain-specific JSONL instruction datasets across verticals including customer support, financial analysis, legal document handling, and technical Q&A. All commercially licensed, LQS quality-scored, and downloadable immediately. The JSONL format is directly compatible with the Hugging Face trl library's SFTTrainer and with OpenAI fine-tuning API format.
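As a concrete reference, here is what one chat-style JSONL record looks like, plus a minimal stdlib validator. The `messages` schema below follows the OpenAI chat fine-tuning format, which trl's SFTTrainer can also consume; the Acme Corp content is invented for illustration.

```python
import json

# One chat-style JSONL record. The "messages" schema is the OpenAI
# chat fine-tuning format; the content itself is a made-up example.
record = {
    "messages": [
        {"role": "system", "content": "You are a support agent for Acme Corp."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Security and choose 'Reset password'."},
    ]
}

def validate_jsonl_line(line: str) -> bool:
    """Check that one JSONL line parses and has the expected chat shape."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    msgs = obj.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in {"system", "user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in msgs
    )

ok = validate_jsonl_line(json.dumps(record))
```

Running the validator over every line of a downloaded file before training catches malformed records early; a single bad line can otherwise fail the entire job at load time.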

Classification Datasets

GLUE / SuperGLUE Benchmarks

Multiple tasks · Mixed licenses · Sentence-level NLU
Standard benchmark

GLUE and SuperGLUE remain the standard benchmarks for natural language understanding — covering sentiment analysis, textual entailment, question answering, and coreference resolution. They're useful for evaluating whether your fine-tuned model has retained general NLU capabilities rather than overfitting to your specific task. Most GLUE tasks are free for research; commercial use varies by the underlying dataset (each task has a separate source). For fine-tuning rather than evaluation, they're best used as part of a curriculum alongside domain-specific data.

Amazon Reviews (Polarity / Multi-class)

3.6M reviews · Sentiment / classification · Research use
Large scale · Amazon ToS restricts redistribution

Amazon product reviews are widely used for sentiment classification fine-tuning, covering millions of examples across multiple categories and star ratings. The data is rich, the sentiment labels are implicit (star ratings), and the size allows for robust training. The legal complication: Amazon's terms of service restrict data redistribution and commercial use of scraped review data. Using it for internal research is generally considered low-risk; using it to train a commercial product and shipping that product has meaningful legal exposure. Worth being aware of this before committing your fine-tuning pipeline to it.

Named Entity Recognition (NER) Datasets

CoNLL-2003

English/German news · 4 entity types · Research license · Sentence-level NER
Standard NER benchmark

CoNLL-2003 is the standard NER benchmark for English (and German), tagging person, organization, location, and miscellaneous entity types in Reuters newswire text. It has been used to train and evaluate virtually every NER model of the last two decades. The dataset is small by modern standards (roughly 14,000 training sentences) but clean and well-validated. The license (Reuters/CoNLL terms) is technically research-use only, though the dataset is widely used in commercial NLP pipelines without explicit enforcement. For domain-specific NER (biomedical, legal, financial), you'll need a different source — the entity types and text distribution are news-only.
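The CoNLL files store one token per line with a tag column, and most training code converts those tags into entity spans. A minimal converter might look like the sketch below, which assumes BIO2-style tags (the original files use IOB1 encoding and are usually converted first); the example sentence is a classic NER illustration, not a quote from the corpus:

```python
def bio_to_spans(tags):
    """Convert BIO tags (B-PER, I-PER, O, ...) into a list of
    (start, end, entity_type) spans with an exclusive end index."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        # A new entity begins at B-, or at I- with a mismatched type.
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

# Tokens: "U.N. official Ekeus heads for Baghdad"
tags = ["B-ORG", "O", "B-PER", "O", "O", "B-LOC"]
spans = bio_to_spans(tags)
```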

Question Answering Datasets

SQuAD 2.0

150K QA pairs · CC BY-SA 4.0 · Reading comprehension
High quality annotations · Free with attribution

SQuAD 2.0 (Stanford Question Answering Dataset) contains 150,000 questions over Wikipedia articles, including 50,000 adversarial unanswerable questions where the answer is not present in the context (testing the model's ability to say "I don't know"). Annotation quality is high — questions were written by crowdworkers who read the passage, and answers are human-validated. CC BY-SA 4.0 allows commercial use with attribution. For building extractive QA systems, or for training LLMs to handle document-grounded question answering, SQuAD 2.0 is the right starting point.
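The raw SQuAD 2.0 JSON nests question sets under article paragraphs. A minimal flattener can route unanswerable questions (those with `is_impossible` set) to an explicit "no answer" training target; the example paragraph below is invented but follows the real record shape:

```python
def squad2_examples(paragraph):
    """Flatten one SQuAD 2.0 paragraph dict into
    (question, answer_text, answer_start) triples.
    Unanswerable questions yield (question, None, None)."""
    out = []
    for qa in paragraph["qas"]:
        if qa.get("is_impossible"):
            out.append((qa["question"], None, None))
        else:
            ans = qa["answers"][0]
            out.append((qa["question"], ans["text"], ans["answer_start"]))
    return out

# Invented paragraph in the SQuAD 2.0 shape.
paragraph = {
    "context": "SQuAD was released by Stanford.",
    "qas": [
        {
            "question": "Who released SQuAD?",
            "is_impossible": False,
            "answers": [{"text": "Stanford", "answer_start": 22}],
        },
        {
            "question": "Who translated SQuAD into Latin?",
            "is_impossible": True,
            "answers": [],
        },
    ],
}
examples = squad2_examples(paragraph)
```

Keeping the character-level `answer_start` offsets intact matters for extractive QA: the training label is a span in the context, so any preprocessing that re-whitespaces the context will silently corrupt the labels.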

Natural Questions (Google)

320K QA pairs · CC BY-SA 3.0 · Open-domain QA
Realistic queries · Free with attribution

Natural Questions is Google's dataset of real search queries paired with Wikipedia articles containing the answer. Unlike SQuAD (where questions are written to match a passage), Natural Questions starts from actual user queries — which makes it more realistic for open-domain QA applications. Includes both short answers (spans) and long answers (full paragraphs). CC BY-SA 3.0. Particularly useful for training and evaluating retrieval-augmented generation (RAG) pipelines where you want the model to handle the kinds of questions real users actually ask.

Code Datasets

The Stack / StarCoder Data

6.4TB code · License-filtered GitHub · Multi-language
Large scale · License-filtered

The Stack (from BigCode) is the largest permissively licensed code dataset available — 6.4TB of source code across 358 programming languages, filtered to include only repositories with permissive licenses (MIT, Apache, BSD, etc.). It's the training data behind StarCoder and related code models. For fine-tuning a code-specific LLM or adding programming capability to a general model, this is the primary public option. The dataset was built with opt-out mechanisms for code authors, and BigCode has been transparent about the curation methodology — which makes it more defensible for commercial use than raw GitHub scrapes.

Data Quantity: How Much Do You Actually Need?

The short answer: far less than you think for instruction tuning, more than you think for domain adaptation.

For supervised fine-tuning on instruction following: 1,000–10,000 high-quality instruction-response pairs is often sufficient to meaningfully change model behavior. The key word is quality — 1,000 carefully written, consistent examples outperform 100,000 noisy, repetitive ones. This is the finding from models like Alpaca, Dolly, and subsequent research: data quality dominates data quantity in the instruction tuning regime.
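Quality starts with mundane hygiene. A minimal duplicate filter, normalizing case and whitespace before hashing, is the kind of cheap pass worth running before any SFT job; this is a sketch, not a full near-duplicate pipeline, and the example pairs are invented:

```python
import hashlib
import re

def dedupe_pairs(pairs):
    """Drop instruction/response pairs that are duplicates after
    lowercasing and whitespace normalization; keeps the first copy."""
    seen, kept = set(), []
    for instruction, response in pairs:
        key = re.sub(r"\s+", " ", f"{instruction} {response}".lower()).strip()
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((instruction, response))
    return kept

pairs = [
    ("Summarize this email.", "Sure, here is a summary."),
    ("Summarize this  email.", "sure, here is a summary."),  # near-identical
    ("Translate to French.", "Bien sûr."),
]
clean = dedupe_pairs(pairs)  # the second pair is dropped
```

Real pipelines usually go further (MinHash or embedding-based near-duplicate detection), but even exact-match dedup after normalization removes a surprising amount of redundancy from scraped instruction sets.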

For domain adaptation via continued pretraining: you need enough domain-specific tokens to shift the model's prior. A rule of thumb: at least 10–50M tokens of high-quality domain text to see meaningful behavior change. Less than that, and you're likely to see catastrophic forgetting of general capabilities without compensatory domain gains.
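To sanity-check whether a corpus clears that bar, the common heuristic of roughly 4 characters per token for English text under BPE-style tokenizers is close enough; actual ratios vary by tokenizer and domain, so treat this as an estimate:

```python
def estimate_tokens(corpus_bytes: int, chars_per_token: float = 4.0) -> float:
    """Rough token estimate from raw UTF-8 corpus size, using the
    ~4-characters-per-token heuristic for English text. Treats bytes
    as characters, which slightly overcounts for non-ASCII text."""
    return corpus_bytes / chars_per_token

# A 200 MB domain corpus lands around 52M tokens, at the top of the
# 10-50M token rule of thumb for meaningful domain adaptation.
tokens = estimate_tokens(200 * 1024 * 1024)
```

For a real go/no-go decision, run the actual tokenizer you'll train with over a sample of the corpus and extrapolate; domain text with heavy jargon or code often tokenizes much less efficiently than the heuristic suggests.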

Looking for commercially licensed NLP datasets ready for fine-tuning? Browse JSONL instruction pairs, classification datasets, and domain-specific corpora at LabelSets NLP datasets. Every listing includes an LQS quality score and format preview before you buy.