Fine-tuning a large language model on your own data used to require a research team and months of work. Today, with tools like LoRA, QLoRA, and the Hugging Face ecosystem, a single engineer can fine-tune a 7B model on a consumer GPU. But the bottleneck hasn't changed: getting the right training data.
This guide covers everything you need to know about LLM fine-tuning datasets — the correct format, how much data you actually need, and where to source domain-specific instruction data when you don't have enough of your own.
What Is Fine-Tuning Data?
Fine-tuning adjusts a pre-trained model's behavior to a specific task or domain. The data you use tells the model: "when you see inputs like this, produce outputs like this." The most common form is instruction tuning — teaching the model to follow instructions in a particular domain or style.
Unlike pre-training (which requires terabytes of raw text), fine-tuning works with thousands to tens of thousands of carefully written examples. Quality matters far more than quantity.
The JSONL Format: What Your Data Must Look Like
The universal format for instruction fine-tuning is JSONL — one JSON object per line. There are two main schemas:
Alpaca-style (instruction + input + output)
{"instruction": "Summarize the following customer complaint in one sentence.", "input": "I ordered a laptop on March 1st and it still hasn't arrived. I've emailed support three times with no response...", "output": "Customer reports a late delivery with no response from support after three contact attempts."}
{"instruction": "Classify the sentiment of this review.", "input": "Absolutely fantastic product, exceeded every expectation.", "output": "Positive"}
Chat/messages style (OpenAI format)
{"messages": [{"role": "system", "content": "You are a medical assistant. Answer questions accurately and conservatively."}, {"role": "user", "content": "What is the first-line treatment for hypertension?"}, {"role": "assistant", "content": "The first-line treatment for hypertension typically includes lifestyle modifications such as reducing sodium intake, increasing physical activity, and losing weight if overweight. If lifestyle changes are insufficient, ACE inhibitors, ARBs, or thiazide diuretics are the standard pharmacological first line."}]}
Use the chat/messages format for conversational models (Llama 3, Mistral, Phi-3). Use Alpaca-style for task-focused fine-tunes. Most fine-tuning frameworks (axolotl, LLaMA-Factory, Unsloth) support both.
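The two schemas are easy to convert between with the standard library alone. Here is a minimal sketch (field names taken from the examples above; the helper name and sample record are illustrative):

```python
import json

def alpaca_to_messages(example: dict) -> dict:
    """Convert one Alpaca-style record into the chat/messages schema."""
    # Fold the optional "input" field into the user turn.
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return {"messages": [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]}

# Write the converted records out as JSONL: one JSON object per line.
records = [
    {"instruction": "Classify the sentiment of this review.",
     "input": "Absolutely fantastic product, exceeded every expectation.",
     "output": "Positive"},
]
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(alpaca_to_messages(rec)) + "\n")
```

If your base model expects a system prompt, prepend a `{"role": "system", ...}` turn in the same place for every example so the format stays consistent across the dataset.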
How Much Data Do You Actually Need?
The honest answer: less than you think, as long as it's high quality.
- 500–1,000 examples — Enough to noticeably shift behavior in a specific task (e.g., response format, tone, domain vocabulary)
- 1,000–5,000 examples — Good for task-specific fine-tuning (customer support, code review, domain Q&A)
- 5,000–20,000 examples — Strong domain adaptation (medical, legal, financial domains)
- 50,000+ examples — Full instruction-following overhaul; competing with base instruction models
The most common mistake is chasing volume. A fine-tune on 500 expert-curated medical Q&A pairs will outperform one on 50,000 scraped, noisy examples. Garbage in, garbage out — and bad fine-tuning data will actively degrade model performance.
Domain-Specific vs. General Fine-Tuning Data
General instruction-following datasets (Alpaca, Dolly, FLAN) make models more capable instruction followers. Domain-specific datasets make a model an expert in a particular field. You often want both — start with a general instruction-tuned base model, then fine-tune on your domain data.
High-value fine-tuning domains in 2026
- Medical / Clinical — Drug interactions, diagnosis support, lab interpretation. Requires expert annotation. High barrier = high value.
- Legal — Contract analysis, clause classification, legal Q&A. Demand is enormous from law firms and legal tech startups.
- Customer Service — Support ticket classification and response generation. Large volume needed, but can often be partially automated.
- Code — Bugfix generation, code review, SQL optimization. GitHub Copilot competitors dominate here.
- Finance — Portfolio advice, tax optimization, earnings analysis. Highly regulated — dataset quality and sourcing matter for compliance.
LabelSets carries ready-made fine-tuning datasets for these exact domains. Our Customer Service Fine-Tuning dataset (5,000 instruction-response pairs, JSONL) is the most popular at $149. We're also shipping Clinical Medical QA, Legal Document Analysis, and Financial Advisory datasets in the coming weeks. Browse NLP & fine-tuning datasets →
Data Quality Checklist
Before fine-tuning on any dataset — whether you built it or bought it — run through this checklist:
- Response accuracy — Are the output labels/responses factually correct? One wrong medical answer in 100 will still degrade trust.
- Consistency — Do similar inputs get similar outputs? Contradictions confuse the model.
- Format consistency — All examples should use the same JSONL schema. Mixed formats cause silent training failures.
- Length distribution — Extremely short outputs (<5 tokens) or extremely long ones (>2048 tokens) can destabilize training. Check the distribution.
- No PII — Names, emails, phone numbers, and SSNs in training data create liability. Run a PII scan.
- No near-duplicates — Duplicate or near-duplicate examples overfit and waste training budget. Deduplicate with MinHash or exact matching.
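Most of the checklist above can be automated with the standard library. The sketch below audits an Alpaca-style JSONL file for mixed schemas, exact duplicates, length outliers, and obvious PII; the word-count thresholds and regex are rough illustrations, not a production-grade scan (a real PII pass needs a dedicated library, and real length checks should use your tokenizer):

```python
import json
import re

def audit_jsonl(path, min_words=1, max_words=512):
    """Basic quality checks for an Alpaca-style JSONL file.
    Word counts are only a rough proxy for token counts."""
    report = {"schemas": set(), "duplicates": 0, "pii_hits": 0, "length_flags": 0}
    seen = set()
    # Crude patterns for emails and US SSNs -- illustrative only.
    pii = re.compile(r"[\w.+-]+@[\w-]+\.\w+|\b\d{3}-\d{2}-\d{4}\b")
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            report["schemas"].add(tuple(sorted(ex)))   # mixed schemas -> count > 1
            key = json.dumps(ex, sort_keys=True)
            if key in seen:                            # exact-match dedup
                report["duplicates"] += 1
            seen.add(key)
            n = len(ex.get("output", "").split())
            if not (min_words <= n <= max_words):      # length outliers
                report["length_flags"] += 1
            if pii.search(line):                       # crude PII scan
                report["pii_hits"] += 1
    report["schemas"] = len(report["schemas"])
    return report

# Quick demo on a tiny file containing one duplicate and one email address.
rows = [
    {"instruction": "Classify sentiment.", "input": "Great!", "output": "Positive"},
    {"instruction": "Classify sentiment.", "input": "Great!", "output": "Positive"},
    {"instruction": "Summarize.", "input": "Reach me at jane@example.com",
     "output": "A contact request."},
]
with open("audit_demo.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in rows)
print(audit_jsonl("audit_demo.jsonl"))
```

For near-duplicate (rather than exact) detection, swap the `seen` set for a MinHash index such as the one in the `datasketch` library.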
Where to Get Fine-Tuning Data
Option 1: Build it yourself
Use your domain experts to write instruction-response pairs. Expensive and slow, but produces the highest-quality data. Best for high-stakes domains where accuracy matters (medical, legal).
Option 2: Distill from larger models
Use a capable frontier model (Gemini, Claude) to generate responses to your instructions. Check licensing — OpenAI prohibits using GPT outputs to train competing models. Permissively-licensed frontier models exist for this use case.
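Structurally, distillation is a simple loop: send each instruction to the teacher model, save the pair in your training schema. In this sketch, `call_model` is a placeholder, not a real API — swap in the client for whichever permissively-licensed model you are using:

```python
import json

def call_model(prompt: str) -> str:
    """Placeholder for your frontier-model API call (hypothetical).
    Replace with the client for your chosen teacher model."""
    return "STUB RESPONSE for: " + prompt

instructions = [
    "Summarize the following support ticket in one sentence.",
    "Classify the sentiment of this review.",
]

# Distill: one teacher response per instruction, saved as Alpaca-style JSONL.
with open("distilled.jsonl", "w") as f:
    for inst in instructions:
        f.write(json.dumps({"instruction": inst, "input": "",
                            "output": call_model(inst)}) + "\n")
```

In practice you would add rate limiting, retries, and a spot-check pass over the generated outputs before training — distilled data inherits every error the teacher makes.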
Option 3: Buy a pre-labeled dataset
The fastest path from zero to fine-tuning. Marketplaces like LabelSets carry expert-curated JSONL datasets across medical, legal, customer service, code, and financial domains — tested for format consistency and PII-scanned before listing.
Option 4: Adapt public datasets
Hugging Face Hub has thousands of datasets. Many are usable for fine-tuning with a reformatting step — just check the license (CC-BY is fine, non-commercial licenses may block your use case).
Frequently Asked Questions
How much data do you need to fine-tune an LLM?
For instruction-following tasks, 1,000–10,000 high-quality examples is the practical range. Quality matters far more than quantity — 500 carefully curated examples will outperform 50,000 noisy ones.
What format does fine-tuning data need to be in?
JSONL (newline-delimited JSON) is the standard. Use alpaca-style (instruction/input/output fields) or OpenAI chat format (messages array with role/content). Most fine-tuning frameworks support both.
Can I use ChatGPT outputs to fine-tune another LLM?
OpenAI's terms of service prohibit using ChatGPT outputs to train competing models. For open-weight models, check the specific model license. Many fine-tuning datasets are instead generated with permissively-licensed open-weight models such as Mistral.
How do I know if my fine-tune is working?
Hold out 10–20% of your dataset as a validation set. Monitor validation loss during training — it should decrease, then plateau. If it starts increasing, you're overfitting. Qualitatively evaluate outputs on 20–50 held-out prompts before deploying.
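The hold-out step can be sketched in a few lines of standard-library Python (the function name and 15% default are illustrative; a fixed seed keeps the split reproducible across runs):

```python
import json
import random

def train_val_split(path, val_frac=0.15, seed=42):
    """Shuffle a JSONL dataset and hold out a validation fraction."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    random.Random(seed).shuffle(rows)  # fixed seed -> reproducible split
    n_val = max(1, round(len(rows) * val_frac))
    return rows[n_val:], rows[:n_val]  # (train, val)

# Demo with 20 toy records.
rows = [{"instruction": f"task {i}", "input": "", "output": str(i)} for i in range(20)]
with open("data.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in rows)
train, val = train_val_split("data.jsonl")
```

Split before training, never after — if validation examples leak into the training set, the validation loss will look better than the model actually is.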