Fraud detection is one of the highest-value ML applications in production. Banks, e-commerce platforms, and fintechs spend billions on it — yet finding quality labeled fraud detection training data is surprisingly hard.
Public datasets like the Kaggle credit card fraud dataset are useful for learning, but they have serious limitations: they're anonymized to the point of being uninterpretable, they're years old, and they don't reflect the fraud patterns hitting systems today. This guide covers what to look for in a fraud detection dataset and where to actually get one.
Why Fraud Detection Data Is Hard to Find
Three forces make fraud datasets scarce:
- Labeling is expensive. Real fraud labels require manual review by fraud analysts, chargebacks confirmed over months, and cross-referencing with external databases. This isn't automated.
- Data is sensitive. Fraud data contains PII — card numbers, names, IP addresses — that companies can't share externally without significant anonymization work.
- Fraud patterns change. A dataset from 2020 reflects fraud techniques from 2020. Adversarial drift means yesterday's model gets fooled by tomorrow's fraud ring.
This is why synthetic fraud datasets — built to statistically mirror real fraud patterns without containing real customer data — have become the practical standard for model development.
What Makes a Good Fraud Detection Dataset
1. Realistic Class Imbalance
Real fraud rates are typically 0.1% to 3% of transactions, depending on the channel. A dataset with 50% fraud is useless for training a production classifier — your model will never see that distribution in the wild. Look for datasets that preserve realistic imbalance, and make sure you know the fraud rate before buying.
2. Rich Feature Set
The most predictive fraud signals are behavioral and contextual, not just the transaction amount. A production-grade fraud dataset should include:
- Transaction amount, time, and merchant category
- Device fingerprint features (OS, browser, mobile vs. desktop)
- Geolocation signals (country, distance from home)
- Velocity features: number of transactions in the last 1h, 24h, 7d
- Account age and historical behavior features
- Binary fraud label (and ideally a fraud type label: card-not-present, account takeover, etc.)
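Velocity features like those above can be derived from a raw transaction log with pandas time-based rolling windows. A minimal sketch (the column names and toy data are illustrative assumptions, not from any specific dataset):

```python
import pandas as pd

# Toy transaction log. Sorting by account then time keeps the
# groupby/rolling result aligned with the frame when we assign it back.
txns = pd.DataFrame({
    "account_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:30", "2024-01-02 11:00",
        "2024-01-01 09:00", "2024-01-08 09:00",
    ]),
    "amount": [20.0, 35.0, 15.0, 100.0, 250.0],
}).sort_values(["account_id", "timestamp"])

# Trailing-24h transaction count per account, including the current row.
txns["txn_count_24h"] = (
    txns.set_index("timestamp")
        .groupby("account_id")["amount"]
        .rolling("24h")
        .count()
        .to_numpy()
)
```

The same pattern with `.sum()` instead of `.count()`, or a `"1h"`/`"7d"` window, yields the other velocity features.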
3. Temporal Ordering
Fraud detection is fundamentally a time-series problem. Your dataset needs to be ordered chronologically so you can do proper time-based train/test splits — never use random splits on fraud data, or future data will leak into your training set and inflate your test scores.
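A time-based split takes only a few lines of pandas. The `timestamp` column name and the 80/20 cutoff below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a timestamp column; sort, then split at a
# cutoff time so the test set lies strictly in the model's "future".
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1000, freq="h"),
    "amount": np.random.default_rng(0).lognormal(3, 1, 1000),
}).sort_values("timestamp")

cutoff = df["timestamp"].quantile(0.8)   # first 80% of the time range
train = df[df["timestamp"] <= cutoff]
test = df[df["timestamp"] > cutoff]
```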
4. Sufficient Scale
At a 1% fraud rate, you need at least 5,000 positive fraud examples to train a reliable model — which means at least 500,000 total transactions. Smaller datasets can work with SMOTE oversampling, but you'll sacrifice generalization.
| Dataset Size | Fraud Rate | Fraud Examples | Usability |
|---|---|---|---|
| < 10K rows | Any | < 300 | Learning only — not production |
| 50K–100K rows | 1–5% | 500–5,000 | Good for prototyping |
| 100K+ rows | 0.5–2% | 500–2,000+ | Production-viable with SMOTE |
| 500K+ rows | 0.1–1% | 500–5,000+ | Production-grade |
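The sizing arithmetic behind the table reduces to one line — a back-of-envelope check on dataset size, not a guarantee of model quality:

```python
def required_rows(min_fraud_examples, fraud_rate):
    """Total transactions needed to contain at least `min_fraud_examples`
    positives at a given fraud rate."""
    return int(min_fraud_examples / fraud_rate)

# 5,000 positives at a 1% fraud rate requires ~500,000 transactions.
required_rows(5000, 0.01)
```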
E-Commerce vs. Financial Fraud Datasets
The fraud detection problem looks quite different depending on the domain:
E-commerce fraud focuses on CNP (card-not-present) fraud, account takeover, return abuse, and promo abuse. Key features are shipping address, device data, purchase history, and whether the order is for a high-resale-value item.
Credit card / financial fraud focuses on card skimming, synthetic identity fraud, and money mule detection. Key features are transaction velocity, geographic anomalies, and merchant category codes (MCCs).
Network intrusion detection is adjacent — the labels are attack/benign rather than fraud/legitimate, but the ML problem (rare positive class, temporal data, high-dimensional features) is structurally similar.
LabelSets carries a 100,000-row E-Commerce Fraud Detection dataset ($49) with realistic 2.3% fraud rate, 28 behavioral features, and temporal ordering — ready to drop into scikit-learn or XGBoost. We also have a Network Intrusion Detection dataset (200K rows, $59) for security ML use cases. Browse all financial datasets →
Handling Class Imbalance in Training
Even with a good dataset, you'll need to address imbalance. The main approaches:
SMOTE Oversampling
Synthetic Minority Over-sampling Technique creates synthetic fraud examples by interpolating between existing fraud cases. Use it on the training set only — never on the test set.
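The interpolation idea can be sketched in plain NumPy — this is an illustration of the mechanism only; in practice, use imbalanced-learn's `SMOTE`, and fit it on the training split alone:

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=3, seed=0):
    """Minimal SMOTE-style interpolation (illustration, not the library
    implementation). Each synthetic point lies on the segment between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    # Pairwise distances among minority samples; ignore self-distance.
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per sample
    base = rng.integers(0, n, n_synthetic)       # random base samples
    nbr = nn[base, rng.integers(0, k, n_synthetic)]
    lam = rng.random((n_synthetic, 1))           # interpolation weight in [0, 1)
    return X_minority[base] + lam * (X_minority[nbr] - X_minority[base])

fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])
synthetic = smote_sketch(fraud, n_synthetic=10)
```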
Class Weights
Many sklearn classifiers accept a class_weight='balanced' parameter that automatically upweights the minority class. This is the simplest approach; note that sklearn's classic GradientBoostingClassifier doesn't accept it (HistGradientBoostingClassifier does), and XGBoost's equivalent knob is scale_pos_weight.
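A minimal sketch with scikit-learn's `LogisticRegression` (the synthetic data and roughly 1% positive rate below are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
X = rng.normal(size=(n, 4))
# Rare positive class (~1%), loosely driven by the first feature.
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 2.8).astype(int)

# 'balanced' reweights each class by n_samples / (n_classes * class_count),
# so the rare fraud class contributes equally to the loss.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```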
Threshold Tuning
Don't use 0.5 as your classification threshold. Tune it using a precision-recall curve against your business requirements: what false positive rate (legitimate transactions blocked) is acceptable relative to the fraud you catch?
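One way to pick the threshold from a precision-recall curve — the scores, labels, and 80% precision floor here are illustrative stand-ins for your held-out data and business constraint:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical held-out fraud scores and true labels.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.6, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Lowest threshold that still meets the precision floor — i.e. catch as
# much fraud as possible without blocking too many good transactions.
ok = precision[:-1] >= 0.8          # precision[i] pairs with thresholds[i]
best_threshold = thresholds[ok].min()
```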
Evaluation Metrics
Accuracy is meaningless for fraud detection. Use these instead:
- Precision-Recall AUC — best overall metric for imbalanced classification
- F1 score at your operating threshold — for a single-number business metric
- False Positive Rate — what % of good transactions does your model block?
- Recall at k% FPR — how much fraud do you catch with an acceptable block rate?
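These metrics are all available in or derivable from scikit-learn; a sketch with toy scores (recall at a fixed FPR is read off the ROC curve by interpolation):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_curve

# Toy imbalanced data: 95 legitimate, 5 fraud, with overlapping scores.
y_true = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(7)
scores = np.concatenate([rng.uniform(0.0, 0.6, 95), rng.uniform(0.4, 1.0, 5)])

# Precision-recall AUC (average precision).
pr_auc = average_precision_score(y_true, scores)

# Recall at 1% FPR: how much fraud is caught at an acceptable block rate.
fpr, tpr, _ = roc_curve(y_true, scores)
recall_at_1pct_fpr = np.interp(0.01, fpr, tpr)
```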
Where to Get Fraud Detection Training Data
Your options, ranked by practicality:
- Internal data — Best if you have it, but most startups won't accumulate enough labeled fraud history for months or years.
- Dataset marketplaces — Pre-labeled, ready to use. LabelSets carries several fraud datasets across e-commerce, credit, and network security domains.
- Synthetic data generation — Tools like CTGAN or Gretel.ai can generate synthetic fraud data, but require a seed dataset and careful validation.
- Public datasets — Kaggle's Credit Card Fraud (2013) and IEEE-CIS datasets are useful for benchmarking but not for production training due to age and anonymization.
Frequently Asked Questions
What is a good size for a fraud detection dataset?
For binary fraud classification, aim for at least 50,000 total transactions with a realistic fraud rate (0.1%–5%). Anything smaller tends to overfit, especially with tree-based models. For deep learning approaches, you'll want 500K+ rows.
How do you handle class imbalance in fraud detection?
Use SMOTE oversampling on the training set, class-weighted loss functions, or both. Evaluate with precision-recall AUC — not accuracy, which is misleading when 99% of your data is non-fraud.
What features should a fraud detection dataset have?
Transaction amount, timestamp, merchant category, device/IP features, velocity features (transactions in last 1h/24h), and a binary fraud label. Temporal ordering across the full dataset is critical for valid train/test splits.
Can I use synthetic fraud data for production models?
Yes — synthetic data generated from real fraud distributions (not random) works well, especially for bootstrapping a model before you have enough production labels. Validate on real labeled data whenever possible.