Fraud detection is one of the highest-value ML applications in production. Banks, e-commerce platforms, and fintechs spend billions on it — yet finding quality labeled fraud detection training data is surprisingly hard.

Public datasets like the Kaggle credit card fraud dataset are useful for learning, but they have serious limitations: they're anonymized to the point of being uninterpretable, they're years old, and they don't reflect the fraud patterns hitting systems today. This guide covers what to look for in a fraud detection dataset and where to actually get one.

Why Fraud Detection Data Is Hard to Find

Three forces make fraud datasets scarce:

  1. Privacy regulation — Transaction records are dense with PII, and frameworks like GDPR and PCI DSS make sharing them legally risky, so public releases get anonymized past the point of usefulness.
  2. Adversarial secrecy — Publishing real fraud patterns tells fraudsters exactly what gets caught, so banks and platforms keep their labeled data private.
  3. Label scarcity — Fraud confirmations arrive late (chargebacks can take months to settle), and positives are a tiny fraction of traffic, so even insiders accumulate labels slowly.

This is why synthetic fraud datasets — built to statistically mirror real fraud patterns without containing real customer data — have become the practical standard for model development.

What Makes a Good Fraud Detection Dataset

1. Realistic Class Imbalance

Real fraud rates are typically 0.1% to 3% of transactions, depending on the channel. A dataset with 50% fraud is useless for training a production classifier — your model will never see that distribution in the wild. Look for datasets that preserve realistic imbalance, and make sure you know the fraud rate before buying.

2. Rich Feature Set

The most predictive fraud signals are behavioral and contextual, not just the transaction amount. A production-grade fraud dataset should include:

  - Transaction basics: amount, timestamp, currency, and merchant category code (MCC)
  - Device and network context: device fingerprint, IP address, and geolocation
  - Velocity features: transaction counts and totals over trailing windows (last 1h, last 24h)
  - Account history: account age, prior purchase behavior, and typical ticket size
  - A binary fraud label, with temporal ordering preserved across the full dataset

3. Temporal Ordering

Fraud detection is fundamentally a time-series problem. Your dataset needs to be ordered chronologically so you can do proper time-based train/test splits. Never use random splits on fraud data, or information from future transactions will leak into your training set.
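A time-based split takes only a few lines of numpy. This sketch assumes a toy transaction log with a timestamp column; the 80/20 cutoff is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy transaction log: 30 days of timestamps (seconds) and a ~1% fraud label.
n = 10_000
timestamps = rng.uniform(0, 86_400 * 30, size=n)
labels = (rng.random(n) < 0.01).astype(int)

# Sort chronologically, then split on time -- never randomly.
order = np.argsort(timestamps)
timestamps, labels = timestamps[order], labels[order]

cutoff = int(0.8 * n)
train_idx, test_idx = np.arange(cutoff), np.arange(cutoff, n)

# Every training transaction precedes every test transaction.
assert timestamps[train_idx].max() <= timestamps[test_idx].min()
```

The final assertion is the property a random split destroys: no transaction in the training set occurs after any transaction in the test set.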

4. Sufficient Scale

At a 1% fraud rate, you need at least 5,000 positive fraud examples to train a reliable model — which means at least 500,000 total transactions. Smaller datasets can work with SMOTE oversampling, but you'll sacrifice generalization.

| Dataset Size | Fraud Rate | Fraud Examples | Usability |
| --- | --- | --- | --- |
| < 10K rows | Any | < 300 | Learning only — not production |
| 50K–100K rows | 1–5% | 500–5,000 | Good for prototyping |
| 100K+ rows | 0.5–2% | 1,000–2,000+ | Production-viable with SMOTE |
| 500K+ rows | 0.1–1% | 500–5,000+ | Production-grade |
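The table's thresholds fall out of one line of arithmetic: total rows needed = desired fraud examples / fraud rate. A quick sanity check (the helper name is ours):

```python
import math

def required_rows(min_fraud_examples: int, fraud_rate: float) -> int:
    """Total transactions needed to expect at least `min_fraud_examples` positives."""
    return math.ceil(min_fraud_examples / fraud_rate)

print(required_rows(5_000, 0.01))   # 500000 -- 5,000 positives at a 1% fraud rate
print(required_rows(500, 0.001))    # 500000 -- 500 positives at a 0.1% fraud rate
```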

E-Commerce vs. Financial Fraud Datasets

The fraud detection problem looks quite different depending on the domain:

E-commerce fraud focuses on CNP (card-not-present) fraud, account takeover, return abuse, and promo abuse. Key features are shipping address, device data, purchase history, and whether the order is for a high-resale-value item.

Credit card / financial fraud focuses on card skimming, synthetic identity fraud, and money mule detection. Key features are transaction velocity, geographic anomalies, and merchant category codes (MCCs).

Network intrusion detection is adjacent — the labels are attack/benign rather than fraud/legitimate, but the ML problem (rare positive class, temporal data, high-dimensional features) is structurally similar.

LabelSets carries a 100,000-row E-Commerce Fraud Detection dataset ($49) with a realistic 2.3% fraud rate, 28 behavioral features, and temporal ordering — ready to drop into scikit-learn or XGBoost. We also have a Network Intrusion Detection dataset (200K rows, $59) for security ML use cases. Browse all financial datasets →

Handling Class Imbalance in Training

Even with a good dataset, you'll need to address imbalance. The main approaches:

SMOTE Oversampling

Synthetic Minority Over-sampling Technique creates synthetic fraud examples by interpolating between existing fraud cases. Use it on the training set only — never on the test set.
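In practice you would reach for imbalanced-learn's `SMOTE`, but the interpolation idea fits in a few lines of numpy. This is a simplified sketch (the function name and parameters are ours), not the library implementation:

```python
import numpy as np

def smote_sketch(X_minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Minimal SMOTE: interpolate between each fraud example and one of its
    k nearest fraud-class neighbours. Illustrative only."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]          # k nearest per point

    samples = []
    for _ in range(n_new):
        i = rng.integers(n)                            # random fraud example
        j = neighbours[i, rng.integers(k)]             # one of its neighbours
        lam = rng.random()                             # interpolation weight
        samples.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(samples)

# 20 fraud rows with 4 features -> 100 synthetic fraud rows.
fraud = np.random.default_rng(1).normal(size=(20, 4))
synthetic = smote_sketch(fraud, n_new=100)
print(synthetic.shape)   # (100, 4)
```

Because every synthetic row is a convex combination of two real fraud rows, the new points stay inside the region the real fraud cases occupy — which is also why SMOTE must only ever see the training set.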

Class Weights

Most sklearn classifiers accept a class_weight='balanced' parameter that automatically upweights the minority class. This is the simplest approach. Note that sklearn's GradientBoostingClassifier does not take class_weight; for gradient boosting, use HistGradientBoostingClassifier (which supports class_weight from sklearn 1.2) or XGBoost's scale_pos_weight, typically set to the ratio of negative to positive examples.
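Here is what 'balanced' computes under the hood, and how it plugs into an estimator (the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 4))
y = (rng.random(5_000) < 0.02).astype(int)   # ~2% positives, illustrative

# 'balanced' weights are n_samples / (n_classes * count_per_class):
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))
assert np.allclose(weights, manual)

# The same string plugs straight into most sklearn estimators:
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

At a 2% fraud rate this gives the fraud class roughly 25x the weight of the legitimate class, so misclassified fraud dominates the loss.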

Threshold Tuning

Don't use 0.5 as your classification threshold. Tune it using a precision-recall curve against your business requirements: what false positive rate (legitimate transactions blocked) is acceptable relative to the fraud you catch?
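One way to operationalize this with scikit-learn's precision_recall_curve; the synthetic scores and the 50% precision floor are illustrative stand-ins for your model's outputs and your real business constraint:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Fake model scores: frauds (y=1) score higher on average (illustrative).
y_true = (rng.random(10_000) < 0.02).astype(int)
scores = rng.normal(loc=2.0 * y_true, scale=1.0)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Lowest threshold that still clears a 50% precision floor, which
# maximizes recall subject to that constraint.
ok = precision[:-1] >= 0.5
chosen = thresholds[ok][0]
print(f"threshold={chosen:.2f}  recall={recall[:-1][ok][0]:.2%}")
```

Swapping the precision floor for a false-positive-rate ceiling works the same way; the point is that the operating threshold is a business decision, not the 0.5 default.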

Evaluation Metrics

Accuracy is meaningless for fraud detection: a model that flags nothing as fraud scores 99%+ accuracy on realistically imbalanced data. Use these instead:

  - Precision-recall AUC (average precision): the most informative single number for rare-positive problems
  - Recall at a fixed false positive rate: ties evaluation directly to how many legitimate customers you're willing to inconvenience
  - Precision and recall at your chosen operating threshold
  - ROC-AUC: widely reported, but optimistic under heavy imbalance — report it alongside PR-AUC, not instead of it
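To see why accuracy misleads, compare it with average precision on a degenerate scorer; the ~2% fraud rate toy sample is illustrative:

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.02).astype(int)   # ~2% fraud, illustrative

# A useless "model" that gives every transaction the same score:
useless_scores = np.zeros(10_000)

# Hard predictions of "never fraud" look superb on accuracy...
acc = accuracy_score(y_true, np.zeros(10_000, dtype=int))

# ...but average precision collapses to the base fraud rate.
ap = average_precision_score(y_true, useless_scores)

print(f"accuracy={acc:.3f}  average_precision={ap:.3f}")   # ~0.98 vs ~0.02
```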

Where to Get Fraud Detection Training Data

Your options, ranked by practicality:

  1. Internal data — Best if you have it, but most startups won't accumulate enough labeled fraud history until they've been operating for months or years.
  2. Dataset marketplaces — Pre-labeled, ready to use. LabelSets carries several fraud datasets across e-commerce, credit, and network security domains.
  3. Synthetic data generation — Tools like CTGAN or Gretel.ai can generate synthetic fraud data, but require a seed dataset and careful validation.
  4. Public datasets — Kaggle's Credit Card Fraud (2013) and IEEE-CIS datasets are useful for benchmarking but not for production training due to age and anonymization.

Frequently Asked Questions

What is a good size for a fraud detection dataset?

For binary fraud classification, aim for at least 50,000 total transactions with a realistic fraud rate (0.1%–5%). Anything smaller tends to overfit, especially with tree-based models. For deep learning approaches, you'll want 500K+ rows.

How do you handle class imbalance in fraud detection?

Use SMOTE oversampling on the training set, class-weighted loss functions, or both. Evaluate with precision-recall AUC — not accuracy, which is misleading when 99% of your data is non-fraud.

What features should a fraud detection dataset have?

Transaction amount, timestamp, merchant category, device/IP features, velocity features (transactions in last 1h/24h), and a binary fraud label. Temporal ordering across the full dataset is critical for valid train/test splits.

Can I use synthetic fraud data for production models?

Yes — synthetic data generated from real fraud distributions (not random) works well, especially for bootstrapping a model before you have enough production labels. Validate on real labeled data whenever possible.