Medical imaging AI sits at the intersection of two difficult problems: the technical challenge of training models on limited, expensive-to-annotate data, and the regulatory challenge of navigating patient privacy requirements, institutional review, and commercial licensing in a domain where the stakes of a wrong prediction are high. This guide is honest about both.

We cover the compliance requirements you need to understand before using any medical imaging dataset, the major imaging modalities and what's available for each, and the distinction between datasets that require institutional access and those with truly open or commercial licenses.

Compliance First: HIPAA, IRB, and De-identification

Before looking at any specific dataset, understand the compliance landscape. In the US, HIPAA governs how protected health information (PHI) may be used and shared; de-identification (removal of the identifiers enumerated under the Safe Harbor standard, or an expert-determination analysis) is what makes imaging data shareable at all; and IRB (Institutional Review Board) approval governs its use in human-subjects research. This matters whether you're at a startup, an established medical device company, or an academic institution.

The practical upshot: for research and internal development, many publicly available datasets are accessible after a data use agreement (DUA). For commercial deployment, the requirements are stricter — you typically need data with explicit commercial use rights, or you need to license data directly from the originating institution. We flag which category each dataset falls into below.
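Before any dataset (licensed or public) goes near a training pipeline, it is worth scanning headers for residual identifiers. The sketch below is a minimal, pure-Python illustration of that kind of check over DICOM-style header fields; the field names are real DICOM attribute keywords, but the list is illustrative and far from a complete HIPAA Safe Harbor audit, and a real pipeline would read headers with a DICOM library rather than a plain dict.

```python
# Minimal sketch: flag DICOM header fields that commonly carry PHI.
# The field list is illustrative, NOT a complete Safe Harbor checklist.

PHI_FIELDS = {
    "PatientName", "PatientBirthDate", "PatientAddress",
    "PatientID", "InstitutionName", "ReferringPhysicianName",
    "StudyDate",  # dates more precise than year count as identifiers
}

def find_phi(header: dict) -> list[str]:
    """Return PHI-candidate fields that are present and non-empty."""
    return sorted(f for f in PHI_FIELDS if header.get(f))

header = {
    "PatientName": "DOE^JANE",
    "PatientBirthDate": "",          # already blanked by the exporter
    "Modality": "CR",
    "StudyDate": "20240117",
}
print(find_phi(header))  # ['PatientName', 'StudyDate']
```

A check like this catches the obvious header leaks; it does nothing about burned-in annotations on the pixel data, which need separate review.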

X-ray Datasets

NIH Chest X-Ray14 (ChestXray14)

112,000 chest X-rays · 14 thoracic pathology labels · NIH license · Research DUA required

ChestXray14 is one of the largest publicly available chest X-ray datasets: 112,000 frontal-view X-rays from 30,000 unique patients, labeled for 14 thoracic pathologies (pneumonia, effusion, pneumothorax, etc.). Labels were derived programmatically from radiology reports using NLP, not hand-labeled by radiologists. This creates a significant caveat: label accuracy is estimated at 80–85% for some conditions and lower for others, a limitation the NIH itself documents. Models trained on ChestXray14 score well on the benchmark but have repeatedly shown lower real-world accuracy than the benchmark numbers suggest. Requires a data use agreement with NIH; not available for commercial product deployment without additional licensing.
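The label-noise caveat has a concrete consequence for evaluation: when the benchmark labels themselves are only 80–85% correct, measured accuracy is pulled toward chance. The back-of-envelope sketch below quantifies this under the simplifying assumption of symmetric label noise; the numbers are illustrative, not a published correction for ChestXray14.

```python
# Back-of-envelope sketch: how noisy evaluation labels distort benchmark
# accuracy. Assumes symmetric label noise, which is a simplification.

def observed_accuracy(true_acc: float, label_acc: float) -> float:
    """Benchmark accuracy when eval labels are only `label_acc` correct."""
    e = 1.0 - label_acc                  # chance a benchmark label is wrong
    return true_acc * (1 - e) + (1 - true_acc) * e

# A model that is truly 90% accurate, scored against 85%-accurate labels:
print(round(observed_accuracy(0.90, 0.85), 3))  # 0.78
```

The same arithmetic runs in reverse: a headline benchmark number on noisy labels is compatible with a meaningfully higher or lower true error rate, which is why radiologist-verified test sets matter before deployment claims.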

CheXpert (Stanford)

224,000 chest X-rays · 14 labels · Stanford DUA · Uncertainty labels included

CheXpert improves on ChestXray14 in one important way: it explicitly models label uncertainty. Because chest X-ray labeling is genuinely ambiguous (a radiologist may note "possible pneumonia" rather than a binary positive/negative), CheXpert labels include a U (uncertain) class for conditions where the report is ambiguous. This is more realistic than forcing binary labels on uncertain cases. The dataset has 224,000 X-rays with labels derived from radiology reports. Requires registration and a data use agreement with Stanford; commercial licensing is handled separately and requires direct engagement with Stanford OTL (Office of Technology Licensing).
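In practice, training code has to decide what to do with the uncertain class. The sketch below shows the two simplest policies discussed in the CheXpert literature, commonly called "U-ones" and "U-zeros" (map uncertain to positive or to negative); the encoding here (1 positive, 0 negative, -1 uncertain) follows common convention, so adapt it to your own label format.

```python
# Sketch of the common "U-ones" / "U-zeros" policies for CheXpert-style
# labels: 1 = positive, 0 = negative, -1 = uncertain mention in the report.

def map_uncertain(labels: list[int], policy: str = "u-ones") -> list[int]:
    mapped = []
    for y in labels:
        if y == -1:                       # uncertain: resolve per policy
            mapped.append(1 if policy == "u-ones" else 0)
        else:
            mapped.append(y)
    return mapped

labels = [1, 0, -1, -1, 1]
print(map_uncertain(labels, "u-ones"))   # [1, 0, 1, 1, 1]
print(map_uncertain(labels, "u-zeros"))  # [1, 0, 0, 0, 1]
```

Which policy works better varies by pathology; a third option, ignoring uncertain examples in the loss, trades data volume for label cleanliness.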

CT Scan Datasets

LIDC-IDRI (Lung Image Database Consortium)

1,018 CT scans · 4 radiologist annotations per nodule · CC BY 3.0 · Publicly available

LIDC-IDRI is among the best-annotated publicly available medical imaging datasets. Each CT scan was independently annotated by four radiologists, who marked and characterized pulmonary nodules — giving you inter-annotator agreement data that's rare in medical imaging. The CC BY 3.0 license is more permissive than most medical imaging datasets, allowing commercial use with attribution. The main limitation is size — 1,018 CT scans is small for modern deep learning without augmentation and transfer learning. But the annotation quality and multi-reader design make it valuable as a validation set and fine-tuning target even when pre-training on larger, noisier datasets.
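The four-reader design means you choose how to combine annotations into a training target. A common approach is a voxel-wise agreement threshold; the sketch below illustrates it on flat 0/1 lists for brevity (real LIDC annotations are per-slice contours you would rasterize into masks first).

```python
# Sketch: building a consensus nodule mask from LIDC-style multi-reader
# annotations by agreement threshold. Flat 0/1 lists stand in for masks.

def consensus(masks: list[list[int]], threshold: float = 0.5) -> list[int]:
    """Voxel is positive if at least `threshold` of readers marked it."""
    n = len(masks)
    return [1 if sum(col) / n >= threshold else 0 for col in zip(*masks)]

readers = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
print(consensus(readers))        # [1, 1, 0, 0] -- 50% agreement rule
print(consensus(readers, 1.0))   # [1, 0, 0, 0] -- all four must agree
```

The threshold is a real modeling decision: a strict consensus gives cleaner positives but shrinks lesion boundaries, while a 25% rule captures every reader's opinion at the cost of noisier edges.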

CT Medical Image Segmentation Challenge Datasets (MICCAI)

Varies by year · Task-specific · Research licenses · High-quality annotations

The MICCAI (Medical Image Computing and Computer Assisted Intervention) conference runs annual segmentation challenges covering liver, pancreas, cardiac, brain, and other organ segmentation tasks. Challenge datasets are labeled by clinical experts under specific annotation protocols; the quality is substantially higher than retrospective datasets labeled by NLP. They typically require registration and are released under a non-commercial research license, though individual challenge organizers sometimes offer commercial licensing on request. For the specific segmentation task your model targets, finding the corresponding MICCAI challenge dataset and negotiating commercial access directly is often the most practical path.
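The standard headline metric for these challenges is the Dice score. Here is a minimal flat-list sketch of it; actual challenge evaluations layer per-case aggregation and surface-distance metrics (e.g. Hausdorff distance) on top.

```python
# Dice score, the standard overlap metric in MICCAI-style segmentation
# challenges. Minimal sketch over flat 0/1 lists.

def dice(pred: list[int], gt: list[int]) -> float:
    inter = sum(p * g for p, g in zip(pred, gt))
    denom = sum(pred) + sum(gt)
    return 2 * inter / denom if denom else 1.0   # both empty: perfect score

pred = [1, 1, 0, 1, 0]
gt   = [1, 0, 0, 1, 1]
print(round(dice(pred, gt), 3))  # 0.667
```

Note the empty-mask convention: when both prediction and ground truth are empty the score here is defined as 1.0, but challenges differ on this edge case, so check the evaluation protocol of the specific challenge you target.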

MRI Datasets

BraTS (Brain Tumor Segmentation)

~1,000+ cases annually · Multi-modal MRI · Research DUA · Expert annotations

BraTS is the standard benchmark for brain tumor segmentation, providing multi-modal MRI (T1, T1ce, T2, T2-FLAIR) with manual annotations of tumor sub-regions. The annotation protocol is rigorous — a consensus of neuroradiologists — and the dataset grows each year as more institutions contribute. Requires registration through the Synapse platform and a data use agreement. Commercial use is not covered by the standard DUA; teams building commercial neuro-imaging products typically need to source data through direct hospital partnerships or commercial data vendors.
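The annotated sub-regions are conventionally composed into three nested evaluation targets (whole tumor, tumor core, enhancing tumor). The sketch below uses the long-standing BraTS label numbering (1 = necrotic core, 2 = edema, 4 = enhancing tumor); recent BraTS releases have renumbered the labels, so verify the convention in the documentation for the year you download.

```python
# Sketch: composing BraTS tumor sub-regions from a label map.
# Label convention assumed: 1 = necrotic core, 2 = edema, 4 = enhancing.

REGIONS = {
    "whole_tumor":     {1, 2, 4},   # everything annotated
    "tumor_core":      {1, 4},      # excludes edema
    "enhancing_tumor": {4},
}

def binarize(label_map: list[int], region: str) -> list[int]:
    keep = REGIONS[region]
    return [1 if v in keep else 0 for v in label_map]

labels = [0, 2, 2, 1, 4, 0]
print(binarize(labels, "whole_tumor"))      # [0, 1, 1, 1, 1, 0]
print(binarize(labels, "tumor_core"))       # [0, 0, 0, 1, 1, 0]
print(binarize(labels, "enhancing_tumor"))  # [0, 0, 0, 0, 1, 0]
```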

Digital Pathology Datasets

TCGA (The Cancer Genome Atlas) Pathology Slides

30,000+ whole-slide images · 33 cancer types · NIH Data Use Agreement · Multi-institutional

TCGA is the largest publicly available repository of cancer genomics and digital pathology, containing whole-slide images (WSIs) from over 11,000 patients across 33 cancer types. The pathology slides are not uniformly annotated — most come with diagnostic labels (cancer type, grade) rather than detailed pixel-level annotations. Researchers often generate annotations from TCGA slides for specific tasks. NIH data use agreement is required; commercial use is not permitted under the standard DUA. For building commercial pathology AI products, institutional partnerships and licensed pathology data vendors are the standard path.
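Because TCGA slides carry slide-level rather than pixel-level labels, the usual workflow is to tile each gigapixel WSI into patches and train under weak supervision. The sketch below covers only the coordinate math for tiling; actual pixel reads would go through a WSI library such as OpenSlide (not shown), and the dimensions here are toy values.

```python
# Sketch: generating patch coordinates for tiling a whole-slide image.
# Coordinate math only; real slides are gigapixel and read lazily.

def tile_coords(width: int, height: int, patch: int, stride: int):
    """Yield top-left (x, y) of every full patch inside the slide."""
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield (x, y)

# A toy 1024 x 768 region, 256 px patches, no overlap:
coords = list(tile_coords(1024, 768, 256, 256))
print(len(coords))   # 12 tiles (4 across x 3 down)
print(coords[:3])    # [(0, 0), (256, 0), (512, 0)]
```

In practice a tissue-detection pass filters out the mostly-blank background tiles before any of these coordinates reach the model.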

LabelSets Medical Imaging Datasets

De-identified · Commercial license available · LQS quality-scored · Instant download where available

LabelSets lists medical imaging datasets that have cleared commercial licensing review — de-identified, with verified consent or waiver documentation, and with explicit commercial use rights. These are not the same breadth as the research datasets above, and we want to be direct about that: the commercially licensable medical imaging market is thin. What's available covers specific modalities and conditions. Each listing documents the de-identification methodology, the labeling protocol, the annotator credentials (radiologist-labeled vs. AI-assisted with radiologist review), and the geographic coverage of the source data. For teams that need defensible commercial data today, this is the most transparent path. For teams willing to work through institutional partnerships, direct hospital data licensing typically unlocks more volume and modality coverage.

A Realistic View of the Commercial Medical Imaging Data Market

The honest picture: building commercial medical imaging AI is data-hard in a way that most other ML domains aren't. The reasons are structural: patient privacy law restricts what can be shared at all; most public datasets are released under research-only DUAs that exclude commercial use; high-quality labels require scarce, expensive clinical expertise rather than crowd annotation; and commercial rights stay with the originating institutions, which license data case by case.

This means that most serious commercial medical imaging AI companies do one of three things: build data partnerships with hospital systems from the beginning, commission custom annotation projects through specialized medical AI data vendors, or operate in research mode (using DUA-restricted data) until they have the scale and institutional relationships to move to commercial data licensing.

Working with medical imaging data you've already collected or licensed? The free LQS quality audit tool checks annotation completeness, format compliance (DICOM and common export formats), and class distribution before you commit to a training run. No account required.
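As a taste of what a class-distribution audit looks like, here is a minimal sketch (this is an illustration, not the LQS tool itself): count label frequencies and flag classes too rare to learn reliably. The 1% threshold and the label names are arbitrary assumptions for the example.

```python
# Minimal sketch of a class-distribution audit: flag classes whose share
# of the dataset falls below a (here arbitrary) 1% threshold.
from collections import Counter

def class_report(labels: list[str], min_fraction: float = 0.01):
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total, n / total < min_fraction)
            for cls, n in counts.most_common()}

labels = ["no_finding"] * 900 + ["effusion"] * 95 + ["pneumothorax"] * 5
for cls, (n, frac, too_rare) in class_report(labels).items():
    print(f"{cls:14s} n={n:4d} frac={frac:.3f} rare={too_rare}")
```

Rare positive classes are the norm in medical imaging, so a flag here usually means planning for oversampling, loss weighting, or more targeted data collection rather than dropping the class.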