Medical imaging AI sits at the intersection of two difficult problems: the technical challenge of training models on limited, expensive-to-annotate data, and the regulatory challenge of navigating patient privacy requirements, institutional review, and commercial licensing in a domain where the stakes of a wrong prediction are high. This guide is honest about both.
We cover the compliance requirements you need to understand before using any medical imaging dataset, the major imaging modalities and what's available for each, and the distinction between datasets that require institutional access and those with truly open or commercial licenses.
Compliance First: HIPAA, IRB, and De-identification
Before looking at any specific dataset, understand the compliance landscape. This matters whether you're at a startup, an established medical device company, or an academic institution.
- HIPAA (Health Insurance Portability and Accountability Act) — US law governing protected health information (PHI). Applies to covered entities (healthcare providers, insurers) and their business associates. Training data derived from patient records — including medical images with associated clinical metadata — is subject to HIPAA if it's sourced from covered entities. The key requirement: data must be de-identified before use in research or commercial applications, either through expert determination or the Safe Harbor method (removing 18 specific identifiers).
- De-identification in imaging — DICOM files (the standard medical imaging format) embed patient metadata in header fields. A "de-identified" medical imaging dataset should have all DICOM headers scrubbed, burned-in text removed from images (common in screenshots and some modalities), and any linked metadata (radiology reports, clinical notes) anonymized separately. Don't assume a dataset is de-identified because the files are labeled as such — verify the methodology.
- IRB (Institutional Review Board) approval — Research involving human subjects (including retrospective use of patient data) typically requires IRB approval at the institution where the data originated. Many publicly available medical imaging datasets come with an associated IRB protocol number — this documents that the original data collection was approved. Using the data for commercial purposes may require additional agreements beyond what the IRB covered.
- FDA considerations — If your model is intended for clinical decision support or medical device software (Software as a Medical Device / SaMD), the FDA's AI/ML-based SaMD guidance and 510(k) / De Novo pathways become relevant. Training data quality and documentation are part of regulatory submissions.
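The DICOM header scrubbing described above can be sketched in a few lines. This is a format-agnostic illustration: in practice you would load headers with a library like pydicom and follow the DICOM PS3.15 de-identification profiles, and the field list here is a small illustrative subset, not the full Safe Harbor set.

```python
# Illustrative subset of PHI-bearing header fields -- a real de-identification
# pass must follow DICOM PS3.15 Annex E / HIPAA Safe Harbor, not this list.
PHI_FIELDS = {
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "InstitutionName", "ReferringPhysicianName",
}

def scrub_headers(headers: dict) -> dict:
    """Return a copy with PHI fields blanked; clinical fields left untouched."""
    return {k: ("" if k in PHI_FIELDS else v) for k, v in headers.items()}

hdr = {"PatientName": "Doe^Jane", "PatientID": "12345", "Modality": "CT"}
clean = scrub_headers(hdr)
print(clean)  # {'PatientName': '', 'PatientID': '', 'Modality': 'CT'}
```

Note that header scrubbing alone is not enough: burned-in text in the pixel data and any linked reports still need separate treatment, as described above.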
The practical upshot: for research and internal development, many publicly available datasets are accessible after a data use agreement (DUA). For commercial deployment, the requirements are stricter — you typically need data with explicit commercial use rights, or you need to license data directly from the originating institution. We flag which category each dataset falls into below.
X-ray Datasets
NIH Chest X-Ray14 (ChestXray14)
Large scale | Research only, DUA required

ChestXray14 is one of the largest publicly available chest X-ray datasets, with 112,000 frontal-view X-rays from 30,000 unique patients labeled for 14 thoracic pathologies (pneumonia, effusion, pneumothorax, etc.). Labels were derived programmatically from radiology reports using NLP, not hand-labeled by radiologists. This creates a significant caveat, documented by the NIH itself: label accuracy is estimated at 80–85% for some conditions and lower for others. Models trained on ChestXray14 score well on benchmark metrics but have been shown to be less accurate in real-world use than those numbers suggest. Requires a data use agreement with NIH; not available for commercial product deployment without additional licensing.
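ChestXray14 distributes its labels as pipe-separated finding strings (e.g. "Effusion|Pneumonia", or "No Finding" for normal studies), which most training pipelines convert to multi-hot vectors. A minimal sketch; the pathology names follow the published label set, but treat any CSV parsing details around this as assumptions:

```python
# The 14 ChestXray14 pathology labels, in a fixed order for vectorization.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]

def to_multi_hot(finding_labels: str) -> list[int]:
    """Convert 'Effusion|Pneumonia' (or 'No Finding') to a 14-dim vector."""
    findings = set(finding_labels.split("|"))
    return [1 if p in findings else 0 for p in PATHOLOGIES]

vec = to_multi_hot("Effusion|Pneumonia")   # two positive entries
normal = to_multi_hot("No Finding")        # all zeros
```

The multi-hot form makes the NLP-derived label noise discussed above concrete: each of the 14 bits inherits the extraction error of its source reports independently.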
CheXpert (Stanford)
Uncertainty labels | Research, DUA required

CheXpert improves on ChestXray14 in one important way: it explicitly models label uncertainty. Because chest X-ray labeling is genuinely ambiguous (a radiologist may note "possible pneumonia" rather than a binary positive/negative), CheXpert labels include a U (uncertain) class for conditions where the report is ambiguous. This is more realistic than forcing binary labels on uncertain cases. The dataset has 224,000 X-rays with labels derived from radiology reports. Requires registration and a data use agreement with Stanford; commercial licensing is handled separately and requires direct engagement with Stanford OTL (Office of Technology Licensing).
CT Scan Datasets
LIDC-IDRI (Lung Image Database Consortium)
Multi-annotator consensus | CC BY 3.0

LIDC-IDRI is among the best-annotated publicly available medical imaging datasets. Each CT scan was independently annotated by four radiologists, who marked and characterized pulmonary nodules, giving you inter-annotator agreement data that's rare in medical imaging. The CC BY 3.0 license is more permissive than most medical imaging datasets, allowing commercial use with attribution. The main limitation is size: 1,018 CT scans is small for modern deep learning without augmentation and transfer learning. But the annotation quality and multi-reader design make it valuable as a validation set and fine-tuning target even when pre-training on larger, noisier datasets.
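A common convention with the four independent reads is to keep only nodules marked by at least 3 of the 4 radiologists. The sketch below uses shared nodule IDs as a toy stand-in; real LIDC consensus is computed by spatially matching contours across readers, not by IDs:

```python
from collections import Counter

def consensus_nodules(reader_marks: list[set[str]], min_readers: int = 3) -> set[str]:
    """Keep nodule IDs marked by at least `min_readers` of the readers.
    Toy simplification: real LIDC matching overlaps contours, not IDs."""
    counts = Counter(nid for marks in reader_marks for nid in marks)
    return {nid for nid, c in counts.items() if c >= min_readers}

reads = [{"n1", "n2"}, {"n1", "n2"}, {"n1"}, {"n1", "n3"}]
consensus_nodules(reads)  # {'n1'}: n2 has only 2/4 reads, n3 only 1/4
```

The threshold is a modeling choice: lower it and you trade label precision for recall, which is precisely the trade-off the multi-reader design lets you measure.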
CT Medical Image Segmentation Challenge Datasets (MICCAI)
Expert annotations | Research use, varies by challenge

The MICCAI (Medical Image Computing and Computer Assisted Intervention) conference annually runs segmentation challenges covering liver, pancreas, cardiac, brain, and other organ segmentation tasks. Challenge datasets are labeled by clinical experts following specific annotation protocols; the quality is substantially higher than retrospective datasets labeled by NLP. They typically require registration and are released under a non-commercial research license. Individual challenge organizers sometimes offer commercial licensing on request. For the specific segmentation task your model targets, finding the corresponding MICCAI challenge dataset and negotiating commercial access directly is often the most practical path.
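Segmentation challenges are typically scored with the Dice coefficient, 2|A∩B| / (|A| + |B|), between the predicted and expert-annotated masks. A minimal version for binary masks:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|).
    eps avoids division by zero when both masks are empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
dice(a, b)  # 2*2 / (3+3) ≈ 0.667
```

Individual challenges often add task-specific metrics (surface distances, per-structure averaging), so check the evaluation protocol of the specific challenge you target.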
MRI Datasets
BraTS (Brain Tumor Segmentation)
Multi-modal (T1, T2, FLAIR) | Research license

BraTS is the standard benchmark for brain tumor segmentation, providing multi-modal MRI (T1, T1ce, T2, T2-FLAIR) with manual annotations of tumor sub-regions. The annotation protocol is rigorous, based on consensus among neuroradiologists, and the dataset grows each year as more institutions contribute. Requires registration through the Synapse platform and a data use agreement. Commercial use is not covered by the standard DUA; teams building commercial neuro-imaging products typically need to source data through direct hospital partnerships or commercial data vendors.
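Because the four BraTS modalities are co-registered, a common preprocessing step normalizes each modality independently and stacks them as input channels. A sketch with toy shapes (real volumes are full 3D brain scans); the per-modality z-scoring shown is one common choice, not a BraTS requirement:

```python
import numpy as np

def stack_modalities(vols: list[np.ndarray]) -> np.ndarray:
    """Z-score each modality independently, then stack on a channel axis
    (channel-first, as most segmentation networks expect)."""
    normed = [(v - v.mean()) / (v.std() + 1e-8) for v in vols]
    return np.stack(normed, axis=0)

# Toy stand-ins for co-registered T1, T1ce, T2, and FLAIR volumes.
t1, t1ce, t2, flair = (np.random.rand(8, 8, 8) for _ in range(4))
x = stack_modalities([t1, t1ce, t2, flair])
x.shape  # (4, 8, 8, 8)
```

Per-modality normalization matters because MRI intensities are not on a calibrated scale; the same tissue can have very different raw values across scanners and sequences.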
Digital Pathology Datasets
TCGA (The Cancer Genome Atlas) Pathology Slides
Large scale, cancer type diversity | Research DUA, no commercial use

TCGA is the largest publicly available repository of cancer genomics and digital pathology, containing whole-slide images (WSIs) from over 11,000 patients across 33 cancer types. The pathology slides are not uniformly annotated; most come with diagnostic labels (cancer type, grade) rather than detailed pixel-level annotations. Researchers often generate annotations from TCGA slides for specific tasks. An NIH data use agreement is required; commercial use is not permitted under the standard DUA. For building commercial pathology AI products, institutional partnerships and licensed pathology data vendors are the standard path.
LabelSets Medical Imaging Datasets
Commercial license | Verified de-identification

LabelSets lists medical imaging datasets that have cleared commercial licensing review: de-identified, with verified consent or waiver documentation, and with explicit commercial use rights. These do not match the breadth of the research datasets above, and we want to be direct about that: the commercially licensable medical imaging market is thin. What's available covers specific modalities and conditions. Each listing documents the de-identification methodology, the labeling protocol, the annotator credentials (radiologist-labeled vs. AI-assisted with radiologist review), and the geographic coverage of the source data. For teams that need defensible commercial data today, this is the most transparent path. For teams willing to work through institutional partnerships, direct hospital data licensing typically unlocks more volume and modality coverage.
A Realistic View of the Commercial Medical Imaging Data Market
The honest picture: building commercial medical imaging AI is data-hard in a way that most other ML domains aren't. The reasons are structural:
- Patient consent and HIPAA create real constraints on data sharing — not bureaucratic obstacles, but genuine protections for people whose health data is being used. Respecting this is both a legal requirement and the right thing to do.
- Expert annotation is expensive and scarce. A radiologist who can annotate chest CT scans is billing $300–$500/hour for their clinical time. Building a large, expert-annotated dataset is costly in ways that general CV annotation isn't.
- Institutional data is controlled by hospitals and health systems that have their own legal teams, IRBs, and policies. Getting commercial data rights out of an institution is a multi-step process that takes months.
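The annotation-cost point above is easy to make concrete with back-of-envelope arithmetic using the $300–$500/hour figure; the minutes-per-scan number here is an illustrative assumption, and real projects add review passes and adjudication on top:

```python
def annotation_cost(n_scans: int, minutes_per_scan: float, rate_per_hour: float) -> float:
    """Rough annotation cost estimate: scans x time per scan x hourly rate."""
    return n_scans * (minutes_per_scan / 60.0) * rate_per_hour

# 10,000 CT scans at an assumed 15 minutes each, $400/hour:
annotation_cost(10_000, 15, 400)  # 1,000,000.0 dollars
```

Even before consensus reads or quality review, a mid-sized expert-annotated dataset lands in seven figures, which is why multi-reader designs like LIDC-IDRI's are so rare.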
This means that most serious commercial medical imaging AI companies either build data partnerships with hospital systems from the beginning, commission custom annotation projects through specialized medical AI data vendors, or operate in research mode (using DUA-restricted data) until they have the scale and institutional relationships to move to commercial data licensing.
Working with medical imaging data you've already collected or licensed? The free LQS quality audit tool checks annotation completeness, format compliance (DICOM and common export formats), and class distribution before you commit to a training run. No account required.