MIMIC-IV — a procurement-grade audit under the FDA 21 CFR 11 / HIPAA / §1557 lens.
Composite 93 / 100. Platinum tier, the first in the audit series. MIT Lab for Computational Physiology's clinical dataset is structurally what procurement-grade healthcare data should look like: credentialed access, an IRB-approved waiver of consent with documented HIPAA Safe Harbor de-identification, a peer-reviewed Scientific Data publication, a full data dictionary, and reproducible cohort definitions. Two procurement-relevant caveats apply, and every healthcare-AI buyer should know them. Open methodology, signed result.
RAG profile: 94 · classification profile: 91 · evaluation profile: 93
What we audited
| Dataset | MIMIC-IV (Medical Information Mart for Intensive Care) |
|---|---|
| Size | 315,000 patients · 454,000 hospital admissions · 73,000 ICU stays · 425,000 ED visits · ~2.3B chart events |
| Modality | Structured clinical (vitals, labs, meds, ICU events) + clinical free-text notes + radiology reports |
| Source institution | Beth Israel Deaconess Medical Center, Boston, MA |
| Time window | 2008–2019 (BIDMC admissions during this period) |
| License | PhysioNet Credentialed Health Data License 1.5.0 — credentialed-access only, non-commercial without separate agreement |
| Access prerequisite | CITI Data or Specimens Only Research training + signed DUA + verified credentialed status on PhysioNet |
| Maintainer | MIT Lab for Computational Physiology (Alistair Johnson, Tom Pollard, Roger Mark et al.) |
| Reference | Johnson et al. (2023). MIMIC-IV. Scientific Data, 10(1), 1. |
| Distribution format | CSV (gzipped) + Parquet, fully relational schema with documented foreign keys |
| Size on disk | ~67 GB compressed |
| De-identification | HIPAA Safe Harbor §164.514(b)(2): all 18 identifiers removed or surrogate-shifted, dates shifted within patient, ages capped at 90+ |
The headline finding
MIMIC-IV is what procurement-grade healthcare data looks like when done right. The credentialed-access framework, the published HIPAA Safe Harbor de-identification, the IRB-approved waiver of informed consent paired with a data-use agreement, the methodology published in Scientific Data, the reproducible cohort-definition code, and the multi-decade institutional commitment from MIT and Beth Israel Deaconess Medical Center together produce the highest LQS composite in the audit series to date: Platinum tier, 93/100. For most healthcare-AI procurement use cases, MIMIC-IV is the right anchor dataset to cite in model-risk paperwork.
That said, "Platinum" does not mean "use without further analysis." Two structural features should appear in any model card that cites MIMIC-IV training data — not as flaws but as procurement-relevant context.
- Single-site provenance. All 315K patients are from one urban academic medical center in Boston. BIDMC's patient mix skews Northeast US, predominantly insured under Massachusetts payer plans, with the specific clinical practice patterns of a large teaching hospital affiliated with Harvard Medical School. A model trained on MIMIC-IV and deployed in a community hospital in Texas, a rural clinic in Mississippi, an NHS facility in Manchester, or a community health center serving primarily uninsured patients is being deployed outside its training distribution. The shift is not subtle: drug formulary, billing-code distribution, lab-test ordering patterns, and patient demographics all vary materially. This isn't a flaw in MIMIC-IV; it's a procurement-relevant feature that affects how downstream models should be tested and represented to regulators under 21 CFR 11.10(a) (procedures for ensuring validity) and EU AI Act Article 10(3) (data governance, including representativeness).
- HIPAA Safe Harbor is rigorous but not GDPR-grade. The de-identification methodology applied to MIMIC-IV is the gold standard for US HIPAA compliance. For models intended for European deployment, additional safeguards may be required under GDPR Article 4(5) pseudonymization and Article 9 special-category-data restrictions. Free-text clinical notes are the highest residual risk: while MIT LCP applies a deidentification pipeline to notes specifically, residual indirect identifiers (rare diagnoses, distinctive event sequences, geographic context in narrative) cannot be perfectly removed at scale. The MIMIC team discloses this openly. For procurement in EU jurisdictions, expect to layer additional pseudonymization on top.
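One way to picture the "additional pseudonymization" layer mentioned above is a keyed hash over structured identifiers. The sketch below is a minimal illustration, not the MIMIC team's method and not a statement of what GDPR requires; the function name, key, and identifiers are all hypothetical, and it does nothing for identifiers embedded in free-text notes.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 16) -> str:
    """Map an identifier to a keyed token using HMAC-SHA256.

    The same identifier and key always give the same token, so joins across
    tables still work, but the token cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key and identifiers, for illustration only.
key = b"example-key-held-separately-from-the-data"
token_a = pseudonymize("subject-001", key)
token_b = pseudonymize("subject-001", key)
token_c = pseudonymize("subject-002", key)

print(token_a == token_b)  # same input, same key: tokens match
print(token_a == token_c)  # different input: tokens differ
```

Keeping the key outside the dataset is the design point: whoever holds only the tokens cannot link them back to the original identifiers.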
Why this audit exists. Healthcare-AI procurement teams under SR 11-7 (Federal Reserve model risk), 21 CFR 11.10 (FDA electronic records), HHS §1557 (algorithmic non-discrimination), GDPR Article 9, and increasingly NIST AI RMF 1.0 need an independent rating they can cite. MIMIC-IV is the most-cited credentialed clinical dataset in healthcare AI. Translating its existing documentation into a procurement-shaped artifact — with the same rubric used to score MMLU, FineWeb-Edu, and RedPajama-V2 — lets buyers compare across dataset categories using one framework.
Dimension-by-dimension reasoning
Maintainer reputation — 98
98 / 100. MIT Lab for Computational Physiology under Roger Mark has maintained the MIMIC series since the 1990s. Multi-decade institutional commitment. Beth Israel Deaconess Medical Center IRB engagement is sustained. Public errata and version history are responsive (MIMIC-III → IV → IV.v2 → IV.v3 transitions are documented). The Johnson et al. authorship line is exceptionally credible. This is the highest maintainer score the audit series has ever produced. The 2-point deduction is reserved entirely for the possibility of a future maintenance transition; nothing in the maintainer behavior itself is below the bar.
Completeness — 96
96 / 100. Full data dictionary published. Schema documentation for every table. Patient-level admission journey is reconstructable from the relational structure. Lab values, vital signs, medications, procedures, diagnoses (ICD-9-CM and ICD-10-CM), notes (where available), and ICU events are all present. Missing-data patterns are reported per table in the documentation. The 4-point deduction reflects the inherent incompleteness of any real-world clinical record: some patients have ED-only encounters with no structured ICU data; some admissions lack discharge summaries; some chart events have value-but-not-unit fields.
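The "value-but-not-unit" pattern is straightforward to quantify once a table is loaded. The following is a small sketch under stated assumptions: the column names `valuenum` and `valueuom` follow the MIMIC-IV chart-events schema, while the rows and the `missing_rates` helper are invented for illustration.

```python
from collections import Counter


def missing_rates(rows, columns):
    """Return the fraction of rows in which each column is None or empty."""
    total = len(rows)
    missing = Counter()
    for row in rows:
        for column in columns:
            if row.get(column) in (None, ""):
                missing[column] += 1
    return {c: (missing[c] / total if total else 0.0) for c in columns}


# Toy rows, not real MIMIC-IV data. The third row has a value but no unit.
chart_rows = [
    {"itemid": 1, "valuenum": 98.0, "valueuom": "bpm"},
    {"itemid": 2, "valuenum": 36.8, "valueuom": "C"},
    {"itemid": 3, "valuenum": 7.4, "valueuom": ""},
    {"itemid": 4, "valuenum": 120.0, "valueuom": "mmHg"},
]

rates = missing_rates(chart_rows, ["valuenum", "valueuom"])
print(rates)  # {'valuenum': 0.0, 'valueuom': 0.25}
```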
Format compliance — 95
95 / 100. CSV (gzipped) is the canonical distribution; Parquet versions are published for direct DuckDB and Polars loading. Schema is documented with foreign-key relationships across tables. PostgreSQL and BigQuery setup scripts are maintained by the MIT LCP team. Loads cleanly via pandas, polars, dask, and R's data.table. The 5-point deduction is for the absence of a published OMOP CDM mapping in the canonical release (third-party OMOP mappings exist but are not maintainer-published), which matters for procurement workflows that have standardized on OMOP for cross-dataset comparability.
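A documented foreign-key relationship can be spot-checked after download. This sketch uses only the standard library and in-memory gzipped CSV bytes in place of real files; `subject_id` and `hadm_id` are MIMIC-IV column names, but the rows and both helper functions are hypothetical.

```python
import csv
import gzip
import io


def read_gzipped_csv(raw: bytes):
    """Decompress gzipped CSV bytes and return the rows as dictionaries."""
    with gzip.open(io.BytesIO(raw), mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


def orphaned_keys(child_rows, parent_rows, key):
    """Return key values present in the child table but absent in the parent."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


# Toy tables, not real MIMIC-IV data. Admission 300 points at a missing patient.
patients_gz = gzip.compress(b"subject_id,gender\n1,F\n2,M\n")
admissions_gz = gzip.compress(b"subject_id,hadm_id\n1,100\n2,200\n3,300\n")

patients = read_gzipped_csv(patients_gz)
admissions = read_gzipped_csv(admissions_gz)
orphans = orphaned_keys(admissions, patients, "subject_id")
print(orphans)  # ['3']
```

An empty result on the real tables would be consistent with the documented relationship holding.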
Documentation — 95
95 / 100. Scientific Data paper (Johnson et al., 2023). Detailed cohort-derivation code in the mit-lcp/mimic-code repo. Discussion forum with archived maintainer responses. Per-table documentation on PhysioNet. Datasheet-for-datasets-style coverage is implicit rather than explicit (the paper covers most fields but doesn't use the Gebru et al. datasheet template). 5-point deduction reserved for the absence of a formal datasheet, which is minor since the equivalent content exists in the paper.
Consent + IRB framework — 95
95 / 100. BIDMC IRB approved a waiver of informed consent under HIPAA §164.512(i) for the deidentified data and §164.514(b)(2) Safe Harbor for the release. Documented in the methodology paper. MIT IRB endorsed the secondary distribution. The credentialed-access model (CITI training + signed DUA + verified credentialed user status) operationalizes ongoing consent-equivalent oversight: each researcher attests to the use case, agrees to non-redistribution and non-re-identification, and is auditable by PhysioNet. This is the procurement-grade framework other healthcare datasets should be measured against. 5-point deduction reflects the inherent asymmetry of any waiver-of-consent framework: patients cannot opt out retroactively even if they could in principle have been asked at the time of care.
Validation / record fidelity — 93
93 / 100. The source of truth is the underlying electronic health record itself: lab results, billing codes, medication orders, and vitals as recorded by clinical staff at point of care. Validation is in-context (the record is the ground truth). Residual noise is documented: transcription errors in nursing flowsheet entries, occasional duplicate medication-administration records, and free-text fields with abbreviations that standard medical NLP pipelines may misread. The maintainers publish version-to-version errata for downstream fixes. 7-point deduction reflects the irreducible noise of real clinical record-keeping (which is itself a real-world signal, not a data flaw).
Provenance chain — 92
92 / 100. BIDMC EHR (Epic) → BIDMC clinical data warehouse → MIT LCP secondary-use pipeline → de-identification → PhysioNet distribution. Each hop is documented. The pipeline is reproducible from the published code. Schema mappings between the source EHR and the released schema are documented. Compared to web-crawl-derived datasets where the provenance chain is "Common Crawl → filter → publish" with millions of unique upstream sources, MIMIC-IV's single-institution chain is far more auditable. 8-point deduction reserved for the BIDMC EHR upstream layer itself, which is closed by definition (Epic is proprietary). This is not actionable, but it is a fact procurement audits should note.
De-identification quality — 88
88 / 100. HIPAA Safe Harbor methodology under §164.514(b)(2): all 18 identifiers removed or surrogate-shifted, dates shifted within patient, ages capped at 90+ to avoid small-cell-population re-identification risk. For structured fields, this is the gold-standard methodology. The deduction is concentrated in two places: (a) free-text clinical notes, where automated PHI removal (the MIT LCP pipeline uses a regex + rule-based scrubber for notes) cannot perfectly eliminate every indirect identifier, such as rare disease descriptions, distinctive event sequences, and contextual geographic clues; (b) the de-identification standard is HIPAA-grade, which is the right standard for US deployment but not equivalent to GDPR Article 4(5) pseudonymization for EU deployment. Procurement teams shipping models into EU jurisdictions should layer additional pseudonymization or use a GDPR-grade derivative.
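The two structured-field techniques named above, per-patient date shifting and age capping, can be sketched in a few lines. This is an illustration of the general idea only, not the MIT LCP pipeline; the offsets, patient labels, and function names are hypothetical, and the cap value of 90 is taken from the description in this report.

```python
from datetime import datetime, timedelta

AGE_CAP = 90  # ages above the cap are reported as the cap


def shift_timestamp(timestamp: datetime, offset_days: int) -> datetime:
    """Move a timestamp by a fixed number of days."""
    return timestamp + timedelta(days=offset_days)


def cap_age(age: int, cap: int = AGE_CAP) -> int:
    """Report ages above the cap as the cap itself."""
    return min(age, cap)


# Hypothetical per-patient offsets. A real pipeline would draw them at random
# and keep them secret; one offset per patient preserves within-patient gaps.
offsets = {"patient-a": 40000, "patient-b": 41234}

admit = datetime(2015, 3, 1, 10, 0)
discharge = datetime(2015, 3, 5, 9, 0)
shifted_admit = shift_timestamp(admit, offsets["patient-a"])
shifted_discharge = shift_timestamp(discharge, offsets["patient-a"])

# Length of stay is unchanged by the shift.
print(shifted_discharge - shifted_admit == discharge - admit)  # True
print(cap_age(97), cap_age(45))  # 90 45
```

Because each patient has a different offset, intervals within one patient survive while calendar comparisons across patients do not, which is worth remembering when building time-based cohorts.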
Reproducibility — 88
88 / 100. Cohort-definition code is public (mit-lcp/mimic-code repo). Standard analysis scripts (mortality prediction, length-of-stay, sepsis cohort) are published with the dataset. The de-identification pipeline is described in detail though not fully open-source: the rule base for note scrubbing is partially proprietary to MIT LCP. Anyone with credentialed access can fully reproduce published downstream analyses; the de-identification step itself has a closed component. 12-point deduction reflects this.
License clarity (unusual) — 70
70 / 100. The PhysioNet Credentialed Health Data License 1.5.0 is unambiguous in its terms but procurement-unusual: not an open license, not a commercial license, but a credentialed-access framework specific to healthcare research data. Researchers must complete CITI Data or Specimens Only Research training, attest to a specific use case, sign the data-use agreement, and undergo credentialed-user verification on PhysioNet. Commercial use requires a separate agreement directly with MIT LCP. For procurement teams accustomed to scanning for Apache 2.0 / MIT / CC-BY / proprietary, this falls outside the standard taxonomy. 30-point deduction is not a quality complaint; it reflects that downstream procurement automation has to handle this case specially. The license terms themselves are appropriate for the data category.
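"Handle this case specially" might look like an extra category in whatever license scanner a procurement pipeline already runs. The sketch below is a hypothetical taxonomy, not an existing tool; the class names, the lookup table, and the review rule are all assumptions made for illustration.

```python
from enum import Enum


class LicenseClass(Enum):
    OPEN = "open"
    PROPRIETARY = "proprietary"
    CREDENTIALED = "credentialed-access"
    UNKNOWN = "unknown"


# Hypothetical lookup table. A real scanner would carry a much longer list.
KNOWN_LICENSES = {
    "Apache-2.0": LicenseClass.OPEN,
    "MIT": LicenseClass.OPEN,
    "CC-BY-4.0": LicenseClass.OPEN,
    "PhysioNet Credentialed Health Data License 1.5.0": LicenseClass.CREDENTIALED,
}


def classify_license(name: str) -> LicenseClass:
    """Look up a license name, falling back to UNKNOWN."""
    return KNOWN_LICENSES.get(name, LicenseClass.UNKNOWN)


def needs_manual_review(name: str) -> bool:
    """Credentialed-access and unrecognized licenses go to a human reviewer."""
    return classify_license(name) in {LicenseClass.CREDENTIALED, LicenseClass.UNKNOWN}


print(needs_manual_review("Apache-2.0"))  # False
print(needs_manual_review("PhysioNet Credentialed Health Data License 1.5.0"))  # True
```

Giving credentialed access its own class, rather than folding it into "unknown", lets the pipeline route it to the right checklist (training, DUA, commercial-use agreement) instead of a generic exception queue.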
Population coverage — 60
60 / 100. Single-site Boston urban academic medical center. Patient population skews Northeast US (BIDMC catchment area), predominantly insured under Massachusetts payer plans (Mass General Brigham, BCBS-MA, Medicare, Medicaid), with the specific demographic mix of an Alewife-to-Mattapan catchment that does not represent the US population, much less the world. Time window 2008–2019. For models intended for nationwide US deployment, this is the largest dimension gap. The data is not "biased" in a moral sense; it accurately reflects the population BIDMC serves. The procurement question is whether your downstream deployment population matches.
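Whether a deployment population "matches" can be given a first-pass number by comparing categorical mixes. The sketch below uses total variation distance as one possible measure; the payer categories and every proportion are made-up placeholders, not figures from MIMIC-IV or any real site.

```python
def total_variation_distance(p: dict, q: dict) -> float:
    """Half the sum of absolute differences between two categorical mixes.

    0.0 means identical mixes; 1.0 means no overlap at all.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Invented proportions for illustration only.
training_mix = {"private": 0.50, "medicare": 0.30, "medicaid": 0.15, "uninsured": 0.05}
deployment_mix = {"private": 0.20, "medicare": 0.30, "medicaid": 0.20, "uninsured": 0.30}

distance = total_variation_distance(training_mix, deployment_mix)
print(round(distance, 2))  # 0.3
```

The same function applies to age bands, diagnosis groups, or any other categorical attribute; which threshold counts as "too far" is a judgment the model-risk team has to make for its own use case.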
Cross-site generalisability — 55
55 / 100. BIDMC's clinical practice patterns are not portable. Drug formulary (which brand of an antibiotic is on formulary at BIDMC vs. a community hospital), ordering patterns (which tests are routinely ordered for sepsis workup at a teaching hospital vs. a rural ED), billing-code application (which ICD-10 codes are routinely applied for the same clinical scenario at BIDMC vs. elsewhere), and EHR free-text conventions (BIDMC's documentation templates and shorthand) all create a site-specific signature. Models that perform well on a MIMIC-IV held-out test set often degrade on data from other institutions, a well-documented phenomenon in the clinical-AI literature. For FDA SaMD pre-cert and 510(k) submissions, multi-site external validation is essentially required regardless of MIMIC-IV training performance.
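External validation usually reduces to computing the same metric on an internal held-out set and on data from another site, then reporting the gap. The sketch below computes AUROC by pairwise comparison in plain Python; the labels and scores are toy values chosen to show a drop, not results from any real model or institution.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half. Uses a simple pairwise comparison, which is fine for
    small examples but slow for large datasets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy predictions, invented for illustration.
internal_auc = auroc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])
external_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])

print(internal_auc, external_auc)  # 1.0 0.75
print(internal_auc - external_auc)  # 0.25
```

Reporting both numbers side by side in the model card, rather than the internal figure alone, is what makes the single-site caveat visible to a reviewer.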
Procurement profile — what this means for healthcare-AI buyers
- For US-only model deployment under SR 11-7, 21 CFR 11, HHS §1557: 93 (clinical profile). Strong fit. Cite MIMIC-IV training data with the BIDMC single-site caveat in the model card. Plan multi-site external validation before regulatory submission.
- For EU model deployment under GDPR Article 9 + EU AI Act Annex III: Marginal as-is. Layer additional pseudonymization on free-text notes. Consider a GDPR-grade synthetic derivative for fine-tuning, with MIMIC-IV as the held-out validation set.
- For nationwide US clinical decision support tools: Train on MIMIC-IV, validate externally on at least two additional institutions before deployment. Document the validation results in the model card.
- For commercial fine-tuned clinical LLMs: Negotiate a commercial agreement with MIT LCP. The default PhysioNet license is non-commercial. Many commercial healthcare-AI vendors do this routinely; the process is established.
- For academic research: 94. Excellent fit. The credentialing process exists for a reason and the research community has internalized it. Reproducibility is high.
Comparison to the audit series so far
| Report | Dataset | LQS |
|---|---|---|
| 006 · Healthcare | MIMIC-IV (clinical) | 93 / Platinum |
| 002 · Benchmark | MMLU (LLM eval) | 83 / Gold |
| 003 · Pretraining | RedPajama-V2 (30T tokens) | 81 / Gold |
| 001 · Pretraining | FineWeb-Edu (1.3T tokens) | 73 / Silver |
MIMIC-IV is the first Platinum-tier dataset in the audit series. The structural reason: healthcare research data has spent forty years building the procurement-grade machinery (IRB, HIPAA, DUAs, credentialed access, version errata) that web-scale pretraining corpora are still figuring out. None of the procurement features that make MIMIC-IV Platinum are novel — they're standard practice in academic medicine. They're new to ML.
Methodology
This audit was scored under LQS v3.1 with the clinical-data adapter — the same 19-dimension rubric used for the other audits, weighted to surface healthcare-procurement-relevant dimensions (consent + IRB framework, de-identification quality, cross-site generalisability) more heavily. Every dimension maps to a documented rubric in the methodology preprint (DOI 10.5281/zenodo.20278981).
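To make the role of weighting concrete, the sketch below takes the twelve dimension scores discussed in this report and computes their plain average. The weights used by the clinical-data adapter are not reproduced here, so the `weighted_mean` helper is shown only with equal weights; it is an illustration of the arithmetic, not the LQS scoring code.

```python
def weighted_mean(scores: dict, weights: dict) -> float:
    """Weighted average of scores; weights need not sum to one."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Dimension scores as reported in this audit.
dimension_scores = {
    "Maintainer reputation": 98,
    "Completeness": 96,
    "Format compliance": 95,
    "Documentation": 95,
    "Consent + IRB framework": 95,
    "Validation / record fidelity": 93,
    "Provenance chain": 92,
    "De-identification quality": 88,
    "Reproducibility": 88,
    "License clarity": 70,
    "Population coverage": 60,
    "Cross-site generalisability": 55,
}

# Equal weights, used here only because the real weights are not given.
equal_weights = {name: 1.0 for name in dimension_scores}
unweighted = weighted_mean(dimension_scores, equal_weights)
print(round(unweighted, 1))  # 85.4
```

The plain average of these twelve scores is about 85.4, so the published composite of 93 depends on the adapter's weighting and on the rubric dimensions not itemized in this report; readers who want to reproduce it need the weights from the methodology preprint.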
The 7-oracle consensus pass was not run for this report — MIMIC-IV is gated by credentialed access, which the LabelSets evaluation infrastructure does not have for this audit. The audit is metadata- and publication-based, same lens used for the other reports. For maintainers wanting full oracle-cert results on a representative MIMIC-IV slice (via existing credentialed researchers), contact us.
Recourse. If you maintain MIMIC-IV or are otherwise authorized to speak for MIT LCP and believe any score here is wrong, the recourse process is documented in the methodology preprint §7. File an issue at the public-audit repo with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable cert hash; corrections are issued as new versions.
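The idea of an immutable hash per published score, with corrections issued as new versions, can be illustrated with a content hash over a canonical serialization. The actual cert-hash scheme is not described in this report; the record fields, the `cert_hash` function, and the corrected score below are all hypothetical.

```python
import hashlib
import json


def cert_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical score records, for illustration only.
v1_0 = {"dataset": "MIMIC-IV", "report": "006", "composite": 93, "version": "1.0"}
v1_0_reordered = {"version": "1.0", "composite": 93, "report": "006", "dataset": "MIMIC-IV"}
v1_1 = dict(v1_0, composite=92, version="1.1")  # an invented correction

print(cert_hash(v1_0) == cert_hash(v1_0_reordered))  # True: same content
print(cert_hash(v1_0) == cert_hash(v1_1))  # False: a correction gets a new hash
```

Under a scheme like this, a changed score cannot keep its old hash, which is what makes "corrections are issued as new versions" checkable by a third party.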
What this audit doesn't claim
- It does not claim MIMIC-IV is the right dataset for every healthcare-AI use case. Cardiology-specific, oncology-specific, dermatology-imaging, and pediatric use cases require domain-specific data; MIMIC-IV is general adult ICU/ED. Use the right corpus.
- It does not claim Platinum means "no further review." Platinum means the underlying data is procurement-grade. Model-risk teams still need to validate downstream, document the model card, and follow domain-specific regulatory pathways (FDA SaMD, CMS clinical workflow integration, state medical board attestations, etc.).
- It does not endorse any specific commercial model trained on MIMIC-IV. Training corpus quality is one input to model quality. Many models trained on high-LQS data are bad models for reasons independent of training data.
- It does not adjudicate the GDPR-equivalence question. US HIPAA Safe Harbor is not legally equivalent to GDPR pseudonymization. Whether a specific MIMIC-IV use case is GDPR-compliant for EU deployment is a question for counsel.
What's next
This is Report 006 in the public-audit series and the first under the healthcare procurement lens. Coming up:
- Report 004 — The Pile (EleutherAI). Foundational. Known Books3 / copyright surface. The most-litigated open pretraining corpus. Queued.
- Report 005 — HumanEval (OpenAI). Code benchmark. Different contamination profile. Queued.
- Report 007 — A radiology imaging corpus (TBD — likely CheXpert or MIMIC-CXR). First imaging-modality audit under the FDA SaMD lens.
- Contamination Report 002. Pretraining-corpus contamination scan against the 40+ benchmark fingerprints. The artifact that backs every "contamination cleanliness" score in the audits above.