Public LQS Audit · Report 006 · Healthcare

MIMIC-IV — a procurement-grade audit under the FDA 21 CFR 11 / HIPAA / §1557 lens.

Composite 93 / 100. Platinum tier, the first in the audit series. MIT Lab for Computational Physiology's clinical dataset is structurally what procurement-grade healthcare data should look like: credentialed access, an IRB-approved waiver of informed consent with documented HIPAA Safe Harbor de-identification, a peer-reviewed Scientific Data (Nature Portfolio) paper, a full data dictionary, and reproducible cohort definitions. It comes with two procurement-relevant caveats every healthcare-AI buyer should know about. Open methodology, signed result.

Published May 19, 2026 · LabelSets Research · Author: Alex Adrion
93 / 100 · Platinum
LQS v3.1 composite · clinical profile
RAG profile: 94 · classification profile: 91 · evaluation profile: 93

Maintainer reputation: 98
Completeness: 96
Format compliance: 95
Documentation: 95
Consent + IRB framework: 95
Validation / record fidelity: 93
Provenance chain: 92
De-identification quality: 88
Reproducibility: 88
License clarity (unusual): 70
Population coverage: 60
Cross-site generalisability: 55

What we audited

Dataset: MIMIC-IV (Medical Information Mart for Intensive Care)
Size: 315,000 patients · 454,000 hospital admissions · 73,000 ICU stays · 425,000 ED visits · ~2.3B chart events
Modality: Structured clinical (vitals, labs, meds, ICU events) + clinical free-text notes + radiology reports
Source institution: Beth Israel Deaconess Medical Center, Boston, MA
Time window: 2008–2019 (BIDMC admissions during this period)
License: PhysioNet Credentialed Health Data License 1.5.0 (credentialed-access only; non-commercial without separate agreement)
Access prerequisite: CITI Data or Specimens Only Research training + signed DUA + verified credentialed status on PhysioNet
Maintainer: MIT Lab for Computational Physiology (Alistair Johnson, Tom Pollard, Roger Mark et al.)
Reference: Johnson et al. (2023). MIMIC-IV. Scientific Data, 10(1), 1.
Distribution format: CSV (gzipped) + Parquet, fully relational schema with documented foreign keys
Size on disk: ~67 GB compressed
De-identification: HIPAA Safe Harbor §164.514(b)(2): all 18 identifiers removed or surrogate-shifted, dates shifted within patient, ages capped at 90+

The headline finding

MIMIC-IV is what procurement-grade healthcare data looks like when done right. The credentialed-access framework, the published HIPAA Safe Harbor de-identification, the IRB-approved waiver of informed consent, the signed data-use agreement, the methodology published in Scientific Data (Nature Portfolio), the reproducible cohort-definition code, and the multi-decade institutional commitment from MIT and Beth Israel Deaconess Medical Center together produce the highest LQS composite in the audit series to date: Platinum tier, 93/100. For most healthcare-AI procurement use cases, MIMIC-IV is the right anchor dataset to cite in model-risk paperwork.

That said, "Platinum" does not mean "use without further analysis." Two structural features should appear in any model card that cites MIMIC-IV training data — not as flaws but as procurement-relevant context.

  1. Single-site provenance. All 315K patients are from one urban academic medical center in Boston. BIDMC's patient mix skews Northeast US, predominantly insured under Massachusetts payer plans, with the specific clinical practice patterns of a large teaching hospital affiliated with Harvard Medical School. A model trained on MIMIC-IV and deployed in a community hospital in Texas, a rural clinic in Mississippi, an NHS facility in Manchester, or a community health center serving primarily uninsured patients is being deployed outside its training distribution. The shift is not subtle — drug formulary, billing-code distribution, lab-test ordering patterns, and patient demographics all vary materially. This isn't a flaw in MIMIC-IV; it's a procurement-relevant feature that affects how downstream models should be tested and represented to regulators under 21 CFR 11.10(a) (procedures for ensuring validity) and the EU AI Act Article 10 §3 (data governance, including representativeness).
  2. HIPAA Safe Harbor is rigorous but not GDPR-grade. The de-identification methodology applied to MIMIC-IV is the gold standard for US HIPAA compliance. For models intended for European deployment, additional safeguards may be required under GDPR Article 4(5) pseudonymization and Article 9 special-category-data restrictions. Free-text clinical notes are the highest residual risk: while MIT LCP applies a deidentification pipeline to notes specifically, residual indirect identifiers (rare diagnoses, distinctive event sequences, geographic context in narrative) cannot be perfectly removed at scale. The MIMIC team discloses this openly. For procurement in EU jurisdictions, expect to layer additional pseudonymization on top.
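The residual-risk point in item 2 can be illustrated with a toy scrubber. This is a minimal sketch under stated assumptions: the patterns, the `scrub` helper, and the example note are all invented for illustration and are not the MIT LCP pipeline. It shows how pattern-based removal handles direct identifiers while indirect ones (a rare diagnosis, a workplace) pass through untouched.

```python
import re

# Hypothetical patterns for a few direct identifiers.
# A production scrubber would carry far more rules than this.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b"),
}


def scrub(note: str) -> str:
    """Replace every pattern match with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note


# Invented example note, not taken from any dataset.
note = (
    "Seen 03/14/2015, MRN: 448812, callback 617-555-0100. "
    "Only known local case of a rare enzyme deficiency; "
    "works at the harbor ferry terminal."
)
scrubbed = scrub(note)
print(scrubbed)
```

The direct identifiers are replaced, but the rare-diagnosis and workplace phrases survive, which is the kind of indirect identifier an additional pseudonymization layer would need to address.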

Why this audit exists. Healthcare-AI procurement teams under SR 11-7 (Federal Reserve model risk), 21 CFR 11.10 (FDA electronic records), HHS §1557 (algorithmic non-discrimination), GDPR Article 9, and increasingly NIST AI RMF 1.0 need an independent rating they can cite. MIMIC-IV is the most-cited credentialed clinical dataset in healthcare AI. Translating its existing documentation into a procurement-shaped artifact — with the same rubric used to score MMLU, FineWeb-Edu, and RedPajama-V2 — lets buyers compare across dataset categories using one framework.

Dimension-by-dimension reasoning

Maintainer reputation — 98

MIT Lab for Computational Physiology under Roger Mark has maintained the MIMIC series since the 1990s. Multi-decade institutional commitment. Beth Israel Deaconess Medical Center IRB engagement is sustained. Public errata and version history are responsive (MIMIC-III → IV → IV.v2 → IV.v3 transitions are documented). The Johnson et al. authorship line is exceptionally credible. This is the highest maintainer score the audit series has ever produced. The 2-point deduction is reserved entirely for the possibility of a future maintenance transition; nothing in the maintainer behavior itself is below the bar.

Sources: MIT LCP publication history · PhysioNet errata log · IRB documentation in Section 2 of the Johnson et al. 2023 paper

Completeness — 96

Full data dictionary published. Schema documentation for every table. Patient-level admission journey is reconstructable from the relational structure. Lab values, vital signs, medications, procedures, diagnoses (ICD-9-CM and ICD-10-CM), notes (where available), and ICU events are all present. Missing-data patterns are reported per table in the documentation. The 4-point deduction reflects the inherent incompleteness of any real-world clinical record: some patients have ED-only encounters with no structured ICU data; some admissions lack discharge summaries; some chart events have value-but-not-unit fields.
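A buyer who wants to check the missing-data patterns described above could start from something like the following sketch. The rows are invented, and the column names (`value`, `valueuom`) are assumptions chosen to resemble a chart-events table; nothing here reads the real dataset.

```python
# Invented rows shaped loosely like chart events; column names are assumptions.
rows = [
    {"itemid": "1", "value": "98.6", "valueuom": "F"},
    {"itemid": "2", "value": "120", "valueuom": ""},
    {"itemid": "3", "value": "", "valueuom": ""},
    {"itemid": "4", "value": "7.4", "valueuom": ""},
]


def missing_rate(rows, column):
    """Fraction of rows where the column is empty or absent."""
    return sum(1 for r in rows if not r.get(column)) / len(rows)


def value_without_unit(rows):
    """Rows that carry a value but no unit of measure."""
    return [r for r in rows if r.get("value") and not r.get("valueuom")]


rates = {col: missing_rate(rows, col) for col in ("value", "valueuom")}
orphans = value_without_unit(rows)
print(rates, len(orphans))
```

Separating "value missing" from "value present but unit missing" matters because the second case is still usable if the unit can be inferred from the item definition.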

Sources: MIMIC-IV data dictionary · Johnson et al. 2023 Section 3 · PhysioNet table-level documentation

Format compliance — 95

CSV (gzipped) is the canonical distribution; Parquet versions are published for direct DuckDB and Polars loading. Schema is documented with foreign-key relationships across tables. PostgreSQL and BigQuery setup scripts are maintained by the MIT LCP team. Loads cleanly via pandas, polars, dask, R's data.table. The 5-point deduction is for the absence of a published OMOP CDM mapping in the canonical release (third-party OMOP mappings exist but are not maintainer-published) — relevant for procurement workflows that have standardized on OMOP for cross-dataset comparability.
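As a rough illustration of what a load test plus foreign-key check looks like, here is a standard-library sketch. It writes two tiny synthetic gzipped CSVs to a temporary directory and checks that every child key has a parent. The file names, the `subject_id` / `hadm_id` columns, and the helper functions are assumptions for the example; this is not the independent load test cited in the sources.

```python
import csv
import gzip
import os
import tempfile


def write_gz_csv(path, header, rows):
    """Write rows to a gzipped CSV file."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_gz_csv(path):
    """Read a gzipped CSV into a list of dicts keyed by header."""
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.DictReader(f))


def orphan_keys(child_rows, parent_rows, key):
    """Key values present in the child table with no matching parent row."""
    parent_keys = {r[key] for r in parent_rows}
    return sorted({r[key] for r in child_rows} - parent_keys)


with tempfile.TemporaryDirectory() as tmp:
    patients_path = os.path.join(tmp, "patients.csv.gz")
    admissions_path = os.path.join(tmp, "admissions.csv.gz")
    # Synthetic data: admission 202 points at a patient that does not exist.
    write_gz_csv(patients_path, ["subject_id"], [["10"], ["11"]])
    write_gz_csv(
        admissions_path,
        ["hadm_id", "subject_id"],
        [["200", "10"], ["201", "11"], ["202", "12"]],
    )
    patients = read_gz_csv(patients_path)
    admissions = read_gz_csv(admissions_path)

orphans = orphan_keys(admissions, patients, "subject_id")
print(orphans)
```

On a well-formed relational release the orphan list should be empty for every documented foreign key; a non-empty result is worth recording in the procurement file.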

Sources: Build scripts repo (mit-lcp/mimic-iv) · third-party OMOP mappings · independent load test

Documentation — 95

Nature Scientific Data paper (2023). Detailed cohort-derivation code in mit-lcp/mimic-code repo. Discussion forum with archived maintainer responses. Per-table documentation on PhysioNet. Datasheet-for-datasets-style coverage is implicit rather than explicit (the Nature paper covers most fields but doesn't use the Gebru et al. datasheet template). 5-point deduction reserved for the absence of a formal datasheet — minor, since the equivalent content exists in the paper.

Sources: Nature paper 10.1038/s41597-022-01899-x · mit-lcp/mimic-code repo · PhysioNet discussion archive

Consent + IRB framework — 95

The BIDMC IRB approved a waiver of informed consent under HIPAA §164.512(i), and the release is de-identified under the §164.514(b)(2) Safe Harbor standard. Both are documented in the methodology paper. MIT IRB endorsed the secondary distribution. The credentialed-access model (CITI training + signed DUA + verified credentialed user status) operationalizes ongoing consent-equivalent oversight: each researcher attests to the use case, agrees to non-redistribution and non-re-identification, and is auditable by PhysioNet. This is the procurement-grade framework other healthcare datasets should be measured against. The 5-point deduction reflects the inherent asymmetry of any waiver-of-consent framework: patients cannot opt out retroactively, even though they could in principle have been asked at the time of care.

Sources: Johnson et al. 2023 Section 2 (Ethical considerations) · BIDMC IRB protocol number cited in paper · PhysioNet credentialing process documentation

Validation / record fidelity — 93

Source-of-truth is the underlying electronic health record itself — lab results, billing codes, medication orders, vitals as recorded by clinical staff at point of care. Validation is in-context (the record is the ground truth). Residual noise is documented: transcription errors in nursing flowsheet entries, occasional duplicate medication-administration records, free-text fields with abbreviations the standard medical NLP pipelines may misread. The maintainers publish version-to-version errata for downstream fixes. 7-point deduction reflects the irreducible noise of real clinical record-keeping (which is itself a real-world signal, not a data flaw).

Sources: PhysioNet errata log · third-party validation studies cited in Johnson et al. 2023 Section 4 · LQS validation-health rubric for real-world clinical data

Provenance chain — 92

BIDMC source systems (a custom hospital-wide EHR and an ICU clinical information system, as described in Johnson et al. 2023) → BIDMC clinical data warehouse → MIT LCP secondary-use pipeline → de-identification → PhysioNet distribution. Each hop is documented. The pipeline is reproducible from the published code. Schema mappings between the source EHR and the released schema are documented. Compared to web-crawl-derived datasets where the provenance chain is "Common Crawl → filter → publish" with millions of unique upstream sources, MIMIC-IV's single-institution chain is far more auditable. The 8-point deduction is reserved for the BIDMC source-system layer itself, which is closed (the hospital's internal EHR configuration is not publicly reproducible). This is not actionable, but it is a fact procurement audits should note.

Sources: MIT LCP pipeline documentation · mit-lcp/mimic-code repository · BIDMC source EHR configuration (closed, not publicly reproducible)

De-identification quality — 88

HIPAA Safe Harbor methodology under §164.514(b)(2): all 18 identifiers removed or surrogate-shifted, dates shifted within patient, ages capped at 90+ to avoid small-cell-population re-identification risk. For structured fields, this is the gold-standard methodology. The deduction is concentrated in two places: (a) free-text clinical notes, where automated PHI removal (the MIT LCP pipeline uses a regex + rule-based scrubber for notes) cannot perfectly eliminate every indirect identifier — rare disease descriptions, distinctive event sequences, contextual geographic clues; (b) the de-identification standard is HIPAA-grade, which is the right standard for US deployment but not equivalent to GDPR Article 4(5) pseudonymization for EU deployment. Procurement teams shipping models into EU jurisdictions should layer additional pseudonymization or use a GDPR-grade derivative.
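The two structured-field techniques named above, per-patient date shifting and age capping, can be sketched as follows. This is an illustration under assumptions, not the maintainers' implementation: the offset range, the seeding scheme, and the reading of "capped at 90+" as `min(age, 90)` are all hypothetical choices made for the example.

```python
import random
from datetime import date, timedelta


def shift_for(subject_id, seed=0):
    """Deterministic per-patient offset (hypothetical range: 100 to 200 years)."""
    # A string seed gives the same offset on every run for the same patient.
    rng = random.Random(f"{seed}:{subject_id}")
    return timedelta(days=rng.randint(365 * 100, 365 * 200))


def shift_date(subject_id, d, seed=0):
    """Shift a date by the patient's offset; intervals within a patient are preserved."""
    return d + shift_for(subject_id, seed)


def cap_age(age, cap=90):
    """Report any age at or above the cap as the cap itself (assumed interpretation)."""
    return min(age, cap)


admit = date(2012, 3, 1)
discharge = date(2012, 3, 9)
shifted_admit = shift_date(42, admit)
shifted_discharge = shift_date(42, discharge)

# Length of stay survives the shift even though the calendar dates do not.
print(shifted_discharge - shifted_admit, cap_age(97))
```

Because the offset is constant within a patient, within-patient intervals such as length of stay stay analyzable, while cross-patient calendar alignment (for example, seasonal effects) is deliberately destroyed.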

Sources: HIPAA §164.514(b)(2) · Johnson et al. 2023 Section 2.3 · GDPR Article 4(5) and Article 9 · LQS PII-residual-risk rubric for clinical free-text

Reproducibility — 88

Cohort-definition code is public (mit-lcp/mimic-code repo). Standard analysis scripts (mortality prediction, length-of-stay, sepsis cohort) are published with the dataset. The de-identification pipeline is described in detail though not fully open-source — the rule-base for note scrubbing is partially proprietary to MIT LCP. Anyone with credentialed access can fully reproduce published downstream analyses; the de-identification step itself has a closed component. 12-point deduction reflects this.

Sources: mit-lcp/mimic-code repo · published cohort analyses · de-identification pipeline description in Johnson et al. 2023 Section 2.3

License clarity (unusual) — 70

The PhysioNet Credentialed Health Data License 1.5.0 is unambiguous in its terms but procurement-unusual: not an open license, not a commercial license, but a credentialed-access framework specific to healthcare research data. Researchers must complete CITI Data or Specimens Only Research training, attest to a specific use case, sign the data-use agreement, and undergo credentialed-user verification on PhysioNet. Commercial use requires a separate agreement directly with MIT LCP. For procurement teams accustomed to scanning for Apache 2.0 / MIT / CC-BY / proprietary, this falls outside the standard taxonomy. 30-point deduction is not a quality complaint — it reflects that downstream procurement automation has to handle this case specially. The license terms themselves are appropriate for the data category.
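One way procurement automation might handle this case specially is sketched below. The category names, the `LicenseDecision` type, and the matching rule are hypothetical; the prerequisite list is taken from the access steps described above.

```python
from dataclasses import dataclass

# Hypothetical set of licenses an automated scanner already recognizes as open.
OPEN_LICENSES = {"apache-2.0", "mit", "cc-by-4.0"}


@dataclass
class LicenseDecision:
    category: str
    needs_manual_review: bool
    prerequisites: tuple


def classify_license(name: str) -> LicenseDecision:
    """Map a license name to a procurement category (illustrative rules only)."""
    key = name.strip().lower()
    if key in OPEN_LICENSES:
        return LicenseDecision("open", False, ())
    if "credentialed" in key:
        # Credentialed-access data: route to a human with the access steps attached.
        return LicenseDecision(
            "credentialed-access",
            True,
            ("CITI training", "signed DUA", "credentialed-user verification"),
        )
    return LicenseDecision("unknown", True, ())


decision = classify_license("PhysioNet Credentialed Health Data License 1.5.0")
print(decision)
```

The design choice is to give credentialed-access licenses their own category rather than folding them into "proprietary" or "unknown", so the prerequisites travel with the decision.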

Sources: PhysioNet license 1.5.0 text · MIT LCP commercial-use process · LQS license-clarity rubric for credentialed health data

Population coverage — 60

Single-site Boston urban academic medical center. Patient population skews Northeast US (BIDMC catchment area), predominantly insured under Massachusetts payer plans (Mass General Brigham, BCBS-MA, Medicare, Medicaid), with the specific demographic mix of an Alewife-to-Mattapan catchment that does not represent the US population, much less the world. Time window 2008–2019. For models intended for nationwide US deployment, this is the largest dimension gap. The data is not "biased" in a moral sense — it accurately reflects the population BIDMC serves. The procurement question is whether your downstream deployment population matches.
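The question of whether a deployment population matches can be made quantitative with a simple distance between category shares. The sketch below uses total variation distance; the payer-mix numbers are invented for illustration and do not describe BIDMC or any real site.

```python
def total_variation(p, q):
    """Total variation distance between two categorical distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented payer-mix shares, for illustration only.
training = {"commercial": 0.45, "medicare": 0.35, "medicaid": 0.15, "uninsured": 0.05}
deployment = {"commercial": 0.20, "medicare": 0.25, "medicaid": 0.25, "uninsured": 0.30}

tvd = total_variation(training, deployment)
print(round(tvd, 2))
```

The same function can be applied to age bands, race and ethnicity categories, or admission types; a model card can then report the distance per attribute rather than a general statement that populations differ.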

Sources: BIDMC catchment-area documentation · Johnson et al. 2023 Section 4.1 (limitations) · LQS population-coverage rubric for clinical data

Cross-site generalisability — 55

BIDMC's clinical practice patterns are not portable. Drug formulary (which brand of an antibiotic is on formulary at BIDMC vs. a community hospital), ordering patterns (which tests are routinely ordered for sepsis workup at a teaching hospital vs. a rural ED), billing-code application (which ICD-10 codes are routinely applied for the same clinical scenario at BIDMC vs. elsewhere), and EHR free-text conventions (BIDMC's documentation templates and shorthand) all create a site-specific signature. Models that perform well on a MIMIC-IV held-out test set often degrade on data from other institutions; this is a well-documented phenomenon in the clinical-AI literature. For FDA SaMD pre-cert and 510(k) submissions, multi-site external validation is essentially required regardless of MIMIC-IV training performance.
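The internal-versus-external comparison described above reduces to computing the same metric on two test sets and reporting the gap. A minimal sketch, assuming AUROC as the metric and using invented labels and scores (no real model or site is represented):

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented predictions: a held-out split from the training site...
internal_labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
# ...and a hypothetical external site where the ranking is less clean.
external_labels = [1, 1, 1, 0, 0, 0]
external_scores = [0.9, 0.5, 0.3, 0.6, 0.4, 0.2]

internal_auc = auroc(internal_labels, internal_scores)
external_auc = auroc(external_labels, external_scores)
gap = internal_auc - external_auc
print(internal_auc, round(external_auc, 3), round(gap, 3))
```

Reporting the gap alongside both absolute values keeps the model card honest about how much of the headline performance is site-specific.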

Sources: published cross-site validation literature (e.g. Sendak et al., Wong et al.) · FDA SaMD guidance on external validation · LQS generalisability rubric

Procurement profile — what this means for healthcare-AI buyers

Comparison to the audit series so far

Report | Dataset | LQS
006 · Healthcare | MIMIC-IV (clinical) | 93 / Platinum
002 · Benchmark | MMLU (LLM eval) | 83 / Gold
003 · Pretraining | RedPajama-V2 (30T tokens) | 81 / Gold
001 · Pretraining | FineWeb-Edu (1.3T tokens) | 73 / Silver

MIMIC-IV is the first Platinum-tier dataset in the audit series. The structural reason: healthcare research data has spent forty years building the procurement-grade machinery (IRB, HIPAA, DUAs, credentialed access, version errata) that web-scale pretraining corpora are still figuring out. None of the procurement features that make MIMIC-IV Platinum are novel — they're standard practice in academic medicine. They're new to ML.

Methodology

This audit was scored under LQS v3.1 with the clinical-data adapter — the same 19-dimension rubric used for the other audits, weighted to surface healthcare-procurement-relevant dimensions (consent + IRB framework, de-identification quality, cross-site generalisability) more heavily. Every dimension maps to a documented rubric in the methodology preprint (DOI 10.5281/zenodo.20278981).
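To make the role of weighting concrete, the sketch below computes a weighted composite over the twelve dimension scores listed in the score card. The clinical adapter's weights are not stated in this report, so the `weights` argument is a hypothetical placeholder. With uniform weights the result is the plain mean of those twelve scores, about 85.4, so the published composite of 93 depends on the adapter's actual weights and on rubric dimensions not itemized here, neither of which this sketch reproduces.

```python
# Dimension scores as listed in this report's score card.
scores = {
    "maintainer_reputation": 98,
    "completeness": 96,
    "format_compliance": 95,
    "documentation": 95,
    "consent_irb": 95,
    "validation": 93,
    "provenance": 92,
    "deidentification": 88,
    "reproducibility": 88,
    "license_clarity": 70,
    "population_coverage": 60,
    "cross_site_generalisability": 55,
}


def weighted_composite(scores, weights):
    """Weighted mean of dimension scores; weights need not sum to 1."""
    total = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total


# Hypothetical uniform weights; the real LQS weights are defined in the methodology preprint.
uniform = {k: 1.0 for k in scores}
unweighted = sum(scores.values()) / len(scores)
print(round(unweighted, 1), round(weighted_composite(scores, uniform), 1))
```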

The 7-oracle consensus pass was not run for this report: MIMIC-IV is gated by credentialed access, which the LabelSets evaluation infrastructure did not have for this audit. The audit is metadata- and publication-based, the same lens used for the other reports. For maintainers wanting full oracle-cert results on a representative MIMIC-IV slice (via existing credentialed researchers), contact us.

Recourse. If you maintain MIMIC-IV or are otherwise authorized to speak for MIT LCP and believe any score here is wrong, the recourse process is documented in the methodology preprint §7. File an issue at the public-audit repo with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable cert hash; corrections are issued as new versions.
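The idea of an immutable hash per published version can be illustrated with a content hash over a canonical serialization. The actual LQS cert scheme is not described in this report; the record fields, the canonical-JSON choice, and the use of SHA-256 below are assumptions made for the sketch.

```python
import hashlib
import json


def cert_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical certificate records; field names are illustrative.
v1 = {"dataset": "MIMIC-IV", "report": "006", "version": "1.0", "composite": 93}
v1_1 = dict(v1, version="1.1")

print(cert_hash(v1))
print(cert_hash(v1_1))
```

Under a scheme like this, a correction cannot silently overwrite an earlier result: any change to the record produces a different hash, which is why corrections have to be issued as new versions.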

What this audit doesn't claim

What's next

This is Report 006 in the public-audit series and the first under the healthcare procurement lens.
