HumanEval — a procurement-grade audit of the code benchmark every model card cites.
Composite 83 / 100. Gold tier. 164 hand-crafted Python problems. The cleanest construction of any LLM benchmark we've audited (every problem is unit-tested for correctness rather than vote-tallied), but the dataset's own design choices make it inadequate as a sole capability signal. Two procurement-relevant ceilings: a sample-size confidence interval wider than the gaps between leading models, and a contamination surface that grows with every new release of The Stack and every new GitHub crawl. Open methodology, signed result.
evaluation profile: 80 · classification profile: 82 · RAG profile: 85
What we audited
| Dataset | openai/openai_humaneval |
|---|---|
| Size | 164 hand-crafted Python problems |
| Structure | Each problem: function signature + docstring + reference solution + unit test suite (avg. 7.7 tests per problem) |
| Modality | Code generation (Python only) |
| License | MIT — permissive, commercial-friendly, unambiguous |
| Maintainer | OpenAI (Mark Chen, Jerry Tworek, Heewoo Jun et al.) |
| Paper | arXiv:2107.03374 (Codex paper, 2021) |
| Year released | 2021 |
| Evaluation metric | pass@k — fraction of problems with at least one passing solution in k samples |
| Distribution format | JSONL via HuggingFace, also available in original GitHub release |
| Citations (Google Scholar) | 4,500+ — the standard code-generation benchmark in LLM research |
The headline finding
HumanEval was constructed correctly. Every problem has a reference solution and an executable unit-test suite, so "correctness" means "the generated code passes the tests" — not "the generated code matches the reference exactly." This is procurement-grade construction: the validation function is mechanical, not crowdsourced. It's why we score test-driven validation at 94 and label noise as effectively a non-issue. Compare this to MMLU, where label noise was the second-biggest concern.
But construction-quality alone doesn't determine procurement-fitness. Two structural features mean HumanEval's reported scores carry less signal than the precision of the numbers implies.
- 164 problems is a small sample. A 95% binomial confidence interval at 80% pass@1 is roughly ±6.1 percentage points (Wilson interval). A model reported at 87% and a model reported at 84% have overlapping 95% intervals; the gap is smaller than the noise floor introduced by the sample size. Procurement teams comparing models on the basis of "Model A scored 91% on HumanEval, Model B scored 88%" are reading more signal into the numbers than the sample supports. The benchmark itself does not control for this — it simply reports pass@1, pass@10, pass@100 without intervals. We score sample-size adequacy at 42.
- Contamination is asymmetric and growing. Code benchmarks have a unique contamination profile: the reference solutions get scraped into downstream training corpora like The Stack, GitHub crawls, code-generation datasets, and various RLHF preference sets. The LabelSets contamination scanner finds HumanEval in 11 of 80 scanned post-training datasets at non-zero similarity (see the report for specifics). Models trained on any post-2021 code corpus have plausibly seen substantial fractions of HumanEval during training, and the standard pre-train-then-eval workflow used for code LLMs has no built-in decontamination pass. We score contamination cleanliness at 38.
Why this audit exists. Code models cited in regulated procurement (FDA SaMD for medical-device software, financial-services model risk under SR 11-7, anti-discrimination review under §1557) commonly cite HumanEval as their capability evidence. The benchmark is fit-for-purpose for academic ranking but unfit-for-purpose as the sole capability signal in a model-risk filing. The procurement-grade approach: cite HumanEval + at least one held-out benchmark + a documented contamination check. This audit makes the gap mechanical instead of folkloric.
Dimension-by-dimension reasoning
Format compliance — 95
95 / 100. Clean JSONL via HuggingFace, drop-in compatible with the datasets library. The original release on GitHub is canonical. Every problem record carries task_id, prompt, canonical_solution, test, and entry_point. EleutherAI's lm-eval-harness, BigCode's harness, and OpenAI's own evals all consume HumanEval without custom code. The 5-point deduction is for the JSONL-only canonical release (no parquet variant ships from OpenAI; HF mirrors offer one).
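As a rough illustration of the record layout described above, the following sketch checks that one JSONL line carries the five fields. The field names come from the audit text; the sample record and the `missing_fields` helper are hypothetical and are not taken from the dataset or from any harness.

```python
import json

# Field names as listed in the audit text above.
REQUIRED_FIELDS = ("task_id", "prompt", "canonical_solution", "test", "entry_point")


def missing_fields(jsonl_line: str) -> list[str]:
    """Return the required fields that are absent from one JSONL record."""
    record = json.loads(jsonl_line)
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical record for illustration only; not an actual HumanEval problem.
sample_line = json.dumps({
    "task_id": "Example/0",
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
    "entry_point": "add",
})

print(missing_fields(sample_line))          # [] means the record is complete
print(missing_fields('{"task_id": "x"}'))   # lists the four absent fields
```

A check like this is only a format gate; it says nothing about whether the tests or the reference solution are correct.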
License clarity — 95
95 / 100. MIT license. Unambiguous, commercial-friendly, no inherited license terms from upstream sources (every problem and test is OpenAI-authored, not scraped). This is the model other benchmark releases should follow. The 5-point deduction reflects the absence of an explicit ToU addendum addressing the contamination question: should downstream users disclose if their training corpus may have included HumanEval problems?
Test-driven validation — 94
94 / 100. This is HumanEval's structural strength. Every problem ships with executable unit tests (avg. 7.7 tests per problem) authored alongside the reference solution. Correctness is determined by passing the tests, not by matching a single canonical solution. This eliminates the label-noise concern that dominates the MMLU audit: there's no "is this the right answer?" disagreement when the answer is "does this code pass these tests?" The 6-point deduction reflects the rare cases where tests are themselves under-specified (a test that doesn't cover an edge case the canonical solution handles, or vice versa).
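A minimal sketch of what "passes the tests" means mechanically, assuming the layout used by the published harness, where the test field defines a `check(candidate)` function and the program is assembled as prompt + completion + test + a call to `check`. The toy problem and the `passes_tests` helper are hypothetical. Running `exec` on model output is unsafe outside a sandbox, so this is for illustration only.

```python
def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Assemble prompt + completion + tests, run them, and report pass/fail.

    WARNING: exec() on untrusted code is unsafe; a real harness isolates
    execution and enforces timeouts. This sketch does neither.
    """
    program = prompt + completion + "\n" + test + "\n" + f"check({entry_point})\n"
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        # A failed assert, a runtime error, or a syntax error all count as a failure.
        return False
    return True


# Hypothetical toy problem; not an actual HumanEval item.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
test = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

reference_style = "    return a + b\n"
different_but_correct = "    return sum((a, b))\n"
wrong = "    return a - b\n"

print(passes_tests(prompt, reference_style, test, "add"))        # True
print(passes_tests(prompt, different_but_correct, test, "add"))  # True
print(passes_tests(prompt, wrong, test, "add"))                  # False
```

The second call is the point of the design: a completion that differs textually from the reference still counts as correct because only test outcomes are compared.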
Reproducibility — 92
92 / 100. Fully reproducible. Problems and tests are public. The evaluation harness is published (openai/human-eval). Standard pass@k math is well-defined. Anyone can re-run the full evaluation on a model's generations and reproduce the published numbers exactly (modulo sampling temperature). The 8-point deduction is for the original sampling choices in the Codex paper (temperature 0.2 for pass@1 and 0.8 for pass@100, with nucleus sampling at top-p 0.95), which are well documented but not always replicated faithfully in third-party reports.
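For readers who want the pass@k math spelled out, here is a sketch of the unbiased estimator given in the Codex paper: with n samples per problem, of which c pass, pass@k is estimated as 1 - C(n-c, k) / C(n, k), and the benchmark score is the mean over problems. The function name and the per-problem counts below are illustrative assumptions, not harness code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget being scored.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems.
per_problem = [(10, 3), (10, 0), (10, 10)]
scores = [pass_at_k(n, c, k=1) for n, c in per_problem]

print(round(pass_at_k(10, 3, 1), 3))          # 0.3
print(round(sum(scores) / len(scores), 3))    # mean pass@1 across the three problems
```

With k = 1 the estimator reduces to c / n, which is why pass@1 behaves like a plain proportion when computing confidence intervals.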
Documentation — 88
88 / 100. The Codex paper documents construction methodology, problem-distribution statistics, evaluation protocol, and per-domain performance for the original Codex models. The HuggingFace dataset card mirrors this. What's missing: a published per-problem difficulty distribution, a published topic-domain breakdown (how many string-manipulation vs. dynamic-programming vs. recursion-heavy problems), and a formal datasheet-for-datasets following the Gebru et al. template. Adequate for the era; not best-in-class by 2026 standards.
Maintainer reputation — 85
85 / 100. OpenAI shipped the benchmark in 2021 alongside the Codex paper. It has not been actively maintained since: no version updates, no errata log, no responsive issue tracker on the openai/human-eval repo. The benchmark has effectively been adopted by the community (BigCode, HuggingFace, lm-eval-harness all maintain their own integrations). For an academic benchmark this is acceptable; for a procurement-cited artifact it's a yellow flag. The 15-point deduction reflects the maintenance dynamic, not any historical wrongdoing.
Deduplication / uniqueness — 72
72 / 100. Internal deduplication is fine: no two problems within HumanEval are paraphrases of each other. The procurement concern is external: HumanEval problems map to common algorithmic patterns (string reverse, list filtering, classic dynamic-programming patterns) that appear in countless coding-interview websites, LeetCode-style archives, and academic textbooks. The boundary between "HumanEval problem" and "this exact problem on a coding-prep site" is fuzzy. The contamination report quantifies this; the dedup score reflects the boundary fuzziness rather than internal duplication.
Topic / domain coverage — 62
62 / 100. 164 problems span basic algorithms, string manipulation, mathematical sequences, list/dict operations, and simple control-flow patterns. Notably absent: object-oriented design problems, multi-file projects, async/concurrent code, type-system reasoning, library-use patterns (using requests, pandas, sklearn), debugging-existing-code tasks, refactoring tasks. A model that excels at HumanEval may be weak at any of these. For procurement use cases beyond "generate a small Python function," HumanEval undersells coverage gaps.
Language coverage — 55
55 / 100. Python only. For a 2021 benchmark this was sensible (Python was and is the dominant language for ML and Codex-era models). For 2026 procurement use cases, JavaScript / TypeScript, Rust, Go, Java, C++ are all materially under-represented. The MultiPL-E project (Cassano et al.) translated HumanEval to 18 languages, but those translations are downstream artifacts, not part of HumanEval itself. A code model audited solely against HumanEval is audited only for Python capability.
Sample size adequacy — 42
42 / 100. 164 problems. At typical model performance (70–95% pass@1), Wilson 95% binomial intervals have half-widths of roughly ±3.5 to ±7 percentage points, narrowest at the high end of that range. Two models scoring 89% and 92% have overlapping confidence intervals: the published gap is smaller than the binomial noise. Yet HumanEval scores are routinely reported to 0.1% precision in model cards and leaderboards, implying signal the sample size doesn't support. For procurement comparisons within ~5 percentage points, HumanEval cannot adjudicate. Bigger benchmarks (BigCodeBench, APPS, the larger MBPP set) reduce this floor.
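The interval arithmetic above can be reproduced with a few lines. This is a sketch of the standard Wilson score interval for a binomial proportion, applied to n = 164; the function names are illustrative, and z = 1.96 is the usual normal quantile for a 95% interval.

```python
from math import sqrt


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p_hat over n trials."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_PROBLEMS = 164  # HumanEval size, from the audit text

for score in (0.70, 0.80, 0.95):
    lo, hi = wilson_interval(score, N_PROBLEMS)
    print(f"pass@1={score:.2f}: [{lo:.3f}, {hi:.3f}], half-width {(hi - lo) / 2:.3f}")

model_a = wilson_interval(0.89, N_PROBLEMS)
model_b = wilson_interval(0.92, N_PROBLEMS)
print(intervals_overlap(model_a, model_b))  # True
```

At 80% pass@1 the half-width comes out close to 0.061, matching the ±6.1 point figure quoted earlier in this audit. Overlap of two intervals is a conservative screen rather than a formal significance test, so it is best read as "the sample cannot separate these scores", not as proof that the models are equal.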
Contamination cleanliness — 38
38 / 100. Code benchmarks have a distinct contamination problem: the reference solutions get scraped into downstream training corpora. The LabelSets Contamination Report 001 found HumanEval in 11 of 80 scanned post-training datasets at non-zero similarity. The Stack (a 6 TB code corpus widely used for code-model pretraining) explicitly contains HumanEval problems: they exist on GitHub, and The Stack scrapes GitHub. Models trained on any post-2021 code corpus have likely seen HumanEval during training. The standard model-card claim "we evaluated on HumanEval" rarely includes a documented decontamination pass.
The community fix has been "use HumanEval+" (Liu et al. 2023, which adds more rigorous tests to existing problems but doesn't change the prompts) or "use BigCodeBench / LiveCodeBench" which extend the difficulty and freshness profile. Original HumanEval scores in 2026 model cards should be treated as a baseline-completeness check, not a clean capability signal.
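To make "a documented decontamination check" concrete, here is one simple possibility: token n-gram containment between a benchmark solution and a training document. This is a generic sketch and is not the method used by the LabelSets scanner, which this report does not describe. The tokenizer, the n = 5 window, and both example strings are assumptions chosen for illustration.

```python
import re


def token_ngrams(code: str, n: int) -> set[tuple[str, ...]]:
    """Split code into identifiers, numbers, and single symbols, then take n-grams."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(benchmark_code: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the benchmark's n-grams that also occur in the corpus document."""
    bench = token_ngrams(benchmark_code, n)
    if not bench:
        return 0.0
    return len(bench & token_ngrams(corpus_doc, n)) / len(bench)


# Hypothetical strings; neither is an actual HumanEval solution or corpus file.
solution = "def add(a, b):\n    return a + b\n"
leaked_copy = "# utils\ndef add(a,b):\n  return a+b\nprint(add(1, 2))\n"
unrelated = "x = [i * 2 for i in range(10)]\n"

print(ngram_containment(solution, leaked_copy))  # 1.0
print(ngram_containment(solution, unrelated))    # 0.0
```

Tokenizing before comparing makes the check insensitive to whitespace changes, which is why the reformatted copy still scores 1.0. A production check would need a threshold, handling for renamed identifiers, and an index over the corpus instead of pairwise comparison.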
Adversarial robustness — 48
48 / 100. HumanEval problems use natural-language docstrings as prompts. A model can pass the tests without "solving the problem" by overfitting to docstring keywords or recognizing memorized solutions. Adversarial variants (rephrased docstrings, problem reorderings, prompt-jitter benchmarks) consistently show 5–15 percentage point drops compared to original HumanEval scores. The benchmark is not robust to the kind of distribution shift a model encounters in deployment. The 52-point deduction reflects this fragility.
Procurement profile — what this means for code-AI buyers
- For "model card claims X% on HumanEval" as a screening signal: 80 (evaluation profile). Adequate to confirm a model isn't broken at basic Python generation. Inadequate to discriminate between models within ±5 percentage points.
- For procurement-cited capability evidence under SR 11-7 / FDA SaMD / §1557: Cite HumanEval as one signal. Pair with: (a) a held-out benchmark not in the public training corpus surface (BigCodeBench, LiveCodeBench, or a private eval), (b) a documented decontamination check on the model's training corpus, (c) coverage benchmarks for non-Python languages if the deployment target requires them.
- For comparing two leading code models published in 2024–2026: HumanEval cannot adjudicate gaps smaller than 5 percentage points. Use larger benchmarks for fine-grained ranking.
- For academic research: 81. Acceptable. Cite the contamination caveat in your paper's limitations section.
Comparison to the audit series so far
| Report | Dataset | LQS |
|---|---|---|
| 006 · Healthcare | MIMIC-IV (clinical) | 93 / Platinum |
| 005 · Code | HumanEval (code gen) | 83 / Gold |
| 002 · Benchmark | MMLU (LLM eval) | 83 / Gold |
| 003 · Pretraining | RedPajama-V2 (30T tokens) | 81 / Gold |
| 001 · Pretraining | FineWeb-Edu (1.3T tokens) | 73 / Silver |
HumanEval and MMLU tie at LQS 83 / Gold, but for opposite reasons. MMLU's score is held down by label noise and contamination; HumanEval's by sample-size and contamination. The benchmarks are constructed differently — HumanEval's mechanical validation is genuinely better than MMLU's crowdsourced labels — but the procurement-relevant ceiling is the same.
Methodology
This audit was scored under LQS v3.1 with the code-benchmark adapter. Every dimension above maps to a documented rubric in the methodology preprint (DOI 10.5281/zenodo.20278981).
The 7-oracle consensus pass was not run for this report — HumanEval problems are individually graded by their unit tests, which is itself a more robust validation than oracle consensus could provide. The audit is metadata- and structure-based, same lens used for the other reports.
Recourse. If you maintain HumanEval or are otherwise authorized to speak for OpenAI's release and believe any score here is wrong, the recourse process is documented in the methodology preprint §7. File an issue at the public-audit repo with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable cert hash; corrections are issued as new versions.
What this audit doesn't claim
- It does not claim HumanEval is a bad benchmark. Gold tier (83) is "fit for procurement with documented caveats." The construction is genuinely better than crowdsourced alternatives.
- It does not claim OpenAI did anything wrong. Every limitation flagged was a sensible 2021 design choice. The procurement-relevant ceiling exists because the benchmark hasn't been refreshed at scale for 2026 use cases.
- It does not predict downstream model performance. A model with a high HumanEval score may still ship code-gen products people love. LQS scores benchmark fitness for procurement evidence, not model utility.
- It does not invalidate the benchmark's academic value. For research methodology comparisons under controlled conditions, HumanEval remains the canonical anchor.
What's next
This is Report 005 in the public-audit series. Coming up:
- Report 004 — The Pile (EleutherAI). Foundational. Known Books3 / copyright surface. Queued.
- Report 007 — A radiology imaging corpus (CheXpert or MIMIC-CXR). First imaging-modality audit under the FDA SaMD lens.
- Contamination Report 002. Pretraining-corpus contamination scan against the 40+ benchmark fingerprints.