Public LQS Audit · Report 005 · Code

HumanEval — a procurement-grade audit of the code benchmark every model card cites.

Composite 83 / 100. Gold tier. 164 hand-crafted Python problems. The cleanest construction of any LLM benchmark we've audited (every problem is unit-tested for correctness rather than vote-tallied), but the dataset's own design choices make it inadequate as a sole capability signal. Two procurement-relevant ceilings: a sample-size confidence interval wider than the gaps between leading models, and a contamination surface that grows each month as The Stack and GitHub crawls grow. Open methodology, signed result.

Published May 19, 2026 · LabelSets Research · 9 min read · Author: Alex Adrion

83 / 100

Gold

LQS v3.1 composite · default profile
evaluation profile: 80 · classification profile: 82 · RAG profile: 85

Dimension	Score
Format compliance	95
License clarity	95
Test-driven validation	94
Reproducibility	92
Documentation	88
Maintainer reputation	85
Deduplication / uniqueness	72
Topic / domain coverage	62
Language coverage	55
Sample size adequacy	42
Contamination cleanliness	38
Adversarial robustness	48

What we audited

Dataset	openai/openai_humaneval
Size	164 hand-crafted Python problems
Structure	Each problem: function signature + docstring + reference solution + unit test suite (avg. 7.7 tests per problem)
Modality	Code generation (Python only)
License	`MIT` — permissive, commercial-friendly, unambiguous
Maintainer	OpenAI (Mark Chen, Jerry Tworek, Heewoo Jun et al.)
Paper	arXiv:2107.03374 (Codex paper, 2021)
Year released	2021
Evaluation metric	`pass@k` — fraction of problems with at least one passing solution in k samples
Distribution format	JSONL via HuggingFace, also available in original GitHub release
Citations (Google Scholar)	4,500+ — the standard code-generation benchmark in LLM research

The headline finding

HumanEval was constructed correctly. Every problem has a reference solution and an executable unit-test suite, so "correctness" means "the generated code passes the tests" — not "the generated code matches the reference exactly." This is procurement-grade construction: the validation function is mechanical, not crowdsourced. It's why we score test-driven validation at 94 and label noise as effectively a non-issue. Compare this to MMLU, where label noise was the second-biggest concern.

But construction-quality alone doesn't determine procurement-fitness. Two structural features mean HumanEval's reported scores carry less signal than the precision of the numbers implies.

164 problems is a small sample. A 95% binomial confidence interval at 80% pass@1 is roughly ±6.1 percentage points (Wilson interval). A model reported at 87% and a model reported at 84% have overlapping 95% intervals; the gap is smaller than the noise floor introduced by the sample size. Procurement teams comparing models on the basis of "Model A scored 91% on HumanEval, Model B scored 88%" are reading more signal into the numbers than the sample supports. The benchmark itself does not control for this — it simply reports pass@1, pass@10, pass@100 without intervals. We score sample-size adequacy at 42.
Contamination is asymmetric and growing. Code benchmarks have a unique contamination profile: the reference solutions get scraped into downstream training corpora like The Stack, GitHub crawls, code-generation datasets, and various RLHF preference sets. The LabelSets contamination scanner finds HumanEval in 11 of 80 scanned post-training datasets at non-zero similarity (see the report for specifics). Models trained on any post-2021 code corpus have plausibly seen substantial fractions of HumanEval during training, and the standard pre-train-then-eval workflow used for code LLMs has no built-in decontamination pass. We score contamination cleanliness at 38.
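The ±6.1-point figure in the sample-size point above can be reproduced by hand from the Wilson score interval. The working below is a sketch under two assumptions that are not stated in the audit: z = 1.96 for the 95% level, and 80% treated as an exact proportion (80% of 164 is not a whole number of problems).

```latex
\[
\hat{p} = 0.80, \qquad n = 164, \qquad z = 1.96
\]
\[
\text{centre}
  = \frac{\hat{p} + \dfrac{z^{2}}{2n}}{1 + \dfrac{z^{2}}{n}}
  = \frac{0.8000 + 0.0117}{1.0234}
  \approx 0.793
\]
\[
\text{half-width}
  = \frac{z}{1 + \dfrac{z^{2}}{n}}
    \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n} + \frac{z^{2}}{4n^{2}}}
  = 1.915 \times \sqrt{0.000976 + 0.000036}
  \approx 0.061
\]
\[
\text{95\% interval} \approx [\,0.732,\; 0.854\,]
\]
```

The interval is centred slightly below 80% because the Wilson construction pulls the centre toward 50%; the half-width is what matters for comparing two models.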

Why this audit exists. Filings for code models in regulated procurement (FDA SaMD for medical-device software, financial-services model risk under SR 11-7, anti-discrimination review under §1557) commonly cite HumanEval as their capability evidence. The benchmark is fit for purpose for academic ranking but unfit as the sole capability signal in a model-risk filing. The procurement-grade approach: cite HumanEval + at least one held-out benchmark + a documented contamination check. This audit makes the gap mechanical instead of folkloric.

Dimension-by-dimension reasoning

Format compliance — 95

95 / 100

Clean JSONL via HuggingFace, drop-in compatible with the datasets library. Original release on GitHub is canonical. Every problem record carries task_id, prompt, canonical_solution, test, entry_point. EleutherAI's lm-eval-harness, BigCode's harness, and OpenAI's own evals all consume HumanEval without custom code. The 5-point deduction is for the JSONL-only canonical release (no parquet variant ships from OpenAI; HF mirrors offer one).
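As an illustration of what "format compliance" means in practice, the sketch below parses one JSONL line and checks for the five fields listed above. The record is a hypothetical toy example, not a real HumanEval problem, and the assumption that every field is a string is ours rather than something this audit verified.

```python
import json

# The five fields named in the audit text.
REQUIRED_FIELDS = ("task_id", "prompt", "canonical_solution", "test", "entry_point")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check that the expected string fields exist."""
    record = json.loads(line)
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    non_str = [f for f in REQUIRED_FIELDS if not isinstance(record[f], str)]
    if non_str:
        raise ValueError(f"fields are not strings: {non_str}")
    return record


# Hypothetical toy record for illustration only (NOT taken from HumanEval).
toy_line = json.dumps({
    "task_id": "Toy/0",
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
    "entry_point": "add",
})

record = parse_record(toy_line)
print(record["task_id"], record["entry_point"])
```

A check like this is cheap to run over all 164 lines before any model is evaluated, so schema drift in a mirror is caught early.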

Sources: HF dataset card · openai/human-eval GitHub repo · BigCode evaluation harness

License clarity — 95

95 / 100

MIT license. Unambiguous, commercial-friendly, no inherited license terms from upstream sources (every problem and test is OpenAI-authored, not scraped). This is the model other benchmark releases should follow. 5-point deduction reflects the absence of an explicit ToU addendum addressing the contamination question — should downstream users disclose if their training corpus may have included HumanEval problems?

Sources: openai/human-eval LICENSE file · Codex paper Section 6 (release statement)

Test-driven validation — 94

94 / 100

This is HumanEval's structural strength. Every problem ships with executable unit tests (avg. 7.7 tests per problem) authored alongside the reference solution. Correctness is determined by passing the tests, not by matching a single canonical solution. This eliminates the label-noise concern that dominates the MMLU audit — there's no "is this the right answer?" disagreement when the answer is "does this code pass these tests?" 6-point deduction reflects the rare cases where tests are themselves under-specified (a test that doesn't cover an edge case the canonical solution handles, or vice versa).
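The mechanical nature of this validation can be sketched in a few lines. The example below assembles prompt, completion, and tests into one program and runs it; it is a minimal illustration of the idea, not the openai/human-eval harness, and the toy problem is hypothetical. It also omits sandboxing and timeouts, which any real evaluation of untrusted model output needs.

```python
def build_check_program(problem: dict, completion: str) -> str:
    """Concatenate prompt, model completion, tests, and the final check call."""
    return (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + "\n"
        + f"check({problem['entry_point']})\n"
    )


def passes(problem: dict, completion: str) -> bool:
    """Return True if the assembled program runs without raising.

    WARNING: exec() on untrusted model output is unsafe. This sketch has no
    sandbox and no timeout; it is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(build_check_program(problem, completion), namespace)
    except Exception:
        # Covers failed assertions, syntax errors, and runtime errors alike.
        return False
    return True


# Hypothetical toy problem for illustration (NOT taken from HumanEval).
toy_problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

print(passes(toy_problem, "    return a + b\n"))  # matches the reference body
print(passes(toy_problem, "    return b + a\n"))  # differs textually, same behaviour
print(passes(toy_problem, "    return a - b\n"))  # wrong behaviour
```

The second completion is the point of the design: it would fail an exact-match comparison against a reference solution, yet it is accepted because only behaviour under the tests is graded.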

Sources: openai/human-eval test files · Liu et al. 2023 (HumanEval+) which identified ~10% of HumanEval problems with under-tested edge cases

Reproducibility — 92

92 / 100

Fully reproducible. Problems and tests are public. The evaluation harness is published (openai/human-eval). Standard pass@k math is well-defined. Anyone can re-run the full evaluation on a model's generations and reproduce the published numbers exactly (modulo sampling temperature). The 8-point deduction is for the original sampling choices in the Codex paper (different temperatures for different k values, such as T=0.2 for pass@1 and T=0.8 for pass@100, with nucleus sampling at top-p 0.95), which are well documented but not always replicated faithfully in third-party reports.
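For readers who want the pass@k math spelled out, the Codex paper gives an unbiased per-problem estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated and c of them pass. The sketch below implements that formula with the standard library; the function names and the `results` list layout are our own choices for illustration, not part of any harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: budget scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples generated, samples passed).
example_results = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(example_results, k=1))
print(benchmark_pass_at_k(example_results, k=5))
```

With k = 1 the estimator reduces to c / n per problem, which is why pass@1 is often described simply as the fraction of samples that pass.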

Sources: openai/human-eval repo · Codex paper Section 2 (evaluation methodology) · BigCode harness documentation

Documentation — 88

88 / 100

The Codex paper documents construction methodology, problem-distribution statistics, evaluation protocol, and per-domain performance for the original Codex models. The HuggingFace dataset card mirrors this. What's missing: a published per-problem difficulty distribution, a published topic-domain breakdown (how many string-manipulation vs. dynamic-programming vs. recursion-heavy problems), and a formal datasheet-for-datasets following the Gebru et al. template. Adequate for the era; not best-in-class by 2026 standards.

Sources: arXiv:2107.03374 Section 2 · HF dataset card · LQS documentation rubric

Maintainer reputation — 85

85 / 100

OpenAI shipped the benchmark in 2021 alongside the Codex paper. It has not been actively maintained since — no version updates, no errata log, no responsive issue tracker on the openai/human-eval repo. The benchmark has effectively been adopted by the community (BigCode, HuggingFace, lm-eval-harness all maintain their own integrations). For an academic benchmark this is acceptable; for a procurement-cited artifact it's a yellow flag. 15-point deduction reflects the maintenance dynamic, not any historical wrongdoing.

Sources: openai/human-eval issue tracker · GitHub commit history · third-party maintainer activity (BigCode, HF)

Deduplication / uniqueness — 72

72 / 100

Internal deduplication is fine — no two problems within HumanEval are paraphrases of each other. The procurement concern is external: HumanEval problems map to common algorithmic patterns (string reverse, list filtering, classic dynamic-programming patterns) that appear in countless coding-interview websites, LeetCode-style archives, and academic textbooks. The boundary between "HumanEval problem" and "this exact problem on a coding-prep site" is fuzzy. The contamination report quantifies this; the dedup score reflects the boundary fuzziness rather than internal duplication.

Sources: independent overlap analyses · LeetCode pattern catalogues · LQS uniqueness rubric for benchmark questions

Topic / domain coverage — 62

62 / 100

164 problems span basic algorithms, string manipulation, mathematical sequences, list/dict operations, and simple control-flow patterns. Notably absent: object-oriented design problems, multi-file projects, async/concurrent code, type-system reasoning, library-use patterns (using requests, pandas, sklearn), debugging-existing-code tasks, refactoring tasks. A model that excels at HumanEval may be weak at any of these. For procurement use cases beyond "generate a small Python function," a HumanEval score leaves these coverage gaps unmeasured.

Sources: per-problem categorization (LabelSets internal) · BigCodeBench paper (Zhuo et al. 2024) which addresses this gap · APPS benchmark which extends the difficulty range

Language coverage — 55

55 / 100

Python only. For a 2021 benchmark this was sensible (Python was and is the dominant language for ML and Codex-era models). For 2026 procurement use cases, JavaScript / TypeScript, Rust, Go, Java, and C++ are not represented at all. The MultiPL-E project (Cassano et al.) translated HumanEval to 18 languages, but those translations are downstream artifacts, not part of HumanEval itself. A code model audited solely against HumanEval is audited only for Python capability.

Sources: openai/human-eval problem listing · MultiPL-E paper (Cassano et al. 2023) · LQS language-coverage rubric

Sample size adequacy — 42

42 / 100

164 problems. At typical model performance (70–95% pass@1), Wilson 95% binomial intervals are roughly ±3.5 to ±7 percentage points. Two models scoring 89% and 92% have overlapping confidence intervals: the published gap is smaller than the binomial noise. Yet HumanEval scores are routinely reported to 0.1% precision in model cards and leaderboards, implying signal the sample size doesn't support. For procurement comparisons within ~5 percentage points, HumanEval cannot adjudicate. Bigger benchmarks (BigCodeBench, APPS, the larger MBPP set) reduce this floor.
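The interval widths quoted above can be checked with a short script. This is a sketch assuming z = 1.96 and treating each reported pass@1 as a simple binomial proportion over 164 problems; the helper names are ours, and interval overlap is used as the comparison rule only because that is the rule this section applies.

```python
from math import sqrt


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion p_hat observed over n trials."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre - half, centre + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_PROBLEMS = 164  # size of HumanEval

for rate in (0.70, 0.80, 0.90, 0.95):
    lo, hi = wilson_interval(rate, N_PROBLEMS)
    half_pts = 100 * (hi - lo) / 2
    print(f"pass@1 {rate:.0%}: [{lo:.3f}, {hi:.3f}], half-width {half_pts:.1f} pts")

model_a = wilson_interval(0.89, N_PROBLEMS)
model_b = wilson_interval(0.92, N_PROBLEMS)
print("89% vs 92% intervals overlap:", intervals_overlap(model_a, model_b))
```

Changing `N_PROBLEMS` to the size of a larger benchmark shows how quickly the half-width shrinks, which is the practical argument for pairing HumanEval with a bigger problem set.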

Sources: Wilson 1927 binomial interval · Codex paper sample-size statistics · LQS size-adequacy rubric for benchmarks

Contamination cleanliness — 38

38 / 100

Code benchmarks have a distinct contamination problem: the reference solutions get scraped into downstream training corpora. The LabelSets Contamination Report 001 found HumanEval in 11 of 80 scanned post-training datasets at non-zero similarity. The Stack (a 6 TB code corpus widely used for code-model pretraining) explicitly contains HumanEval problems — they exist on GitHub, and The Stack scrapes GitHub. Models trained on any post-2021 code corpus have likely seen HumanEval during training. The standard model-card claim "we evaluated on HumanEval" rarely includes a documented decontamination pass.
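A documented decontamination pass can start from something as simple as n-gram containment between a benchmark item and each training document. The sketch below is a generic illustration of that idea and is not the method used by the LabelSets scanner, which this report does not describe; the tokenizer, the choice of n = 5, and the toy strings are all assumptions.

```python
import re


def token_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Split text into identifier, number, and symbol tokens; return its n-grams."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(benchmark_text: str, corpus_text: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus text."""
    bench = token_ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & token_ngrams(corpus_text, n)) / len(bench)


# Hypothetical toy strings for illustration (NOT real HumanEval or corpus data).
benchmark_item = "def add(a, b):\n    return a + b\n"
corpus_copy = "# utils.py\ndef add(a, b):\n    return a + b\n"
corpus_other = "def mul(x, y):\n    return x * y\n"

print(containment(benchmark_item, corpus_copy))
print(containment(benchmark_item, corpus_other))
```

Containment is measured from the benchmark's side so that a short problem embedded in a long file still scores high; a symmetric measure such as Jaccard similarity would dilute that signal. Any threshold for flagging a document would have to be chosen and documented by whoever runs the check.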

The community fix has been "use HumanEval+" (Liu et al. 2023, which adds more rigorous tests to existing problems but doesn't change the prompts) or "use BigCodeBench / LiveCodeBench" which extend the difficulty and freshness profile. Original HumanEval scores in 2026 model cards should be treated as a baseline-completeness check, not a clean capability signal.

Sources: LabelSets Contamination Report 001 · The Stack documentation · BigCodeBench paper · LiveCodeBench paper

Adversarial robustness — 48

48 / 100

HumanEval problems use natural-language docstrings as prompts. A model can pass the tests without "solving the problem" by overfitting to docstring keywords or recognizing memorized solutions. Adversarial variants (rephrased docstrings, problem reorderings, prompt-jitter benchmarks) consistently show 5–15 percentage point drops compared to original HumanEval scores. The benchmark is not robust to the kind of distribution shift a model encounters in deployment. 52-point deduction reflects this fragility.

Sources: prompt-jitter studies on code LLMs · HumanEval+ adversarial analysis · LiveCodeBench freshness-based stress tests

Procurement profile — what this means for code-AI buyers

For "model card claims X% on HumanEval" as a screening signal: 80 (evaluation profile). Adequate to confirm a model isn't broken at basic Python generation. Inadequate to discriminate between models within ±5 percentage points.
For procurement-cited capability evidence under SR 11-7 / FDA SaMD / §1557: Cite HumanEval as one signal. Pair with: (a) a held-out benchmark not in the public training corpus surface (BigCodeBench, LiveCodeBench, or a private eval), (b) a documented decontamination check on the model's training corpus, (c) coverage benchmarks for non-Python languages if the deployment target requires them.
For comparing two leading code models published in 2024–2026: HumanEval cannot adjudicate gaps smaller than 5 percentage points. Use larger benchmarks for fine-grained ranking.
For academic research: 81. Acceptable. Cite the contamination caveat in your paper's limitations section.

Comparison to the audit series so far

Report	Dataset	LQS
006 · Healthcare	MIMIC-IV (clinical)	93 / Platinum
005 · Code	HumanEval (code gen)	83 / Gold
002 · Benchmark	MMLU (LLM eval)	83 / Gold
003 · Pretraining	RedPajama-V2 (30T tokens)	81 / Gold
001 · Pretraining	FineWeb-Edu (1.3T tokens)	73 / Silver

HumanEval and MMLU tie at LQS 83 / Gold, but for different reasons. MMLU's score is held down by label noise and contamination; HumanEval's by sample size and contamination. The benchmarks are constructed differently (HumanEval's mechanical validation is genuinely better than MMLU's crowdsourced labels), but the procurement-relevant ceiling is the same.

Methodology

This audit was scored under LQS v3.1 with the code-benchmark adapter. Every dimension above maps to a documented rubric in the methodology preprint (DOI 10.5281/zenodo.20278981).

The 7-oracle consensus pass was not run for this report — HumanEval problems are individually graded by their unit tests, which is itself a more robust validation than oracle consensus could provide. The audit is metadata- and structure-based, same lens used for the other reports.

Recourse. If you maintain HumanEval or are otherwise authorized to speak for OpenAI's release and believe any score here is wrong, the recourse process is documented in the methodology preprint §7. File an issue at the public-audit repo with a counter-citation; we publish a v1.1 with the correction and a changelog. We do not modify scores under non-public pressure. Every published score carries an immutable cert hash; corrections are issued as new versions.

What this audit doesn't claim

It does not claim HumanEval is a bad benchmark. Gold tier (83) is "fit for procurement with documented caveats." The construction is genuinely better than crowdsourced alternatives.
It does not claim OpenAI did anything wrong. Every limitation flagged was a sensible 2021 design choice. The procurement-relevant ceiling exists because the benchmark hasn't been refreshed at scale for 2026 use cases.
It does not predict downstream model performance. A model with a high HumanEval score may still ship code-gen products people love. LQS scores benchmark fitness for procurement evidence, not model utility.
It does not invalidate the benchmark's academic value. For research methodology comparisons under controlled conditions, HumanEval remains the canonical anchor.

What's next

This is Report 005 in the public-audit series. Coming up:

Report 004 — The Pile (EleutherAI). Foundational. Known Books3 / copyright surface. Queued.
Report 007 — A radiology imaging corpus (CheXpert or MIMIC-CXR). First imaging-modality audit under the FDA SaMD lens.
Contamination Report 002. Pretraining-corpus contamination scan against the 40+ benchmark fingerprints.
