💬 Curated Catalog · NLP / Text

HumanEval — Code Generation Benchmark

164 hand-written Python programming problems with unit tests — the canonical pass@k code benchmark.

LQS 73 · silver ✓ Commercial OK 164 programming problems 1 MB JSONL Released 2021
Browse commercial NLP / Text → Visit original source ↗
Source: github.com/openai/human-eval · maintained by OpenAI
164 programming problems · 1 MB on disk · LQS 73 (silver) · First released 2021

About this dataset

HumanEval is the standard pass@k benchmark for evaluating LLM code generation. It contains 164 hand-written Python programming problems, each with a function signature, docstring, reference solution, and unit tests (about 7.7 per problem on average). It was released by OpenAI in 2021 alongside Codex. Despite its small size, it remains the most-cited benchmark in code-LLM papers, frequently paired with MBPP and the larger HumanEval-X multilingual extension.
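
The pass@k number reported against HumanEval is usually computed with the unbiased estimator described in the Codex paper: generate n samples per problem, count the c that pass all unit tests, and estimate the chance that at least one of k sampled completions is correct. A minimal sketch in Python; the openai/human-eval repository ships its own equivalent implementation, so this is for illustration only:

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator: 1 - C(n-c, k) / C(n, k), i.e. one minus the
        # probability that all k sampled completions come from the failures.
        if n - c < k:
            return 1.0  # every size-k subset must contain a passing sample
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # Hypothetical example: 200 samples for one problem, 37 pass all tests.
    print(pass_at_k(n=200, c=37, k=1))    # 0.185
    print(pass_at_k(n=200, c=37, k=10))   # ≈ 0.88

Per-problem estimates are then averaged across all 164 problems to give the headline pass@k score.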

Maintainer: OpenAI
License: MIT
Formats: JSONL

LabelSets Quality Score

LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →

73 out of 100 · silver tier

Solid dataset with some trade-offs

Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.

Completeness 92
No public completeness metric; using prior for 'research_release' datasets.
Uniqueness 68
Minimal deduplication disclosed.
Validation 68
Hand-written problems; no formal QC protocol disclosed.
Size adequacy 60
164 items — below 100,000 target for NLP / Text, but usable.
Format compliance 95
Industry-standard format — drop-in compatible with mainstream tooling.
Label density 52
Average 1.0 labels per item (sparse).
Class balance 75
Moderate class skew — realistic production distribution.

What it's used for

Common tasks and benchmarks where HumanEval — Code Generation Benchmark is the default or competitive choice.

Sample statistics

What's actually in the dataset — from the maintainer's published stats.

164 problems, each with function signature + docstring + reference solution + test cases (avg ~7.7 tests/problem).
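
Each JSONL line is one problem. A minimal loading sketch using only the standard library; the local path is an assumption (the openai/human-eval repository distributes the data as a gzipped JSONL that must be downloaded and decompressed first), and the field names follow the repository's published schema:

    import json

    def load_problems(path="HumanEval.jsonl"):
        # Yield one problem dict per JSONL line.
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    first = next(load_problems())
    print(first["task_id"])       # e.g. "HumanEval/0"
    print(first["entry_point"])   # name of the function to be completed
    print(first["prompt"][:120])  # signature + docstring shown to the model
    # first["canonical_solution"] holds the reference body;
    # first["test"] holds the unit-test code run for pass@k scoring.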

License

HumanEval — Code Generation Benchmark is distributed under MIT. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.

Need commercial-licensed NLP / Text data?

LabelSets sellers offer paid NLP / Text datasets with guarantees that public datasets often can't give you.

Browse paid NLP / Text → Sell your dataset

Similar public datasets

Other entries in the NLP / Text catalog.

Frequently Asked Questions

Is HumanEval licensed for commercial use?
HumanEval — Code Generation Benchmark is distributed under MIT, which generally permits commercial use. Always verify the current license terms with the maintainer (OpenAI) before using it in a commercial product.

How large is HumanEval?
HumanEval contains 164 programming problems, each with a function signature, docstring, reference solution, and test cases (avg ~7.7 tests/problem).

Who maintains HumanEval and where is it hosted?
HumanEval is maintained by OpenAI and is available at https://github.com/openai/human-eval. LabelSets indexes and scores this dataset for discoverability but does not redistribute it.

What is the LabelSets Quality Score (LQS)?
LQS is a 7-dimension quality score (completeness, uniqueness, validation, size adequacy, format compliance, label density, class balance) computed from the dataset's published statistics. Composite scores map to tiers: platinum (≥90), gold (≥75), silver (≥60), bronze (<60). Read the full methodology.