164 hand-written Python programming problems with unit tests — the canonical pass@k code benchmark.
HumanEval is the standard pass@k benchmark for evaluating LLM code generation. It contains 164 hand-written Python programming problems, each with a function signature, docstring, body, and several unit tests, and was released by OpenAI alongside Codex. Despite its small size, it remains the most-cited benchmark in code-LLM papers, frequently paired with MBPP and the larger HumanEval-X multilingual extension.
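Scores on HumanEval are usually reported with the unbiased pass@k estimator from the Codex paper: sample n completions per problem, count the c that pass all unit tests, and estimate 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal Python sketch of that estimator (the helper name is ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
    where n completions were sampled per problem and c passed all unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # Numerically stable product form, as in the paper's reference implementation.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 correct -> report pass@1 / pass@10 / pass@100
# scores = [pass_at_k(200, 37, k) for k in (1, 10, 100)]
```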
LQS is our 7-dimension quality score, computed from the dataset's published statistics. See methodology →
Composite score computed from the 7 dimensions below: completeness, uniqueness, validation health, size adequacy, format compliance, label density, and class balance.
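For intuition only, the sketch below shows what an equal-weight composite over those seven dimensions could look like. The actual LQS weighting and normalization are defined on the methodology page, so the 0-100 scale, equal weights, and key names here are assumptions rather than the published formula.

```python
# Hypothetical sketch: assumes each dimension is already normalized to a 0-100
# sub-score and that all seven dimensions are weighted equally.
LQS_DIMENSIONS = (
    "completeness", "uniqueness", "validation_health", "size_adequacy",
    "format_compliance", "label_density", "class_balance",
)

def composite_lqs(sub_scores: dict[str, float]) -> float:
    """Equal-weight mean of the seven 0-100 dimension sub-scores."""
    return sum(sub_scores[d] for d in LQS_DIMENSIONS) / len(LQS_DIMENSIONS)
```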
Common tasks and benchmarks where HumanEval — Code Generation Benchmark is the default or competitive choice.
What's actually in the dataset — from the maintainer's published stats.
HumanEval — Code Generation Benchmark is distributed under MIT. This is a third-party public dataset; LabelSets indexes and scores it but does not host or redistribute the data. Always verify current license terms with the maintainer before commercial use.
LabelSets sellers offer paid NLP / Text datasets that provide what public datasets often can't:
Other entries in the NLP / Text catalog.