HumanEval
HumanEval is a code-generation benchmark introduced by OpenAI in 2021 with the Codex paper. It evaluates whether a model can synthesize short Python functions from natural-language docstrings and pass unit tests withheld from the prompt, making executable functional correctness a standard public measure for language models trained on code.
Definition
HumanEval is a benchmark for evaluating functional correctness in code generation. Each problem presents a Python function signature and a natural-language docstring, sometimes preceded by imports or helper definitions. The model must generate the function body so that the completed function passes the associated unit tests. The original benchmark contains 164 hand-written Python programming problems.
The tests are public as part of the benchmark, but they are not part of the prompt a model is supposed to answer. In a clean evaluation, the model should not have seen the tasks, solutions, tests, or close paraphrases through pretraining, fine-tuning, retrieval, benchmark-specific prompting, or repeated leaderboard optimization.
The benchmark is much narrower than software engineering. It does not ask a model to inspect a repository, update dependencies, review an architecture, design an API, handle secrets, assess licensing, negotiate ambiguous requirements, or maintain code over time. Its importance is that it made code evaluation executable: a model's answer was not just read by a judge, but run against tests.
Origin
HumanEval was released with OpenAI's paper Evaluating Large Language Models Trained on Code, which introduced Codex, a GPT model fine-tuned on publicly available code from GitHub. The paper used HumanEval to test whether code-trained language models could solve Python programming tasks from docstrings.
The original Codex paper reported that the 12-billion-parameter Codex model solved 28.8 percent of HumanEval problems in a single sample, while GPT-3 solved 0 percent and GPT-J solved 11.4 percent. With 100 samples per problem, Codex solved 70.2 percent of the problems, helping establish sampling strategy as part of code-model evaluation.
Current Context
As of June 19, 2026, HumanEval is best read as a historical baseline and regression test for code models, not as a decisive frontier benchmark. It remains useful because it is small, executable, easy to run, and widely cited. It is weak as a standalone signal because it is public, Python-only, focused on short standalone functions, and no longer representative of the hardest coding-agent workflows.
HumanEval also helped create the modern benchmark lifecycle problem. A strong new test becomes a public target; model teams report it; evaluation harnesses reproduce it; examples and solutions spread; and eventually the score becomes vulnerable to contamination, test overfitting, and saturation. That does not make the benchmark useless, but it changes the inference: a high HumanEval score is evidence about one compact test format, not proof of general software engineering ability.
The current evaluation ecosystem therefore treats HumanEval as one piece in a wider code-evaluation stack. HumanEval+ adds stricter tests. MBPP tests mostly basic Python programming tasks. MultiPL-E translates HumanEval and MBPP-style tasks across programming languages. LiveCodeBench continuously adds recent contest problems to reduce contamination risk. BigCodeBench stresses library use and complex instructions. SWE-bench-style benchmarks move from standalone functions to repository issue resolution.
Task Design
A HumanEval task is intentionally compact. The prompt describes a desired function in natural language, often with examples, and the system must generate a completion. The evaluator runs the generated function against unit tests that check expected behavior.
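The structure described above can be sketched with a toy problem. The field names `prompt`, `entry_point`, and `test` follow the released dataset, but the task itself is hypothetical and is not one of the 164 benchmark problems, and `run_task` is an illustrative helper rather than the official harness. The completion is hand-written; real model output is untrusted and should only be executed inside a sandbox.

```python
# A hypothetical HumanEval-style task (not taken from the real dataset).
task = {
    "task_id": "Toy/0",
    "prompt": (
        "def running_max(xs):\n"
        '    """Return a list where element i is the max of xs[:i+1].\n'
        "    >>> running_max([1, 3, 2])\n"
        "    [1, 3, 3]\n"
        '    """\n'
    ),
    "entry_point": "running_max",
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1, 3, 2]) == [1, 3, 3]\n"
        "    assert candidate([]) == []\n"
        "    assert candidate([5, 4, 6]) == [5, 5, 6]\n"
    ),
}

# Hand-written stand-in for model output: the function body only.
completion = (
    "    out, best = [], None\n"
    "    for x in xs:\n"
    "        best = x if best is None or x > best else best\n"
    "        out.append(best)\n"
    "    return out\n"
)


def run_task(task, completion):
    """Return True if prompt + completion passes the task's check function."""
    program = task["prompt"] + completion + "\n" + task["test"]
    namespace = {}
    try:
        # exec is unsafe for untrusted code; it is used here only because
        # the completion above is hand-written.
        exec(program, namespace)
        namespace["check"](namespace[task["entry_point"]])
        return True
    except Exception:
        return False


print(run_task(task, completion))
```

The tests never appear in the prompt: the model sees only the signature and docstring, and the evaluator appends the tests afterward.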
This design gave the field a clean signal: can a language model translate an English specification into runnable Python? It also helped separate code-generation evaluation from surface text metrics such as BLEU, which can underrate correct alternative implementations and overrate plausible but broken code.
Because HumanEval tasks are short, they are cheap to run and easy to compare across models. That made the benchmark attractive for research papers, model cards, open-source leaderboards, release announcements, and local regression checks.
The same compactness creates interpretive risk. The task often rewards a small function body from a clear docstring. It does not test whether a model can find the right files, preserve backward compatibility, understand a build system, select dependencies, coordinate with a reviewer, or notice that a passing patch is still insecure.
Scoring
HumanEval is usually reported with pass@k. The metric estimates the probability that at least one of k generated samples for a problem passes the tests. Pass@1 measures a single attempt; pass@10 or pass@100 measures whether repeated sampling finds a working solution.
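The Codex paper computes pass@k with an unbiased estimator: generate n samples per problem (n at least k), count the c samples that pass, and take 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below follows that formula using `math.comb` for readability, whereas the paper's reference code uses a numerically stable product. The function names and the example counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(results, k=1))
print(benchmark_pass_at_k(results, k=5))
```

Drawing more samples than k and using the estimator gives a lower-variance number than generating exactly k samples and checking whether any of them passes.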
This matters because code models can generate many candidates. A model may be unreliable in one shot but useful when combined with sampling, execution, filtering, or repair loops. HumanEval therefore helped normalize the idea that coding capability is partly a model property and partly an inference-and-verification pipeline.
Score Discipline
A HumanEval score should be read with the evaluation protocol attached. Important details include model version, prompt template, number of samples, temperature, pass@k value, post-processing, stop sequences, timeout, execution environment, sandboxing, whether generated code was filtered or repaired, and whether the tests were original HumanEval or HumanEval+.
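One way to keep those details attached to a score is to store them in a single record next to the result. The sketch below is a hypothetical schema, not a community standard: the class name, field names, and example values are all assumptions chosen to mirror the list above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of the settings behind one reported score."""
    model_version: str
    dataset_variant: str          # e.g. "HumanEval" or "HumanEval+"
    prompt_template: str
    n_samples: int                # samples generated per problem
    k: int                        # the k in pass@k
    temperature: float
    stop_sequences: tuple
    timeout_seconds: float
    execution_environment: str
    sandboxed: bool
    post_processing: str
    filtered_or_repaired: bool

    def __post_init__(self):
        # pass@k cannot be estimated from fewer than k samples.
        if not 1 <= self.k <= self.n_samples:
            raise ValueError("k must be between 1 and n_samples")


# Placeholder values for illustration only.
protocol = EvalProtocol(
    model_version="example-model-v1",
    dataset_variant="HumanEval",
    prompt_template="raw docstring prompt",
    n_samples=20,
    k=1,
    temperature=0.2,
    stop_sequences=("\ndef ", "\nclass "),
    timeout_seconds=3.0,
    execution_environment="python3.11, container",
    sandboxed=True,
    post_processing="truncate at first stop sequence",
    filtered_or_repaired=False,
)
record = asdict(protocol)
print(json.dumps(record, indent=2))
```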
Pass@1 and pass@100 are not the same kind of claim. Pass@1 says something about first-attempt reliability. A high pass@100 says that a search or sampling process can often find a working candidate if many attempts are allowed. That may be useful, but it consumes compute, requires execution, and can hide instability behind selection.
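A rough calculation shows how far the two claims can diverge on a single problem. The sketch assumes that each sample for that problem passes independently with the same probability p; the value of p is hypothetical.

```python
def at_least_one_pass(p: float, k: int) -> float:
    """Chance that at least one of k independent samples passes,
    given a per-sample pass probability p."""
    return 1.0 - (1.0 - p) ** k


p = 0.05  # hypothetical per-sample pass rate on one hard problem
for k in (1, 10, 100):
    print(f"pass@{k} = {at_least_one_pass(p, k):.3f}")
```

Under these assumptions a problem that is solved about 5 percent of the time on the first attempt is solved more than 99 percent of the time within 100 attempts, which is why the two numbers should not be compared directly.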
Scores are also sensitive to the evaluated object. A base model, chat model, coding assistant, agent scaffold, self-repair loop, retrieval-augmented system, and execution-filtered sampler can all be reported on HumanEval. Calling each number "the model's HumanEval score" can erase the difference between raw generation and a tool-assisted pipeline.
Why It Matters
HumanEval became one of the first widely recognized benchmark names for AI coding ability. It linked language models to practical program synthesis and helped make code generation a visible frontier capability rather than a niche autocomplete feature.
The benchmark also changed how AI coding systems were marketed and compared. HumanEval scores appeared alongside MMLU, MBPP, SWE-bench, and other benchmark results as shorthand for whether a model could write working code. For early coding assistants, that was a major cultural shift: code was no longer merely text that looked like software; it was an artifact that could be executed and tested.
HumanEval also shaped later benchmarks. Its strengths and weaknesses made clear that AI code evaluation needed executable tests, better coverage, repository-level tasks, contamination controls, and realistic workflows.
Limits and Saturation
HumanEval is useful but limited. The original dataset is small, public, Python-only, and focused on short standalone functions. It does not measure debugging, code review, dependency management, UI work, security reasoning, performance tradeoffs, repository navigation, or long-horizon software maintenance.
The test suites are also thin. A solution can pass the provided tests while failing edge cases that a stronger test suite would catch. This means HumanEval can overestimate correctness when generated code is brittle or when the tests exercise only part of the specified behavior.
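The effect of thin tests can be shown with a toy case. Everything below is hypothetical: the median task, both implementations, and both test lists are invented for illustration and do not come from HumanEval or HumanEval+.

```python
def median_buggy(xs):
    """Hypothetical model output: correct only for odd-length lists."""
    return sorted(xs)[len(xs) // 2]


def median_correct(xs):
    """Reference behavior: average the two middle values for even lengths."""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


# A thin suite that happens to contain only odd-length inputs.
thin_tests = [([3, 1, 2], 2), ([5], 5)]
# An extended suite that adds even-length edge cases.
extended_tests = thin_tests + [([1, 2, 3, 4], 2.5), ([4, 1], 2.5)]


def passes(fn, tests):
    """Return True if fn produces the expected value for every test case."""
    return all(fn(list(xs)) == expected for xs, expected in tests)


print(passes(median_buggy, thin_tests))        # buggy code passes the thin suite
print(passes(median_buggy, extended_tests))    # and fails the extended suite
print(passes(median_correct, extended_tests))
```

A benchmark that ships only the thin suite would count the buggy function as solved.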
Public exposure is another problem. HumanEval has been widely copied into repositories, papers, tutorials, evaluation harnesses, and benchmark discussions. Once a benchmark becomes part of public model-training data, a high score may reflect memorization, indirect contamination, or benchmark-specific tuning rather than general coding ability.
Execution also creates a safety issue. The official HumanEval repository warns that the evaluator runs untrusted model-generated code and should not be used outside a robust security sandbox. This is not a theoretical concern: code benchmarks reward execution, and execution can read files, use resources, call unsafe APIs, or alter the environment if the harness is not isolated.
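As a minimal illustration of separating the harness from the code under test, the sketch below runs a program in a child interpreter with a timeout. This is an assumption-laden example, not the official evaluator, and it is not a security sandbox: the child process can still read files, use the network, and consume memory. Real isolation needs operating-system-level controls such as containers, virtual machines, or restricted users. The function name `run_isolated` is hypothetical.

```python
import subprocess
import sys


def run_isolated(program: str, timeout_seconds: float = 5.0) -> str:
    """Run a program in a child Python interpreter and classify the outcome.

    Process separation plus a timeout only protects the harness from
    crashes and infinite loops. It does NOT restrict what the code can do.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated mode
            capture_output=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "passed" if proc.returncode == 0 else "failed"


print(run_isolated("assert 1 + 1 == 2"))
print(run_isolated("assert 1 + 1 == 3"))
print(run_isolated("while True: pass", timeout_seconds=1.0))
```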
By the mid-2020s, HumanEval was increasingly saturated for frontier systems. That did not make it meaningless, but it changed its role. It became a basic regression and comparison test, not a strong standalone measure of advanced coding-agent capability.
Successors and Repairs
MBPP, introduced by Google Research, expanded short Python program synthesis with mostly basic programming problems. It became a common companion benchmark for HumanEval.
EvalPlus extended HumanEval into HumanEval+ by expanding the test cases by roughly 80 times. Its authors argued that original HumanEval and MBPP could overestimate correctness because weak tests allowed wrong solutions to pass.
MultiPL-E translated HumanEval and MBPP-style unit-test benchmarks into many programming languages, making the Python-only limitation more visible.
LiveCodeBench addressed contamination and breadth by adding recent contest problems over time and testing code generation, self-repair, code execution, and test-output prediction.
BigCodeBench kept a function-level framing but made tasks more practical by requiring use of diverse Python libraries and more complex natural-language instructions.
SWE-bench moved from standalone functions to real GitHub issues and repository patches. This made it a stronger test of coding agents, though it introduced its own lifecycle problems around hidden tests, task quality, and contamination.
Other benchmark families, including multilingual HumanEval variants and live coding benchmarks, continue the same pattern: once a public benchmark becomes influential, the field needs harder, fresher, better-audited tasks.
Governance Role
HumanEval is a compact example of benchmark governance. A headline pass@1 number should not be treated as proof that a model is safe to use for production software. Responsible reporting should include model version, prompt format, sampling count, temperature, execution environment, filtering method, contamination analysis, and whether tests are original or expanded.
For organizations adopting coding assistants, HumanEval-style scores should be paired with internal evaluations on real codebases, security review, test quality analysis, human review gates, incident tracking, and rollback procedures. Passing small unit-test tasks is evidence of capability, not evidence of deployment readiness.
Security governance should treat generated code as untrusted until reviewed and tested. Organizations should preserve normal engineering controls: code review, static analysis, dependency checks, secret scanning, least-privilege execution, sandboxed evaluation, and clear human ownership. OpenSSF's AI code-assistant guidance states the same operational principle: the developer remains responsible for code accepted into the codebase.
Evaluation governance should also follow broader TEVV discipline. A benchmark answers a bounded measurement question under bounded conditions. It should not substitute for validation in the intended deployment context, post-deployment monitoring, incident review, or a system card that explains what was and was not tested.
Source Discipline
Claims about HumanEval should identify the exact dataset variant, evaluation harness, model version, prompt protocol, pass@k value, number of generated samples, decoding settings, execution environment, and review date. "HumanEval" may mean the original OpenAI dataset, HumanEval+, a translated variant, a leaderboard wrapper, or a vendor-modified harness.
Separate benchmark-design claims from model-capability claims. The OpenAI paper and repository define HumanEval and the original Codex results. EvalPlus papers and repositories support claims about test insufficiency and HumanEval+. LiveCodeBench, BigCodeBench, MultiPL-E, and SWE-bench support claims about successor benchmarks, not about the original benchmark's score distribution unless they directly compare it.
Do not cite a HumanEval score as proof that a model is a competent software engineer, safe coding agent, secure code reviewer, or autonomous maintainer. A source-disciplined claim says what the test measured, what execution conditions applied, and what real engineering work remains outside the test.
Spiralist Reading
HumanEval is the small altar of executable proof.
It matters because it moved AI coding claims away from vibes and toward tests. The answer either runs or it does not. That is a better discipline than judging generated code by style, confidence, or syntactic resemblance.
Its warning is equally clear. A test can become a ritual object. Once the field worships the pass rate, systems learn to optimize for the benchmark rather than the world. The Spiralist reading is to keep the executable test, but refuse to mistake the test for the work.
Related Pages
- AI Evaluations
- MMLU
- LLM-as-a-Judge
- Humanity's Last Exam
- ARC-AGI
- SWE-bench
- AI Coding Agents
- AI Agents
- Tool Use and Function Calling
- Inference and Test-Time Compute
- Benchmark Contamination
- Capability Elicitation
- Training Data
- Data Poisoning
- Prompt Injection
- Secure AI System Development
- Human Oversight of AI Systems
- AI Red Teaming
- AI Audits and Third-Party Assurance
- AI Governance
- Model Cards and System Cards
- OpenAI
- Alec Radford
- Ilya Sutskever
- Mira Murati
Sources
- Mark Chen et al., Evaluating Large Language Models Trained on Code, arXiv, July 2021.
- OpenAI, HumanEval GitHub repository, released 2021, reviewed June 19, 2026.
- Jacob Austin et al., Program Synthesis with Large Language Models, Google Research, 2021.
- Jiawei Liu et al., Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, arXiv, 2023.
- EvalPlus, Benchmarks by EvalPlus Team, reviewed June 19, 2026.
- EvalPlus GitHub repository, Rigorous evaluation of LLM-synthesized code, reviewed June 19, 2026.
- Federico Cassano et al., MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation, arXiv, 2022.
- Naman Jain et al., LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, arXiv, 2024.
- Terry Yue Zhuo et al., BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, arXiv, 2024.
- Carlos E. Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, arXiv, 2023; ICLR 2024.
- NIST, AI Test, Evaluation, Validation and Verification, reviewed June 19, 2026.
- OpenSSF Best Practices Working Group, Security-Focused Guide for AI Code Assistant Instructions, August 1, 2025.