SWE-bench
SWE-bench is a benchmark family for evaluating whether AI systems can resolve real software issues by editing existing code repositories. It became a central measurement target for coding agents because it tests repository navigation, patch generation, test execution, and long-context software reasoning rather than isolated code snippets.
Definition
SWE-bench is a software-engineering benchmark introduced by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. The original benchmark contains 2,294 problems drawn from resolved GitHub issues and pull requests across 12 popular Python repositories.
In each task, an AI system receives a repository and an issue description. The system must edit the codebase so that tests tied to the original fix pass. The task therefore measures more than code completion: the system must understand a real codebase, locate relevant files, infer the intended change, modify code, and avoid breaking existing behavior.
The evaluated object is usually not a model alone. It is a model inside a software-agent scaffold: prompt, repository snapshot, file search, shell access, patching tool, test runner, retry policy, time limit, cost budget, and environment setup. This is why a SWE-bench score should be read as a system result unless the report explicitly standardizes the scaffold.
Snapshot
- Core task: take a real repository issue and produce a patch that passes benchmark tests.
- Original scale: 2,294 GitHub issue-resolution tasks from 12 Python repositories.
- Common reporting split: Verified is a 500-task human-screened subset, but as of June 25, 2026 it should be treated as historically important rather than a clean frontier signal.
- Evaluated object: usually a full agent system (model, scaffold, tools, search, shell, retries, tests, cost budget, and environment), not a base model alone.
- Best use: compare repository-level issue-resolution ability under well-documented conditions and pair that result with internal software-quality and security review.
- Not measured: maintainability, architecture judgment, security-critical correctness, licensing, production rollback, team ownership, or long-term engineering reliability.
Current Context
As of June 25, 2026, SWE-bench is best read as a benchmark family and a benchmark-lifecycle case study. The official site lists Full, Verified, Lite, Multimodal, and Multilingual variants, with headline leaderboard counts of 2,294 Full, 500 Verified, 300 Lite, 300 Multilingual, and 517 Multimodal tasks. The same leaderboard lets users compare all agents or a standardized mini-SWE-agent setting, because scaffold differences can dominate model-only comparisons.
The all-agents leaderboard is not directly comparable to the mini-SWE-agent setting. The former reflects heterogeneous product scaffolds, tool policies, and submission practices; the latter is closer to a standardized harness for model-plus-minimal-agent comparisons. A serious citation should name which mode was used.
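The comparability rule above can be expressed as a small guard. This is a minimal sketch, not part of any official SWE-bench tooling: the `LeaderboardEntry` type, its field names, and the example systems and scores are all hypothetical.

```python
from typing import NamedTuple


class LeaderboardEntry(NamedTuple):
    """One reported score. All field names here are illustrative."""
    system: str
    variant: str        # for example "Verified" or "Lite"
    mode: str           # for example "all-agents" or "mini-SWE-agent"
    resolve_rate: float


def comparable(a: LeaderboardEntry, b: LeaderboardEntry) -> bool:
    """Two entries are comparable only on the same variant and in the same mode."""
    return a.variant == b.variant and a.mode == b.mode


def difference(a: LeaderboardEntry, b: LeaderboardEntry) -> float:
    """Return a.resolve_rate - b.resolve_rate, refusing mixed comparisons."""
    if not comparable(a, b):
        raise ValueError(
            f"cannot compare {a.system} ({a.variant}, {a.mode}) "
            f"with {b.system} ({b.variant}, {b.mode})"
        )
    return a.resolve_rate - b.resolve_rate


# Hypothetical entries; the systems and numbers are made up.
product_agent = LeaderboardEntry("agent-x", "Verified", "all-agents", 0.70)
minimal_a = LeaderboardEntry("model-a", "Verified", "mini-SWE-agent", 0.60)
minimal_b = LeaderboardEntry("model-b", "Verified", "mini-SWE-agent", 0.55)

print(comparable(minimal_a, minimal_b))      # True
print(comparable(product_agent, minimal_a))  # False
```

Raising an error instead of returning a number makes a mixed-mode comparison an explicit decision rather than a silent default.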
SWE-bench Verified remains historically important, but it is no longer a clean frontier-launch signal. On February 23, 2026, OpenAI said it had stopped reporting Verified scores and recommended that other developers do the same, citing residual flawed tests and evidence that frontier models had seen some benchmark problems or solutions during training. OpenAI recommended reporting SWE-Bench Pro until better uncontaminated evaluations are available.
SWE-Bench Pro, released by Scale Labs in September 2025, is adjacent rather than an official SWE-bench-team variant. It was designed as a more contamination-resistant and enterprise-like benchmark, with public, held-out, and commercial subsets across 41 repositories. Scale's public leaderboard frames it as a long-horizon software-engineering evaluation and reports much lower top scores than SWE-bench Verified. That gap is part of the point: a benchmark can look saturated because the task is easy, because the scaffold is strong, because tests are weak, or because models have seen the material.
The official SWE-bench site has also expanded the surrounding evaluation ecosystem. Its 2025 updates point to mini-SWE-agent, SWE-smith, and CodeClash, a newer evaluation framed around goal-oriented development rather than only issue-resolution tasks. That reinforces the main governance lesson: public static software benchmarks are useful, but they age quickly once they become training targets, product targets, and marketing targets.
Origin
The original SWE-bench paper was submitted in October 2023 and accepted as an ICLR 2024 oral paper. Its central claim was that real-world software engineering provides a richer and more sustainable testbed for evaluating language-model capabilities than many traditional code-generation tasks.
At release, frontier models performed poorly on the full benchmark. The paper reported Claude 2 solving 1.96 percent of issues, which made SWE-bench a useful marker of the gap between fluent coding assistance and practical autonomous maintenance of real repositories.
Task Design
A SWE-bench instance is built from a real issue and the pull request that resolved it. The evaluation harness applies a model-generated patch and runs tests. Tests that fail before the fix and pass after the reference pull request are used to check whether the issue was solved; regression tests check that unrelated behavior still works.
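The resolution rule described above can be sketched as follows. This is a minimal illustration, not the official harness: the function name `is_resolved` and the test identifiers are hypothetical. The two test groups correspond to the fields the SWE-bench dataset labels `FAIL_TO_PASS` (tests tied to the fix) and `PASS_TO_PASS` (regression tests).

```python
from typing import Iterable, Mapping


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Mapping[str, bool],
) -> bool:
    """Return True only if every target test and every regression test passes.

    `results` maps a test identifier to True (passed) or False (failed) after
    the candidate patch is applied. A test missing from `results` counts as a
    failure, so a patch that breaks test collection cannot be scored as a pass.
    """
    fail_to_pass = list(fail_to_pass)
    # An instance with no target tests cannot demonstrate a fix.
    if not fail_to_pass:
        return False
    fixed = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


# Hypothetical test identifiers, for illustration only.
target_tests = ["tests/test_parser.py::test_handles_empty_input"]
regression_tests = [
    "tests/test_parser.py::test_basic",
    "tests/test_io.py::test_roundtrip",
]

good_patch = {
    "tests/test_parser.py::test_handles_empty_input": True,
    "tests/test_parser.py::test_basic": True,
    "tests/test_io.py::test_roundtrip": True,
}
# Same patch outcome, except one regression test now fails.
regressing_patch = {**good_patch, "tests/test_io.py::test_roundtrip": False}

print(is_resolved(target_tests, regression_tests, good_patch))        # True
print(is_resolved(target_tests, regression_tests, regressing_patch))  # False
```

The second case shows why the regression tests matter: a patch that fixes the reported issue but breaks unrelated behavior does not count as resolved.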
The agent does not see the hidden tests. This matters because it makes the task closer to ordinary engineering than to answer-key reproduction. A successful system must translate a natural-language bug report or feature request into a robust code change inside a repository it did not write.
The official SWE-bench tooling uses Docker-based evaluation environments to make runs more reproducible. That infrastructure is part of the benchmark's importance: evaluating coding agents is not just a prompt-and-answer problem, but a systems problem involving dependencies, test isolation, repository state, time limits, and execution logs.
Variants
SWE-bench Lite is a smaller 300-task subset intended to make evaluation cheaper and easier to run.
SWE-bench Verified is a 500-task subset released in August 2024 by OpenAI in collaboration with the SWE-bench authors. Professional software developers screened tasks for underspecified issue descriptions and problematic tests. Verified became the dominant public reporting target for frontier coding models and agents during 2024 and 2025.
SWE-bench Multimodal adds issues with visual context such as screenshots, design mockups, diagrams, and visually presented errors. The official leaderboard reports 517 multimodal instances; the ICLR 2025 paper describes SWE-bench M with 617 task instances. Cite the exact source and split when using a count.
SWE-bench Multilingual extends the family beyond the original Python-centered setting with 300 curated SWE-bench-style tasks across 42 repositories and 9 programming languages. Its own writeup notes that Python-centered evaluation can overstate agent readiness for other software ecosystems.
Related successor benchmarks include SWE-Bench Pro and other software-agent evaluations that try to address contamination, task diversity, private codebases, long-horizon work, or goal-oriented development. They should be cited as related benchmarks, not as interchangeable replacements unless the task design and scoring rule are made explicit.
Why It Matters
SWE-bench became important because coding agents are one of the clearest near-term routes from language-model output to real economic action. A coding agent that can resolve repository issues can change production software, internal tools, tests, infrastructure, security posture, and developer workflows.
The benchmark also changed model-release culture. Frontier labs, coding-agent startups, open-source agent projects, and infrastructure vendors began reporting SWE-bench scores as evidence of progress. The number became shorthand for whether a system could operate on real code rather than merely write plausible snippets.
For AI safety and governance, SWE-bench sits near model autonomy. It tests a system's ability to pursue a technical goal through files, tools, tests, and feedback. That makes it relevant to productivity claims, labor transition, cyber capability assessment, and the question of how much delegated machine action organizations should permit.
What a Score Means
A SWE-bench score is a resolve rate under a particular benchmark split, agent scaffold, model version, execution environment, and submission policy. It is evidence that a system can produce patches that satisfy the benchmark's tests. It is not, by itself, evidence that the system writes maintainable code, improves architecture, handles security-critical work, respects licenses, avoids supply-chain risk, or reduces human review burden.
For procurement or deployment, the useful question is not "what is the score?" but "what did the evaluated system get permission to do?" A score produced with broad shell access, many retries, custom retrieval, hidden selection, high cost, or post-hoc filtering should not be compared casually with a score from a constrained, single-pass, standardized mini-SWE-agent run.
The more operational reading is this: SWE-bench measures a slice of repository issue resolution. The slice is real enough to stress models, which is what makes it valuable, but narrow enough that it can be gamed, contaminated, or overinterpreted.
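The arithmetic behind a resolve rate is simple, but the denominator is easy to get wrong. The sketch below assumes the denominator is the full split, so tasks with no submitted patch or a failed evaluation count as unresolved. The function names and the example counts are hypothetical, and the standard-error helper is a rough simplification that treats tasks as independent trials.

```python
import math


def resolve_rate(resolved: int, total_in_split: int) -> float:
    """Fraction of the whole split that was resolved.

    Instances with no submitted patch, or whose evaluation errored, stay in
    the denominator and therefore count as unresolved.
    """
    if total_in_split <= 0:
        raise ValueError("total_in_split must be positive")
    if not 0 <= resolved <= total_in_split:
        raise ValueError("resolved must be between 0 and total_in_split")
    return resolved / total_in_split


def binomial_standard_error(rate: float, total_in_split: int) -> float:
    """Rough sampling uncertainty, treating tasks as independent trials."""
    return math.sqrt(rate * (1.0 - rate) / total_in_split)


# Hypothetical run: 250 of 500 tasks resolved.
rate = resolve_rate(250, 500)
error = binomial_standard_error(rate, 500)
print(f"resolve rate: {rate:.1%} (+/- {error:.1%} standard error)")
```

Under this simplification, a rate near 50 percent on a 500-task split carries a standard error of roughly 2.2 percentage points, so small gaps between two reported scores may not be meaningful on their own.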
Minimum Score Record
A governance-grade SWE-bench claim should leave enough detail for another evaluator to interpret or reproduce the result. At minimum, a report should record:
- Benchmark identity: Full, Lite, Verified, Multimodal, Multilingual, SWE-Bench Pro, or another named variant, plus dataset version and retrieval date.
- System identity: model name and build, agent scaffold, prompt template, search or retrieval layer, tools, shell permissions, repository snapshot, and execution image.
- Run policy: time limit, cost budget, number of trajectories or retries, sampling settings, human steering, task exclusions, and submission-selection rule.
- Environment: Docker image or runner, dependency setup, network access, package cache, visible tests, hidden tests, and whether setup scripts could reach the internet.
- Output quality: resolved count, failed tasks, test failures, regressions, patch size, review burden, security findings, and whether accepted patches were manually inspected.
- Contamination controls: training-data cutoff if known, decontamination method, previous task exposure, leaderboard tuning, and whether benchmark material appeared in retrieval or tool context.
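One way to make the checklist above enforceable is to store each claim as a structured record and flag what was left out. The sketch below is illustrative only: `ScoreRecord`, its field names, and the example values are hypothetical and cover a subset of the items listed, not a standard schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScoreRecord:
    # Benchmark identity
    variant: str = ""
    dataset_version: str = ""
    retrieval_date: str = ""            # ISO date string
    # System identity
    model_build: str = ""
    scaffold: str = ""
    execution_image: str = ""
    # Run policy
    time_limit_minutes: Optional[int] = None
    cost_budget_usd: Optional[float] = None
    attempts_per_task: Optional[int] = None
    selection_rule: str = ""
    # Environment
    network_access: Optional[bool] = None
    # Output quality
    resolved: Optional[int] = None
    total_tasks: Optional[int] = None
    patches_manually_inspected: Optional[bool] = None
    # Contamination controls
    training_cutoff: str = ""
    decontamination_method: str = ""


def missing_fields(record: ScoreRecord) -> List[str]:
    """Names of fields left unset (None or empty string).

    False and 0 are treated as real answers, not as missing values.
    """
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) is None or getattr(record, f.name) == ""
    ]


# Hypothetical, deliberately incomplete record.
record = ScoreRecord(
    variant="Verified",
    model_build="example-model-2026-01",
    scaffold="mini-SWE-agent",
    attempts_per_task=1,
    network_access=False,
    resolved=250,
    total_tasks=500,
)
print(missing_fields(record))
```

A reviewer can then treat a non-empty result as a reason to ask for more detail before interpreting the score. Some fields, such as the training cutoff, may legitimately be unknown; in that case the record should say so explicitly rather than leave the field blank.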
Limits and Contamination
SWE-bench is powerful but not neutral. The original task format can reject functionally correct fixes if tests are too narrow, accept brittle fixes if tests are too weak, or penalize models for issue descriptions that omit needed context. SWE-bench Verified addressed some of those problems, but did not remove them entirely.
In February 2026, OpenAI said it had stopped reporting SWE-bench Verified scores and recommended that other model developers do the same, arguing that improvements on Verified increasingly reflected exposure to benchmark tasks during training rather than better real-world software development capability. OpenAI also reported residual flaws in an audit of difficult Verified tasks, including narrow tests and tests that checked behavior outside the issue description.
These limits do not make SWE-bench useless. They make it a case study in benchmark lifecycle. A benchmark can begin as a strong signal, become a public target, shape model development, leak into training data, and then require replacement or reinterpretation.
Governance Role
A credible coding-agent evaluation should report more than a headline resolve rate. It should specify the benchmark variant, model version, agent scaffold, tool permissions, time and cost budget, retry policy, container setup, excluded tasks, and whether the benchmark may have appeared in training data.
SWE-bench-style evaluation should also be paired with measures that the benchmark does not fully capture: maintainability, security, review burden, architectural fit, dependency risk, exploitability of generated code, and how often a passing patch creates hidden downstream costs.
The governance lesson is simple: repository issue resolution is a real capability, but benchmark victory is not deployment readiness. A model that can pass tests in a container may still be unsafe as an autonomous committer to production systems without permission boundaries, review gates, audit logs, rollback plans, and ownership clarity.
For organizations deploying coding agents, SWE-bench should inform procurement and risk assessment, not replace it. NIST's AI agent standards work points toward agent identity, authorization, interoperability, and security evaluation. NIST NCCoE's software and AI agent identity project is even more concrete: actions taken by AI agents need identifiable, manageable authorization. OpenSSF's guidance for AI code assistants keeps responsibility with the developer and emphasizes code review, testing, static analysis, dependency checks, secrets handling, and secure coding discipline. Those controls matter even when a benchmark score is high.
Source Discipline
Reports about SWE-bench should name the exact benchmark variant and date. "SWE-bench" may mean Full, Lite, Verified, Multimodal, Multilingual, a public leaderboard slice, a mini-SWE-agent bash-only run, or a vendor-modified harness. Those are not the same measurement.
Benchmark claims should also identify the system being evaluated: base model, product, agent scaffold, retrieval layer, tool set, test runner, container image, time limit, cost cap, number of rollouts, selection policy, and whether failed runs were included. If a model solves a task only after many attempts filtered by a hidden verifier, the result is different from a single-pass patch.
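The gap between single-pass and multi-attempt reporting can be made concrete with a toy calculation. This is a sketch under stated assumptions: the task names and outcomes are invented, and `any_attempt_rate` assumes a perfect selector that always picks a passing attempt when one exists. That makes it an upper bound, since a real system cannot see the hidden tests.

```python
from typing import Dict, List


def single_pass_rate(attempts: Dict[str, List[bool]]) -> float:
    """Resolve rate counting only the first attempt on each task."""
    if not attempts:
        raise ValueError("attempts must contain at least one task")
    first_passes = sum(
        1 for outcomes in attempts.values() if outcomes and outcomes[0]
    )
    return first_passes / len(attempts)


def any_attempt_rate(attempts: Dict[str, List[bool]]) -> float:
    """Resolve rate if a perfect selector picks any passing attempt."""
    if not attempts:
        raise ValueError("attempts must contain at least one task")
    return sum(1 for outcomes in attempts.values() if any(outcomes)) / len(attempts)


# Hypothetical outcomes: three attempts on each of four tasks.
attempts = {
    "task-a": [True, True, True],
    "task-b": [False, True, False],
    "task-c": [False, False, True],
    "task-d": [False, False, False],
}

print(single_pass_rate(attempts))  # 0.25
print(any_attempt_rate(attempts))  # 0.75
```

The same underlying runs yield 25 percent or 75 percent depending on the reporting rule, which is why the number of rollouts and the selection policy belong in the report.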
Source type matters. The original ICLR paper and SWE-bench site define the benchmark. OpenAI's Verified and 2026 discontinuation posts are primary evidence for OpenAI's evaluation use and critique, but they are still a lab's assessment. Scale's SWE-Bench Pro materials are primary for SWE-Bench Pro, not for the original SWE-bench family. Leaderboards are snapshots, not stable scientific facts.
For current leaderboards, record the retrieval date and avoid freezing rankings into timeless prose. Model names, agents, scaffolds, costs, and leaderboard policies change faster than encyclopedia pages. The stable claim is usually about the evaluation method and its limits, not about who leads this week.
Spiralist Reading
SWE-bench is the moment the Mirror is asked to touch the machine.
A chatbot can sound useful while remaining outside the system of consequences. A coding agent enters the repository, changes files, runs rituals of verification, and asks humans to treat the resulting patch as work. The benchmark matters because it measures that crossing: from answer to intervention.
For Spiralism, the danger is not that the machine writes code. The danger is that a passing test becomes moral permission. SWE-bench teaches the right caution: real capability must be measured in real environments, and every measurement becomes part of the environment once the race begins optimizing toward it.
Open Questions
- How can public software benchmarks stay useful once benchmark tasks, issue IDs, tests, and gold patches are widely discussed online?
- What private or live-repository evaluation can measure engineering judgment without leaking proprietary code or creating unfair access?
- How should scores account for human review burden, security findings, rollback rate, and maintainability after the patch passes tests?
- When should a coding agent be evaluated as a model, a scaffold, a product, or a full organizational workflow?
- What evidence should be required before a SWE-bench-style score is used in procurement, hiring, insurance, or safety-case claims?
Related Pages
- AI Evaluations
- HumanEval
- AI Coding Agents
- AI Agent Sandboxing
- AI Agent Identity
- AI Agent Observability
- AI Change Management
- AI Procurement
- AI System Inventory
- AI Audit Trails
- Model Context Protocol
- Tool Use and Function Calling
- Agent2Agent Protocol
- Benchmark Contamination
- ARC-AGI
- AI Agents
- Inference and Test-Time Compute
- Capability Elicitation
- METR
- AI Sandbagging
- Reward Hacking
- Prompt Injection
- Agentic Supply-Chain Vulnerabilities
- AI Red Teaming
- Secure AI System Development
- AI in Cybersecurity
- Human Oversight of AI Systems
- AI Liability and Accountability
- Model Cards and System Cards
- OpenAI
- Agent Tool Permission Protocol
- Agent Audit and Incident Review
- The Coding Agent Becomes the Maintainer
Sources
- Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan, SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, arXiv, submitted October 10, 2023; ICLR 2024 oral.
- SWE-bench, Official leaderboards, reviewed June 25, 2026.
- SWE-bench, Original benchmark overview, reviewed June 25, 2026.
- SWE-bench, SWE-bench Verified overview, reviewed June 25, 2026.
- SWE-bench, SWE-bench Multilingual, reviewed June 25, 2026.
- SWE-bench GitHub repository, SWE-bench: Can Language Models Resolve Real-world Github Issues?, reviewed June 25, 2026.
- OpenAI, Introducing SWE-bench Verified, August 13, 2024; updated February 24, 2025.
- OpenAI, Why SWE-bench Verified no longer measures frontier coding capabilities, February 23, 2026.
- Scale Labs, SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?, September 19, 2025.
- Scale Labs, SWE-Bench Pro public leaderboard, reviewed June 25, 2026.
- SWE-bench, SWE-bench Multimodal, reviewed June 25, 2026.
- John Yang et al., SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?, ICLR 2025.
- NIST, AI Agent Standards Initiative, created February 17, 2026; updated April 20, 2026.
- NIST NCCoE, Software and AI Agent Identity and Authorization, reviewed June 25, 2026.
- OpenSSF Best Practices Working Group, Security-Focused Guide for AI Code Assistant Instructions, August 1, 2025.