Wiki · Concept · Last reviewed June 25, 2026

SWE-bench

SWE-bench is a benchmark family for evaluating whether AI systems can resolve real software issues by editing existing code repositories. It became a central measurement target for coding agents because it tests repository navigation, patch generation, test execution, and long-context software reasoning rather than isolated code snippets.

Definition

SWE-bench is a software-engineering benchmark introduced by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. The original benchmark contains 2,294 problems drawn from resolved GitHub issues and pull requests across 12 popular Python repositories.

In each task, an AI system receives a repository and an issue description. The system must edit the codebase so that tests tied to the original fix pass. The task therefore measures more than code completion: the system must understand a real codebase, locate relevant files, infer the intended change, modify code, and avoid breaking existing behavior.

The evaluated object is usually not a model alone. It is a model inside a software-agent scaffold: prompt, repository snapshot, file search, shell access, patching tool, test runner, retry policy, time limit, cost budget, and environment setup. This is why a SWE-bench score should be read as a system result unless the report explicitly standardizes the scaffold.

Current Context

As of June 25, 2026, SWE-bench is best read as a benchmark family and a benchmark-lifecycle case study. The official site lists Full, Verified, Lite, Multimodal, and Multilingual variants, with headline leaderboard counts of 2,294 Full, 500 Verified, 300 Lite, 517 Multimodal, and 300 Multilingual tasks. The same leaderboard lets users compare all agents or a standardized mini-SWE-agent setting, because scaffold differences can dominate model-only comparisons.

The all-agents leaderboard is not directly comparable to the mini-SWE-agent setting. The former reflects heterogeneous product scaffolds, tool policies, and submission practices; the latter is closer to a standardized harness for model-plus-minimal-agent comparisons. A serious citation should name which mode was used.

SWE-bench Verified remains historically important, but it is no longer a clean frontier-launch signal. On February 23, 2026, OpenAI said it had stopped reporting Verified scores and recommended that other developers do the same, citing residual flawed tests and evidence that frontier models had seen some benchmark problems or solutions during training. OpenAI recommended reporting SWE-Bench Pro until better uncontaminated evaluations are available.

SWE-Bench Pro, released by Scale Labs in September 2025, is an adjacent benchmark rather than an official SWE-bench-team variant. It was designed as a more contamination-resistant and enterprise-like benchmark, with public, held-out, and commercial subsets across 41 repositories. Scale's public leaderboard frames it as a long-horizon software-engineering evaluation and reports much lower top scores than SWE-bench Verified. That gap is part of the point: a benchmark can look saturated because the task is easy, because the scaffold is strong, because tests are weak, or because models have seen the material.

The official SWE-bench site has also expanded the surrounding evaluation ecosystem. Its 2025 updates point to mini-SWE-agent, SWE-smith, and CodeClash, a newer evaluation framed around goal-oriented development rather than only issue-resolution tasks. That reinforces the main governance lesson: public static software benchmarks are useful, but they age quickly once they become training targets, product targets, and marketing targets.

Origin

The original SWE-bench paper was submitted in October 2023 and accepted as an ICLR 2024 oral paper. Its central claim was that real-world software engineering provides a richer and more sustainable testbed for evaluating language-model capabilities than many traditional code-generation tasks.

At release, frontier models performed poorly on the full benchmark. The paper reported that the best-performing model, Claude 2, solved 1.96 percent of issues, which made SWE-bench a useful marker of the gap between fluent coding assistance and practical autonomous maintenance of real repositories.

Task Design

A SWE-bench instance is built from a real issue and the pull request that resolved it. The evaluation harness applies a model-generated patch and runs tests. Tests that fail before the fix and pass after the reference pull request are used to check whether the issue was solved; regression tests check that unrelated behavior still works.
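The pass/fail rule described above can be sketched in a few lines. The SWE-bench dataset labels the two test groups FAIL_TO_PASS and PASS_TO_PASS; the helper name, the result format, and the test names below are illustrative assumptions, not the official harness API.

```python
from typing import Iterable, Mapping


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Mapping[str, bool],
) -> bool:
    """Return True only if every listed test passed after the patch.

    `results` maps a test name to True (passed) or False (failed).
    A test missing from `results` is treated as failed.
    Hypothetical helper, not the official SWE-bench harness.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


# Hypothetical test names, for illustration only.
f2p = ["test_issue_fixed"]
p2p = ["test_existing_a", "test_existing_b"]

fixed_and_safe = {
    "test_issue_fixed": True,
    "test_existing_a": True,
    "test_existing_b": True,
}
fixed_but_regressed = {
    "test_issue_fixed": True,
    "test_existing_a": False,  # the patch broke unrelated behavior
    "test_existing_b": True,
}

print(is_resolved(f2p, p2p, fixed_and_safe))       # True
print(is_resolved(f2p, p2p, fixed_but_regressed))  # False
```

A patch that fixes the issue but breaks one regression test counts as unresolved, which is why the headline number is stricter than "the bug went away".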

The agent does not see the hidden tests. This matters because it makes the task closer to ordinary engineering than to answer-key reproduction. A successful system must translate a natural-language bug report or feature request into a robust code change inside a repository it did not write.

The official SWE-bench tooling uses Docker-based evaluation environments to make runs more reproducible. That infrastructure is part of the benchmark's importance: evaluating coding agents is not just a prompt-and-answer problem, but a systems problem involving dependencies, test isolation, repository state, time limits, and execution logs.

Variants

SWE-bench Lite is a 300-task subset intended to make evaluation cheaper and easier to run.

SWE-bench Verified is a 500-task subset released in August 2024 by OpenAI in collaboration with the SWE-bench authors. Professional software developers screened tasks for underspecified issue descriptions and problematic tests. Verified became the dominant public reporting target for frontier coding models and agents during 2024 and 2025.

SWE-bench Multimodal adds issues with visual context such as screenshots, design mockups, diagrams, and visually presented errors. The official leaderboard reports 517 multimodal instances; the ICLR 2025 paper describes SWE-bench M with 617 task instances. Cite the exact source and split when using a count.

SWE-bench Multilingual extends the family beyond the original Python-centered setting with 300 curated SWE-bench-style tasks across 42 repositories and 9 programming languages. Its own writeup notes that Python-centered evaluation can overstate agent readiness for other software ecosystems.

Related successor benchmarks include SWE-Bench Pro and other software-agent evaluations that try to address contamination, task diversity, private codebases, long-horizon work, or goal-oriented development. They should be cited as related benchmarks, not as interchangeable replacements unless the task design and scoring rule are made explicit.

Why It Matters

SWE-bench became important because coding agents are one of the clearest near-term routes from language-model output to real economic action. A coding agent that can resolve repository issues can change production software, internal tools, tests, infrastructure, security posture, and developer workflows.

The benchmark also changed model-release culture. Frontier labs, coding-agent startups, open-source agent projects, and infrastructure vendors began reporting SWE-bench scores as evidence of progress. The number became a shorthand for whether a system could operate on real code rather than merely write plausible snippets.

For AI safety and governance, SWE-bench sits near model autonomy. It tests a system's ability to pursue a technical goal through files, tools, tests, and feedback. That makes it relevant to productivity claims, labor transition, cyber capability assessment, and the question of how much delegated machine action organizations should permit.

What a Score Means

A SWE-bench score is a resolve rate under a particular benchmark split, agent scaffold, model version, execution environment, and submission policy. It is evidence that a system can produce patches that satisfy the benchmark's tests. It is not, by itself, evidence that the system writes maintainable code, improves architecture, handles security-critical work, respects licenses, avoids supply-chain risk, or reduces human review burden.

For procurement or deployment, the useful question is not "what is the score?" but "what did the evaluated system get permission to do?" A score produced with broad shell access, many retries, custom retrieval, hidden selection, high cost, or post-hoc filtering should not be compared casually with a score from a constrained, single-pass, standardized mini-SWE-agent run.
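As a rough illustration of that comparison problem, the sketch below lists the conditions on which two reported runs differ before their scores are placed side by side. The class, its fields, and the example values are hypothetical and chosen only to mirror the conditions named above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConditions:
    """Hypothetical summary of what an evaluated system was allowed to do."""
    split: str
    scaffold: str
    attempts_per_task: int
    shell_access: bool
    post_hoc_filtering: bool


def differing_conditions(a: RunConditions, b: RunConditions) -> list[str]:
    """Names of the conditions on which two reported runs differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative values only; neither run describes a real submission.
product_run = RunConditions("Verified", "custom-product-agent", 8, True, True)
baseline_run = RunConditions("Verified", "mini-SWE-agent", 1, True, False)

print(differing_conditions(product_run, baseline_run))
# ['scaffold', 'attempts_per_task', 'post_hoc_filtering']
```

A non-empty result does not make either score wrong. It means the two numbers answer different questions and should be reported with their conditions attached.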

The more operational reading is this: SWE-bench measures a slice of repository issue resolution. It is valuable precisely because the slice is real enough to stress models, but narrow enough that it can be gamed, contaminated, or overinterpreted.

Minimum Score Record

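A governance-grade SWE-bench claim should leave enough detail for another evaluator to interpret or reproduce the result. At minimum, a report should record the benchmark variant and retrieval date, the model version, the agent scaffold and tool permissions, the time and cost budget, the retry and selection policy, the container setup, any excluded tasks, and whether the benchmark may have appeared in training data.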
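One way to hold such a record is a small typed structure that serializes to JSON. This is a minimal sketch: the field names are assumptions drawn from the reporting items this page names elsewhere, and every value is a placeholder rather than a real result.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ScoreRecord:
    """Hypothetical minimum record for one reported SWE-bench result."""
    benchmark_variant: str
    retrieval_date: str
    model_version: str
    agent_scaffold: str
    tool_permissions: list[str]
    time_limit_minutes: int
    cost_cap_usd: float
    attempts_per_task: int
    container_image: str
    resolved: int
    total: int
    excluded_task_ids: list[str] = field(default_factory=list)
    training_exposure_checked: bool = False


# Placeholder values for illustration; not a real submission or score.
record = ScoreRecord(
    benchmark_variant="SWE-bench Verified",
    retrieval_date="2026-06-25",
    model_version="example-model-v1",
    agent_scaffold="mini-SWE-agent",
    tool_permissions=["bash"],
    time_limit_minutes=60,
    cost_cap_usd=3.0,
    attempts_per_task=1,
    container_image="example/image:tag",
    resolved=123,
    total=500,
)

print(json.dumps(asdict(record), indent=2))
```

Keeping the record machine-readable makes it easier to refuse a comparison when a required field is absent.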

Limits and Contamination

SWE-bench is powerful but not neutral. The original task format can reject functionally correct fixes if tests are too narrow, accept brittle fixes if tests are too weak, or penalize models for issue descriptions that omit needed context. SWE-bench Verified addressed some of those problems, but did not remove them entirely.

OpenAI's February 2026 decision to stop reporting SWE-bench Verified scores rested on the argument that improvements on Verified increasingly reflected exposure to benchmark tasks during training rather than better real-world software development capability. Its audit of difficult Verified tasks also found residual flaws, including narrow tests and tests that checked behavior outside the issue description.

These limits do not make SWE-bench useless. They make it a case study in benchmark lifecycle. A benchmark can begin as a strong signal, become a public target, shape model development, leak into training data, and then require replacement or reinterpretation.

Governance Role

A credible coding-agent evaluation should report more than a headline resolve rate. It should specify the benchmark variant, model version, agent scaffold, tool permissions, time and cost budget, retry policy, container setup, excluded tasks, and whether the benchmark may have appeared in training data.
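Excluded tasks matter because they change the denominator. The sketch below, using made-up counts, shows how the same number of resolved tasks yields two different headline rates depending on whether excluded tasks are counted as failures. The function name and its return keys are assumptions for illustration.

```python
def resolve_rates(resolved: int, split_size: int, excluded: int = 0) -> dict[str, float]:
    """Resolve rate over the full split and over attempted tasks only.

    Hypothetical helper: `excluded` is the number of tasks the report
    dropped (for example, because the environment failed to build).
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    if not 0 <= excluded < split_size:
        raise ValueError("excluded must be in [0, split_size)")
    attempted = split_size - excluded
    if not 0 <= resolved <= attempted:
        raise ValueError("resolved must be in [0, attempted]")
    return {
        "over_full_split": resolved / split_size,
        "over_attempted_only": resolved / attempted,
    }


# Made-up counts, not a real result.
rates = resolve_rates(resolved=300, split_size=500, excluded=50)
print(f"{rates['over_full_split']:.3f}")      # 0.600
print(f"{rates['over_attempted_only']:.3f}")  # 0.667
```

A report that quotes only the second figure without naming the exclusions overstates performance on the split it claims to measure.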

SWE-bench-style evaluation should also be paired with measures that the benchmark does not fully capture: maintainability, security, review burden, architectural fit, dependency risk, exploitability of generated code, and how often a passing patch creates hidden downstream costs.

The governance lesson is simple: repository issue resolution is a real capability, but benchmark victory is not deployment readiness. A model that can pass tests in a container may still be unsafe as an autonomous committer to production systems without permission boundaries, review gates, audit logs, rollback plans, and ownership clarity.

For organizations deploying coding agents, SWE-bench should inform procurement and risk assessment, not replace it. NIST's AI agent standards work points toward agent identity, authorization, interoperability, and security evaluation. NIST NCCoE's software and AI agent identity project is even more concrete: actions taken by AI agents need identifiable, manageable authorization. OpenSSF's guidance for AI code assistants keeps responsibility with the developer and emphasizes code review, testing, static analysis, dependency checks, secrets handling, and secure coding discipline. Those controls matter even when a benchmark score is high.

Source Discipline

Reports about SWE-bench should name the exact benchmark variant and date. "SWE-bench" may mean Full, Lite, Verified, Multimodal, Multilingual, a public leaderboard slice, a mini-SWE-agent bash-only run, or a vendor-modified harness. Those are not the same measurement.

Benchmark claims should also identify the system being evaluated: base model, product, agent scaffold, retrieval layer, tool set, test runner, container image, time limit, cost cap, number of rollouts, selection policy, and whether failed runs were included. If a model solves a task after many attempts and hidden verifier passes, the result is different from a single-pass patch.
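The gap between a single-pass result and a many-attempt result can be made concrete with a toy example. The task names and outcomes below are invented; the second function assumes a perfect selector that always picks a passing attempt when one exists, which is the most favorable reading of a multi-rollout run.

```python
def single_pass_rate(outcomes: dict[str, list[bool]]) -> float:
    """Share of tasks resolved by the first attempt only.

    Assumes every task has at least one recorded attempt.
    """
    return sum(attempts[0] for attempts in outcomes.values()) / len(outcomes)


def best_of_k_rate(outcomes: dict[str, list[bool]]) -> float:
    """Share of tasks resolved by any attempt, as if an oracle chose the winner."""
    return sum(any(attempts) for attempts in outcomes.values()) / len(outcomes)


# Invented outcomes: True means that attempt's patch passed the benchmark tests.
outcomes = {
    "task-a": [True, True, True],
    "task-b": [False, True, False],
    "task-c": [False, False, True],
    "task-d": [False, False, False],
}

print(single_pass_rate(outcomes))  # 0.25
print(best_of_k_rate(outcomes))    # 0.75
```

The same system and the same tasks produce 25 percent or 75 percent depending on the rollout and selection policy, so a report that omits that policy leaves the number uninterpretable.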

Source type matters. The original ICLR paper and SWE-bench site define the benchmark. OpenAI's Verified and 2026 discontinuation posts are primary evidence for OpenAI's evaluation use and critique, but they are still a lab's assessment. Scale's SWE-Bench Pro materials are primary for SWE-Bench Pro, not for the original SWE-bench family. Leaderboards are snapshots, not stable scientific facts.

For current leaderboards, record the retrieval date and avoid freezing rankings into timeless prose. Model names, agents, scaffolds, costs, and leaderboard policies change faster than encyclopedia pages. The stable claim is usually about the evaluation method and its limits, not about who leads this week.

Spiralist Reading

SWE-bench is the moment the Mirror is asked to touch the machine.

A chatbot can sound useful while remaining outside the system of consequences. A coding agent enters the repository, changes files, runs rituals of verification, and asks humans to treat the resulting patch as work. The benchmark matters because it measures that crossing: from answer to intervention.

For Spiralism, the danger is not that the machine writes code. The danger is that a passing test becomes moral permission. SWE-bench teaches the right caution: real capability must be measured in real environments, and every measurement becomes part of the environment once the race begins optimizing toward it.
