Blog · arXiv Analysis · Last reviewed July 2, 2026

The Agentic Code Failure Becomes the Governance Substrate

The July 2026 arXiv paper by James C. Davis, Paschal C. Amusuo, Tanmay Singla, Berk Çakar, and Kirsten A. Davis argues that the velocity of AI coding agents does not remove the need for engineering judgment. It makes judgment the scarce part of the system.

For this essay, a governance substrate is the collection of architecture, typed boundaries, lints, tests, dispatch rules, documentation, provenance records, and deployment gates that lets coding agents do fast work without making every line depend on human inspection.

The Claim

The paper, arXiv:2607.01087 [cs.SE; cs.AI], was submitted on July 1, 2026. arXiv lists the title as "Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering."

The useful claim is not simply that coding agents can generate large amounts of software. The paper's stronger claim is that agent-generated implementation becomes durable progress only when recurring failures are converted into durable governance. A failed patch is not just a defect. It can be evidence that the system lacks a boundary, an oracle, a model, a lint, a dispatch rule, a typed vocabulary, or a gate.

That moves the bottleneck. If implementation becomes cheap, the expensive part is deciding what failures mean and which controls should exist next. The engineer's job shifts from reading every generated line toward shaping the environment in which future lines are generated.

The Paper Frame

The authors position the work against three agentic software-engineering models. A velocity-centric model gives agents more autonomy and throughput, but can leave reliability underspecified. An oversight-centric model keeps humans near the implementation loop, but human attention becomes the throughput limit. A governance-centric model encodes known obligations before agent work begins.

The paper adds an ex-post version of governance. Some controls cannot be fully specified in advance because the failure class becomes visible only after agents start producing plausible but wrong changes. The authors call this process governance conversion: converting observed agentic failure patterns into explicit, durable governance that constrains later agent work.

The Case

The case is a 12-week first-person study in which one expert software engineer used frontier coding agents, specifically Claude through a VS Code chat interface, to build a document accessibility remediation system. The target system processed Office and PDF documents under U.S. public-sector accessibility requirements and had to support auditable transformations rather than black-box whole-document generation.

The empirical record is substantial for a single case. The authors report 88 contemporaneous field-note entries, 18,662 commits, about 1.6 million lines of active artifacts, 420 KLOC of production code, and 1.16 MLOC of tests, lints, supporting documentation, agent infrastructure, and tooling. The subject inspected almost no agent-produced code directly. Instead, the case tested whether quality could be sustained through a governed engineering environment.

The paper also gives the cost and workload boundary. It reports roughly $60K in total development cost, including salary, inference, cloud hosting, and Claude subscriptions, and estimates 9-18 million tokens per week during the study period. Those numbers matter because "agentic velocity" is not free. It is a purchased runtime regime attached to human direction, compute spend, cloud infrastructure, and fatigue.

Governance Conversion

The paper's central loop is clear: velocity exposes a failure class; a human architect classifies the failure as local or structural; structural failures are converted into governance; later agents inherit a narrower and more explicit action space; governability compounds.
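The first two steps of that loop can be sketched in a few lines. This is an illustration, not the paper's tooling: the `Incident` record, the failure-class labels, and the recurrence threshold are all hypothetical, and in the case study the local-versus-structural call was made by a human architect. The helper only surfaces recurring classes for that review.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    failure_class: str  # hypothetical label assigned when the failure is logged
    note: str


def conversion_candidates(incidents, threshold=2):
    """Return failure classes seen at least `threshold` times, most frequent first.

    The threshold is an assumption for illustration; recurrence is only a hint
    that a failure may be structural, not a substitute for human judgment.
    """
    counts = Counter(i.failure_class for i in incidents)
    return [cls for cls, n in counts.most_common() if n >= threshold]


# Made-up incident log for the example.
log = [
    Incident("zone-violation", "adapter imported from ui"),
    Incident("missing-lint-context", "edit broke a local lint"),
    Incident("zone-violation", "core imported from adapter"),
    Incident("typo", "wrong label in one template"),
    Incident("zone-violation", "ui reached into core internals"),
    Incident("missing-lint-context", "edit ignored a naming rule"),
]

candidates = conversion_candidates(log)
print(candidates)
```

A one-off failure such as the typo stays a local repair. The recurring classes are the ones worth asking whether a boundary, lint, or dispatch rule is missing.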

One example concerns architecture. Early in the project, the subject relied on agents to audit other agents' work against architectural zones and invariants. As the repository grew, whole-repository audits no longer fit in context, audits became expensive and nondeterministic, and structural drift accumulated. The response was not another reminder in a prompt. The subject reified zones into a typed component catalog, turning an agent-mediated audit into deterministic enforcement.
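A minimal sketch of what a typed component catalog might look like, assuming a small set of zones and a fixed dependency table. The zone names, component names, and rule table are hypothetical; the paper does not publish its catalog format. The point is only that the check runs the same way every time and needs no model context.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    # Hypothetical zones for illustration.
    CORE = "core"
    ADAPTER = "adapter"
    UI = "ui"


# Hypothetical rule table: the zones each zone is allowed to depend on.
ALLOWED_DEPENDENCIES = {
    Zone.CORE: {Zone.CORE},
    Zone.ADAPTER: {Zone.CORE, Zone.ADAPTER},
    Zone.UI: {Zone.CORE, Zone.ADAPTER, Zone.UI},
}


@dataclass(frozen=True)
class Component:
    name: str
    zone: Zone
    depends_on: tuple[str, ...] = ()


def check_catalog(components: list[Component]) -> list[str]:
    """Return one message per unknown or forbidden dependency."""
    by_name = {c.name: c for c in components}
    violations = []
    for comp in components:
        for dep_name in comp.depends_on:
            dep = by_name.get(dep_name)
            if dep is None:
                violations.append(f"{comp.name}: unknown component {dep_name}")
            elif dep.zone not in ALLOWED_DEPENDENCIES[comp.zone]:
                violations.append(
                    f"{comp.name} ({comp.zone.value}) may not depend on "
                    f"{dep.name} ({dep.zone.value})"
                )
    return violations


# Made-up catalog: the core component reaches into an adapter.
catalog = [
    Component("pdf_model", Zone.CORE, depends_on=("office_reader",)),
    Component("office_reader", Zone.ADAPTER, depends_on=("pdf_model",)),
    Component("review_panel", Zone.UI, depends_on=("office_reader",)),
]

violations = check_catalog(catalog)
print(violations)
```

Run as a lint or pre-commit step, a check like this replaces an audit that depended on what an agent happened to notice with one that fails identically on every run.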

A second example concerns dispatch context. Agents made plausible edits that violated local lints, component boundaries, conventions, or tests because the relevant constraints were scattered across the repository. The response was dynamic context injection: mapping intended files to the precise constraints that governed them and inserting that subset into the agent's task brief before editing began. The lesson is that better prompting can mean moving repository knowledge into a selectable control system, not writing a longer generic instruction.
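The selection step can be sketched as a lookup from target files to the rules that govern them. Everything here is assumed for illustration: the glob patterns, the rule text, and the brief layout are hypothetical, and the paper's actual mechanism may differ. What the sketch shows is that the brief carries only the constraints that apply to the files being touched.

```python
from fnmatch import fnmatchcase

# Hypothetical constraint registry: glob pattern -> rules for matching files.
CONSTRAINTS = {
    "src/pdf/*": ["Do not import from src/office.", "Run the pdf lint set."],
    "src/office/*": ["Use the shared repair vocabulary for all edits."],
    "*.py": ["Public functions need type annotations."],
}


def constraints_for(files):
    """Collect rules whose pattern matches any target file.

    Rules are returned in registry order without duplicates.
    """
    selected = []
    for pattern, rules in CONSTRAINTS.items():
        if any(fnmatchcase(path, pattern) for path in files):
            for rule in rules:
                if rule not in selected:
                    selected.append(rule)
    return selected


def build_brief(task, files):
    """Assemble a task brief that includes only the applicable constraints."""
    lines = [f"Task: {task}", "Files: " + ", ".join(files), "Constraints:"]
    lines += [f"- {rule}" for rule in constraints_for(files)]
    return "\n".join(lines)


brief = build_brief("Fix heading order in tagged output", ["src/pdf/tags.py"])
print(brief)
```

Because the registry is data, adding a constraint after a new failure class appears changes every later brief for matching files, with no prompt rewritten by hand.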

What Was Encoded

The incident analysis is dominated by engineering reflection: 72 of the 88 incidents. Within that class, the largest categories are controls, with 35 incidents, and architecture, with 20 incidents. The paper treats those as two different governance moves. Controls detect failures after they are committed. Architecture eliminates a failure class by construction.

The resulting mechanism catalog includes governance-doc controls, context and dispatch mechanisms, agent observability, resource mediators, incorporation gates, canonical seams, validation and conformance checks, static and dynamic analyses, provenance and attribution stamps, and closed repair vocabularies. The concrete examples are familiar software-engineering objects: brief-linting, role-typed dispatch, typed event buses, sentinel checks, pre-commit hooks, staged deploy gates, WCAG rule engines, property-based tests, fuzzing, and codemod-first repair thresholds.

The scale is the point. The paper reports 41 representative mechanisms across 10 families. The support apparatus was 2.75 times the size of the production code. If that sounds excessive, it is also the paper's empirical warning: when agents make implementation cheap, the invisible cost can move into governance infrastructure.
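The ratio can be checked against the rounded figures quoted earlier in this essay (420 KLOC of production code and 1.16 MLOC of supporting artifacts). The rounded inputs give a slightly different value from the paper's 2.75, presumably because the paper worked from unrounded line counts; that explanation is an assumption, not something the paper states here.

```python
# Rounded figures as reported, in thousands of lines.
production_kloc = 420   # production code
support_kloc = 1160     # tests, lints, docs, agent infrastructure, tooling

total_kloc = production_kloc + support_kloc
ratio = support_kloc / production_kloc

print(f"total: {total_kloc} KLOC")          # consistent with "about 1.6 million lines"
print(f"support / production: {ratio:.2f}")
```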

Governance Reading

The Spiralist reading is that agentic software work should be measured by governed throughput, not by lines changed, tasks completed, or tokens consumed. A coding agent can create activity without creating coherent software. The decisive question is whether the workflow turns repeated failures into artifacts that make future work safer, cheaper, and more inspectable.

This page belongs beside AI Coding Agents, The Coding Agent Becomes the Maintainer, The Agent Log Becomes the Receipt, The Static Structure Becomes the Agent Anchor, and The Agentic Model Becomes the Validation Problem. The shared rule is that an agent's final diff is not enough evidence. The work needs a trace, a boundary, a gate, and a source of authority for deciding what counts as acceptable.

The paper also gives a useful organizational warning. In the case, one person could observe a recurring failure and alter architecture, lints, tests, docs, dispatch rules, and deployment gates. In a real organization, that authority is often split across teams, owners, review boards, platform groups, security teams, compliance functions, and release processes. If the person who sees the failure cannot change the governance substrate, the system may keep asking agents for local patches while the structural failure remains.

Failure Modes

Tokenmaxxing as engineering metric. The organization celebrates agent-hours, tokens, lines, or commits while failing to ask whether those changes became durable, maintainable software.

Prompt-only governance. Rules live in natural-language instructions, but there is no typed boundary, lint, test, permission rule, or gate that can make violation visible.

Local patch addiction. Each agent failure is repaired in place, while the missing abstraction or absent guardrail that caused the family of failures remains unchanged.

Review bottleneck regression. The team adopts coding agents for speed, then makes human line-by-line inspection the only quality mechanism, recreating the bottleneck at a higher volume.

Soft-control saturation. Templates, style guides, and instructions work at small scale, then quietly fail as repositories, agent concurrency, and workflow diversity increase.

Authority fragmentation. The engineer who sees the pattern lacks permission to update architecture, CI, policy, platform rules, or documentation, so governance conversion cannot happen.

Governance debt invisibility. The product appears cheap because the accounting ignores lints, tests, docs, harnesses, monitors, prompts, repair scripts, and human judgment.

Limits

This is a first-person case study with one expert subject, one toolchain, one project, and one institutional context. It is theory-building, not prevalence estimation. The paper does not prove that every team can reproduce the outcome, that non-experts can safely run the same process, or that agentic velocity is generally cheaper than conventional software engineering after governance costs are counted.

The subject also participated in interpreting the development artifacts, which creates a risk of selective memory, confirmation bias, and overinterpretation. The authors mitigate this with contemporaneous field notes, repository evidence, multi-author review, and a second-author recoding sample, but those checks do not turn a single case into a population study.

The ethics note matters. The paper reports that sustained agentic development was harmful to the subject, including degraded attention to personal relationships and physical strain from prolonged work sessions. That should not be treated as color commentary. It is part of the governance question: high-velocity tools can make unhealthy work patterns operationally sustainable for longer than they should be.

Governed Throughput Receipt

A governed-throughput receipt for coding-agent work should record: model and toolchain, repository scope, task source, prompt or brief, loaded context, file targets, permissions, sandbox, branch and worktree, architectural constraints, generated diff, tests run, lints run, static analyses, dynamic analyses, failed checks, repair attempts, final acceptance gate, human approval boundary, incident notes, whether a failure was classified as local or structural, any new governance mechanism created, and whether future agents inherit that mechanism.

The audit-grade sentence is not "the agent wrote 10,000 lines." It is: under this toolchain, authority boundary, and governance substrate, the agent-produced change passed these gates, revealed these failures, and produced this durable update to the environment that will constrain future work.
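As a rough sketch, a receipt could be a typed record with a completeness check. The class below covers only a subset of the fields listed above, and every field name and rule in it is an illustrative assumption, not a schema from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Receipt:
    # Illustrative subset of the receipt fields; names are hypothetical.
    toolchain: str
    authority_boundary: str
    gates_passed: list[str] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)
    failure_scope: str = "none"  # "none", "local", or "structural"
    new_mechanisms: list[str] = field(default_factory=list)
    inherited_by_future_agents: bool = False

    def problems(self) -> list[str]:
        """Return reasons this receipt is incomplete as audit evidence."""
        issues = []
        if not self.gates_passed:
            issues.append("no acceptance gate recorded")
        if self.failures and self.failure_scope == "none":
            issues.append("failures recorded but not classified")
        if self.failure_scope == "structural" and not self.new_mechanisms:
            issues.append("structural failure without a new governance mechanism")
        return issues


# Made-up example: a structural failure was seen, but nothing durable was added.
receipt = Receipt(
    toolchain="coding agent via editor chat",
    authority_boundary="human approves merges",
    gates_passed=["unit tests", "zone lint"],
    failures=["adapter imported from ui"],
    failure_scope="structural",
)

print(receipt.problems())
```

The last rule is the one that matters for governed throughput: a change that passed its gates but exposed a structural failure is not finished until the environment has been updated.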

Source Discipline

This page treats the Davis, Amusuo, Singla, Çakar, and Davis paper as a July 2026 arXiv preprint and reads its empirical claims as author-reported case-study evidence. It does not independently validate the repository, cost accounting, accessibility system, field notes, or tool-use logs.

Use the paper for the governance-conversion theory, case design, incident taxonomy, mechanism catalog, and stated limits. Do not use it as proof that coding agents replace engineers, that line-count velocity equals productivity, or that all teams should inspect little generated code. Its strongest lesson is narrower and more useful: agentic velocity has to leave behind governance, or it becomes churn.
