Blog · arXiv Analysis · Last reviewed June 25, 2026

The Assurance Directive Becomes the Operational Burden

The 2026 arXiv report "AI Assurance in UK Defence: Challenges in Operationalising JSP 936" argues that a serious AI policy becomes real only when its evidence, boundaries, human roles, and system assumptions can survive operational use.

A Directive Is Not Proof

The report, arXiv:2606.09414 [cs.HC, cs.AI], was submitted on June 8, 2026 by Callum Cockburn and Sam Farrow under the title "AI Assurance in UK Defence: Challenges in Operationalising JSP 936". The source text it studies is JSP 936 Part 1, the UK Ministry of Defence directive on dependable AI in defence.

GOV.UK published the JSP 936 Part 1 page on November 13, 2024 and describes it as MOD's principal policy framework for safe and responsible AI adoption. The official page says the directive covers governance, development, and assurance across the AI lifecycle, including quality, safety, and security. The arXiv report asks the next question: what must an organisation actually be able to prove?

The answer is not "it has a policy." Assurance must connect obligations to claims, evidence, roles, assumptions, and operational conditions. The paper treats that translation as the hard part.

Eight Frictions

Cockburn and Farrow organise their review around eight implementation challenges: adequacy of evidence and argument, management of human interaction with AI, definition of the operational environment, integration of AI within systems of systems, assessment and maintenance of AI performance, safety and security analysis, measurement of ethicality, and the broader complexity of AI assurance.

Those categories expose the difference between documentary compliance and operational confidence. Evidence is adequate only for a stated claim, context, hazard, user role, and risk level. A model can pass a benchmark while the surrounding system fails. A named overseer can still be too late, blind, or overloaded to intervene.

The report does not argue that JSP 936 is deficient as a directive. Its claim is more restrained and sharper for it: JSP 936 is a useful governance basis, but implementation depends on unresolved technical, organisational, and assurance questions. The burden moves from policy authorship to evidence architecture.

The ODD Trap

One recurring pressure point is the Operational Design Domain, or ODD. JSP 936 requires teams to identify the ODD for AI, test across it, and address behaviour during foreseeable excursions outside it. The paper treats this as more than a boundary-drawing exercise.

An ODD is easy to say and difficult to maintain. Real deployments have sensor limits, weather, adversarial action, user workarounds, changing data distributions, and shifting mission conditions. The assurance case has to know when the boundary is near, whether operators can see it, and whether fallback behaviour remains credible when the world stops matching the design story.

If a team treats the ODD as a paperwork box, it discovers boundary failure after practice has drifted. If it treats the ODD as an operational instrument, the boundary appears in training, monitoring, incident review, and re-assurance triggers.
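To make the "operational instrument" reading concrete, here is a minimal Python sketch of an ODD check that reports whether a reading is inside the domain, near its edge, or outside it. Nothing in it comes from JSP 936 or the paper: the parameter names (temperature_c, altitude_m), the numeric limits, the warning margins, and the three status labels are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """Closed interval [low, high] for one ODD parameter, plus a warning margin."""
    low: float
    high: float
    margin: float  # distance from either edge at which operators are warned


def odd_status(bounds: dict[str, Bound], reading: dict[str, float]) -> str:
    """Return 'outside', 'near_boundary', or 'inside' for one reading."""
    status = "inside"
    for name, bound in bounds.items():
        value = reading.get(name)
        # A missing measurement is treated as an excursion: the system
        # cannot show that it is still inside its design domain.
        if value is None or not (bound.low <= value <= bound.high):
            return "outside"
        if value - bound.low < bound.margin or bound.high - value < bound.margin:
            status = "near_boundary"
    return status


# Hypothetical ODD: every name and number below is invented for illustration.
ODD = {
    "temperature_c": Bound(low=-10.0, high=40.0, margin=5.0),
    "altitude_m": Bound(low=100.0, high=3000.0, margin=200.0),
}

nominal = odd_status(ODD, {"temperature_c": 20.0, "altitude_m": 1500.0})
hot = odd_status(ODD, {"temperature_c": 38.0, "altitude_m": 1500.0})
too_hot = odd_status(ODD, {"temperature_c": 45.0, "altitude_m": 1500.0})
no_altitude = odd_status(ODD, {"temperature_c": 20.0})
```

The design choice worth noticing is the middle state. A "near_boundary" result is what could feed operator displays, monitoring, and re-assurance triggers before an excursion happens, not after.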

Human Oversight as Evidence

JSP 936 is explicitly concerned with human responsibility and appropriate human oversight. The report makes clear that oversight cannot be assumed from the presence of a person. Assurance has to describe the work split between humans and automation, the information available to each role, the time available for intervention, and the trust calibration expected of operators.

That lesson travels beyond defence. "Human in the loop" is weak evidence if the loop is ceremonial. A reviewer who cannot understand model limits, reconstruct a decision, notice out-of-domain operation, or change the outcome is a witness to automation, not a control.

Evidence reuse is also fragile. If the automation level, operator role, deployment context, or team training changes, old evidence may no longer support the same claim. Assurance is a lifecycle practice, not a one-time badge.
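The fragility of evidence reuse can be sketched as a comparison between the context in which evidence was gathered and the context in which it is being relied on. This is a hypothetical illustration, not a method from the paper or the directive: the Context fields mirror the four factors named above, and every value is invented.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Context:
    """Conditions under which a piece of evidence was gathered (assumed fields)."""
    automation_level: str
    operator_role: str
    deployment_context: str
    training_version: str


@dataclass(frozen=True)
class Evidence:
    claim: str
    gathered_under: Context


def changed_fields(evidence: Evidence, current: Context) -> list[str]:
    """List the context fields that differ between gathering and current use."""
    return [
        f.name
        for f in fields(Context)
        if getattr(evidence.gathered_under, f.name) != getattr(current, f.name)
    ]


# Invented example values, for illustration only.
original = Context(
    automation_level="advisory",
    operator_role="analyst",
    deployment_context="training range",
    training_version="v1",
)
evidence = Evidence(
    claim="operators detect misclassifications in time",
    gathered_under=original,
)

# The same system, later given more autonomy in a different setting.
current = replace(
    original,
    automation_level="supervised autonomy",
    deployment_context="live exercise",
)

unchanged = changed_fields(evidence, original)
drifted = changed_fields(evidence, current)
```

A non-empty result does not prove the evidence is invalid. It only flags that the claim needs to be re-argued for the new context.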

System, Not Model

The report repeatedly pushes assurance away from model-only thinking. AI-specific performance metrics can help, but the relevant question is whether the AI-enabled system achieves its operational purpose while remaining safe, secure, lawful, and ethically acceptable. Model accuracy is not the same thing as system performance.

This connects directly to adjacent debates about safety cases, AI audits, standards that harden into law, and the revalidation of governance documents. Each asks whether the institution can show how evidence supports a live claim. The JSP 936 paper adds a defence-specific version: systems of systems, propagated security effects, and ethical review under operational constraints.

A model card cannot carry that burden alone. Neither can an impact assessment, checklist, or benchmark table. The assurance case has to preserve the chain between capability, human role, operating domain, hazards, security assumptions, ethical principles, and evidence.

The Spiralist Test

The Spiralist test is simple: when a policy says an AI system is assured, what evidence would make that claim contestable? Can a reviewer see the ODD, the evidence threshold, the human role, the fallback path, the system-of-systems dependency, and the re-assurance trigger? Can an operator tell when the system has left the story the assurance case was written for?

If not, assurance becomes a language layer over uncertainty. The directive may be sincere, detailed, and institutionally important, but the operational burden remains unpaid. Governance begins when the claim can be replayed against the system that actually acts.
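The reviewer's questions in the test above can be sketched as a completeness check over an assurance case record. This is an illustrative assumption about how such a record might be structured, not a schema from JSP 936 or the paper. The six element names come from the questions listed above, and the example entries are invented.

```python
# Elements a reviewer should be able to see, taken from the questions above.
REQUIRED_ELEMENTS = (
    "odd",
    "evidence_threshold",
    "human_role",
    "fallback_path",
    "system_dependencies",
    "reassurance_trigger",
)


def missing_elements(assurance_case: dict[str, str]) -> list[str]:
    """Return the required elements that are absent or left blank."""
    return [
        name
        for name in REQUIRED_ELEMENTS
        if not assurance_case.get(name, "").strip()
    ]


# Hypothetical draft case: the fallback path is blank and no
# re-assurance trigger has been recorded at all.
draft_case = {
    "odd": "daylight operation, temperate climate",
    "evidence_threshold": "agreed with the safety authority",
    "human_role": "operator confirms every recommendation",
    "fallback_path": "",
    "system_dependencies": "upstream sensor fusion service",
}

gaps = missing_elements(draft_case)
```

A check like this can only confirm that each element has been written down. Whether the written content is credible remains a matter for human review.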

Scope Boundary

The paper is a structured interpretive review of JSP 936 Part 1, not an empirical evaluation of a deployed MOD AI system. It does not show that any particular defence system is unsafe or unassurable. It also does not reproduce JSP 936 Part 2 or the full Defence AI Playbook context.

The modest conclusion is enough: high-stakes AI assurance is not satisfied by policy existence. It requires evidence that remains valid across operational context, human-machine interaction, system integration, safety, security, and ethical commitments.
