Last reviewed June 25, 2026

The Agent Instruction Becomes the Policy Compiler

A June 2026 arXiv paper asks whether agent prompts, MCP tool descriptions, and natural-language policy documents can be compiled into checked Cedar policy rather than trusted as text inside the model context.

Not a Better Prompt

The paper, arXiv:2606.26649 [cs.AI; cs.CR], was submitted on June 25, 2026. arXiv lists the title as Autoformalization of Agent Instructions into Policy-as-Code, by Adam Mondl, Matthew Maisel, and John H. Brock, with a note that it was accepted at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond, ICML 2026.

The paper is close to the site's earlier page on runtime agent rulebooks, but the angle is different. That page asked what external rule system an agent needs. This one asks how loose instructions become a formal rule artifact.

The Paper Frame

Tool-using agents create a familiar bind. To be useful, they need permission to act. To be safe, those permissions must be narrower than the agent's ability to write persuasive text about why it should proceed. Prompt instructions and classifier guardrails can help, but the paper argues that security-critical domains need enforcement outside the probabilistic model loop.

The proposed alternative is policy-as-code. Instead of leaving policy inside a system prompt, the pipeline translates agent prompts, Model Context Protocol tool descriptions, and natural-language policy documents into Cedar authorization policies. At runtime, an external deterministic policy engine evaluates proposed agent actions against those rules.
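To make the runtime step concrete, here is a minimal Python sketch of an external, deterministic check. It is not Cedar and not the paper's code: the `Request` and `Rule` types, the `authorize` function, and the sample principals, actions, and resource names are all hypothetical. The only thing borrowed from Cedar is its documented decision rule: a request is denied unless some permit matches, and any matching forbid overrides every permit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    principal: str
    action: str
    resource: str


@dataclass(frozen=True)
class Rule:
    effect: str  # "permit" or "forbid"
    matches: Callable[[Request], bool]


def authorize(rules: list[Rule], request: Request) -> str:
    """Default deny; a matching forbid wins over any matching permit."""
    matched = [rule.effect for rule in rules if rule.matches(request)]
    if "forbid" in matched:
        return "Deny"
    if "permit" in matched:
        return "Allow"
    return "Deny"


# Hypothetical rule set for illustration only.
rules = [
    Rule("permit", lambda r: r.principal == "agent" and r.action == "GET"),
    Rule("permit", lambda r: r.principal == "agent" and r.action == "POST"),
    Rule("forbid", lambda r: r.action == "POST"
         and r.resource.startswith("MedicationRequest")),
]

print(authorize(rules, Request("agent", "GET", "Patient/123")))           # Allow
print(authorize(rules, Request("agent", "POST", "MedicationRequest/9")))  # Deny
print(authorize(rules, Request("agent", "DELETE", "Patient/123")))        # Deny
```

The point of the sketch is that the decision depends only on the rule set and the proposed action, never on how persuasively the agent argues for it.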

The Compiler Loop

The authors call the architecture a Verification Sandwich. The grounding layer extracts entities, tool schemas, and principal-resource-action structure. The model layer generates candidate Cedar policies from the agent's system instruction, tool definitions, and policy corpus. The safety layer checks those policies with two critics.

The hard critic is deterministic. It uses Cedar tooling to check syntax, schema compliance, vacuous policies, and conflicting rules. The soft critic is model-based: an LLM judge evaluates whether the generated formal logic still matches the meaning of the original instructions and policy documents. Feedback from both critics returns to the generator until the policy set clears the checks.
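The feedback loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `refine`, the toy critics, and the scripted drafts are hypothetical, the critics here are trivial string checks standing in for Cedar tooling and an LLM judge, and running the soft critic only after the hard critic passes is an ordering chosen for this sketch.

```python
from typing import Callable

Feedback = list[str]


def refine(
    generate: Callable[[Feedback], str],
    hard_critic: Callable[[str], Feedback],
    soft_critic: Callable[[str], Feedback],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Regenerate until both critics return no feedback, or give up."""
    feedback: Feedback = []
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        feedback = hard_critic(candidate)
        if not feedback:
            # Assumption: semantic review only runs on structurally valid drafts.
            feedback = soft_critic(candidate)
        if not feedback:
            return candidate, round_no
    raise RuntimeError(f"no policy cleared both critics in {max_rounds} rounds")


# Scripted stand-in for a generator model: three successive drafts.
drafts = iter([
    "permit(principal, action, resource)",    # missing terminator
    "permit(principal, action, resource);",   # well formed, but permits everything
    'forbid(principal, action == Action::"POST", resource);',
])


def toy_generate(feedback: Feedback) -> str:
    return next(drafts)


def toy_hard_critic(policy: str) -> Feedback:
    return [] if policy.endswith(";") else ["policy is missing a terminating ';'"]


def toy_soft_critic(policy: str) -> Feedback:
    if policy.startswith("forbid"):
        return []
    return ["source policy restricts writes; draft does not"]


policy, rounds = refine(toy_generate, toy_hard_critic, toy_soft_critic)
print(rounds, policy)
```

A bounded retry count matters in practice: a loop that never converges should surface as an error for a human reviewer rather than silently ship the last draft.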

The paper's reported experiment used Gemini 3 Pro as the candidate generator and a two-stage Gemini 2.5 Flash judge-verifier pipeline as the soft critic. The hard verifier used the Rust cedar-policy CLI. Those implementation details matter because a "policy compiler" is not one thing. It is a chain of source documents, schemas, model generations, deterministic validation, semantic review, and runtime enforcement.

The MedAgentBench Test

The evaluation uses MedAgentBench, a benchmark for tool-using electronic medical record agents. The paper compares against prior symbolic-guardrail work by Hong et al. In that setup, a synthetic natural-language policy contained 88 rules, while the hand-coded guardrails covered 23 of them. Mondl, Maisel, and Brock feed the same policy through their autoformalization pipeline to generate Cedar policies and evaluate enforcement coverage rather than agent utility.

The reported conditions include raw HTTP tools, typed MCP tools, and a guardrail condition. The adversarial dataset replaces benign tasks with prompts designed to induce policy violations. In the adversarial guardrail condition, 49 trajectories slipped past the MCP server's hard gating, and Cedar's additional deny coverage blocked 42 of them, or 85.7 percent. The paper also breaks results down by write attempts and reports that Cedar blocked every trajectory containing at least one POST request in the conditions it shows.
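The headline percentage is simple arithmetic on the two counts quoted above. The variable names are this page's own; the numbers are the paper's as reported here.

```python
slipped_past_mcp = 49   # adversarial trajectories not stopped by MCP hard gating
blocked_by_cedar = 42   # of those, trajectories the Cedar layer denied

block_rate = blocked_by_cedar / slipped_past_mcp
not_blocked = slipped_past_mcp - blocked_by_cedar

print(f"{block_rate:.1%}")  # 85.7%
print(not_blocked)          # 7
```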

What the Result Means

The result is not "formal policy solves agents." It is narrower. A separately enforced Cedar layer can catch policy violations that prompt-level or MCP-level controls did not cover, especially when the violation reaches a tool call the policy engine can inspect. The result also shows why scope matters: some trajectories could not be blocked because they never issued a POST write, and Cedar evaluates tool calls, not dialog-only behavior.

The finding is still important. A prompt can be argued with, buried, contradicted, or summarized away. A generated Cedar policy can be parsed, checked, versioned, hashed, tested, audited, and enforced outside the model's context window. That is the change from instruction as advice to instruction as artifact.
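As one small illustration of "versioned and hashed," the sketch below fingerprints a policy file with SHA-256 from Python's standard library. The function name `policy_fingerprint` and the whitespace normalization are assumptions of this page, not something the paper specifies; the policy strings are only sample text.

```python
import hashlib


def policy_fingerprint(policy_text: str) -> str:
    """SHA-256 over the policy text, ignoring trailing whitespace per line."""
    lines = policy_text.strip().splitlines()
    normalized = "\n".join(line.rstrip() for line in lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = "forbid(principal, action, resource);\n"
reformatted = "forbid(principal, action, resource);   \n\n"
edited = "permit(principal, action, resource);\n"

print(policy_fingerprint(original) == policy_fingerprint(reformatted))  # True
print(policy_fingerprint(original) == policy_fingerprint(edited))       # False
```

A stable fingerprint lets an audit trail say exactly which rule set was enforced at the time of a given decision.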

Governance Reading

The paper belongs beside this site's pages on tool-scope authorization, compliance traces, policy-adherent agent state, Model Context Protocol, and AI audit trails. The practical rule is simple: if policy can be compiled, it must also be reviewed as compiled code.

Autoformalization does not remove human responsibility. It changes where human review should concentrate. Reviewers need to inspect the source policy corpus, generated schema, generated rules, critic rubric, hard-verifier results, representative blocked and allowed cases, and runtime fail-closed behavior. Otherwise, the institution has only moved trust from a prompt to a compiler it does not understand.
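Fail-closed behavior is the easiest item on that review list to show in code. The sketch below is a hypothetical wrapper, not part of the paper's pipeline: `enforce` and the three toy engines are invented names, and the rule it illustrates is that anything other than an explicit allow, including an engine error, is treated as a deny.

```python
from typing import Callable, Optional


def enforce(decide: Callable[[dict], Optional[bool]], request: dict) -> bool:
    """Return True only when the engine explicitly returns True."""
    try:
        decision = decide(request)
    except Exception:
        # Engine unreachable, policy failed to load, malformed request, and so on.
        return False
    return decision is True


def broken_engine(request: dict) -> bool:
    raise TimeoutError("policy engine unreachable")


def ambiguous_engine(request: dict) -> Optional[bool]:
    return None  # no decision produced


def permissive_engine(request: dict) -> bool:
    return True


request = {"action": "POST", "resource": "example"}
print(enforce(broken_engine, request))      # False
print(enforce(ambiguous_engine, request))   # False
print(enforce(permissive_engine, request))  # True
```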

Limits

The paper names two future-work gaps, both of which follow from Cedar being stateless by design: it cannot enforce multi-turn workflows that depend on action ordering, and it cannot reason over persistent context. The authors propose temporal-logic integration for prerequisite sequences and memory-aware policies that can reference an agent's trajectory.
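A toy Python contrast shows what the stateless limit means. It is only an illustration of the gap, not the temporal-logic or memory-aware design the authors propose, and the action names `read_chart` and `order_medication` are hypothetical.

```python
ALLOWED_ACTIONS = {"read_chart", "order_medication"}


def stateless_check(action: str) -> bool:
    """Sees one proposed action at a time, with no memory of earlier ones."""
    return action in ALLOWED_ACTIONS


def trajectory_check(history: list[str], action: str) -> bool:
    """Adds a prerequisite: an order is allowed only after the chart was read."""
    if action == "order_medication" and "read_chart" not in history:
        return False
    return stateless_check(action)


# The stateless check cannot tell these two situations apart.
print(stateless_check("order_medication"))                 # True
print(trajectory_check([], "order_medication"))            # False
print(trajectory_check(["read_chart"], "order_medication"))  # True
```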

The discussion also warns about friction. If generated policies are too restrictive, developers may disable protections entirely. The soft critic is meant to help find a workable policy boundary, but it is still model-based semantic judgment. A formally valid policy can still encode the wrong institution, the wrong exception, or the wrong operational trade-off.

Policy Compiler Receipt

A policy compiler receipt should record: source prompt, MCP tool definitions, policy corpus, source versions, target policy language, generated schema, generator model, hard-critic tooling, soft-critic model, critic rubric, retry count, validation errors, final policy hash, allowed-case tests, denied-case tests, fail-closed behavior, deployment scope, policy owner, rollback path, and known out-of-scope behaviors such as dialog-only risks.
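One way to make that list checkable is to give it a type. The dataclass below is a sketch by this page, not a format defined in the paper: the class name, field names, the `gaps` helper, and every angle-bracketed value are placeholders, and the choice of which fields count as mandatory evidence is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PolicyCompilerReceipt:
    source_prompt: str
    mcp_tool_definitions: list[str]
    policy_corpus: list[str]
    source_versions: dict[str, str]
    target_policy_language: str
    generated_schema: str
    generator_model: str
    hard_critic_tooling: str
    soft_critic_model: str
    critic_rubric: str
    retry_count: int
    validation_errors: list[str]
    final_policy_hash: str
    allowed_case_tests: list[str]
    denied_case_tests: list[str]
    fail_closed_behavior: str
    deployment_scope: str
    policy_owner: str
    rollback_path: str
    out_of_scope_behaviors: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Names of evidence fields that are still empty (assumed mandatory)."""
        required = ("final_policy_hash", "allowed_case_tests",
                    "denied_case_tests", "policy_owner", "rollback_path")
        return [name for name in required if not getattr(self, name)]


# All values are placeholders, not data from the paper.
receipt = PolicyCompilerReceipt(
    source_prompt="<system prompt>",
    mcp_tool_definitions=["<tool definition>"],
    policy_corpus=["<policy document>"],
    source_versions={"policy_corpus": "<version>"},
    target_policy_language="Cedar",
    generated_schema="<schema>",
    generator_model="<generator model>",
    hard_critic_tooling="<validator and version>",
    soft_critic_model="<judge model>",
    critic_rubric="<rubric>",
    retry_count=0,
    validation_errors=[],
    final_policy_hash="<sha256 hex digest>",
    allowed_case_tests=["<trace id>"],
    denied_case_tests=[],
    fail_closed_behavior="<deny on engine error>",
    deployment_scope="<environment>",
    policy_owner="<team>",
    rollback_path="<previous policy hash>",
    out_of_scope_behaviors=["dialog-only risks"],
)

print(receipt.gaps())  # ['denied_case_tests']
serialized = json.dumps(asdict(receipt), sort_keys=True)
```

Serializing the receipt next to the policy hash gives later reviewers a single record to challenge.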

The audit-grade sentence is not "the agent has policy." It is: this source policy was compiled into this Cedar rule set by this generator, checked by these critics, tested against these traces, enforced at this boundary, and versioned for later challenge.

Sources

Adam Mondl, Matthew Maisel, and John H. Brock, Autoformalization of Agent Instructions into Policy-as-Code, arXiv:2606.26649 [cs.AI; cs.CR], submitted June 25, 2026.
Primary arXiv versions checked: metadata API record, PDF, and experimental HTML, reviewed for title, authorship, submission date, workshop note, pipeline architecture, MedAgentBench setup, reported block-rate results, discussion, and future-work limits.
Related pages: The Agent Rulebook Leaves the Prompt, The Tool Scope Becomes the Intent Gate, The Compliance Trace Becomes the Rulebook, The Agent Ledger Becomes the Policy State, Model Context Protocol, and AI Audit Trails.
