Blog · arXiv Analysis · Last reviewed June 25, 2026

The Governance Policy Becomes the Mechanical Gate

A policy prompt is not the same as a constraint. Regulated AI decisions need evidence that governance changed the decision path, not only fluent policy language after the fact.

Policy Is Not Enforcement

Governance fails quietly when the same model interprets a rule, decides whether it has been met, and writes the rationale that claims compliance. A deferral can have the right form while saying nothing useful about which data is missing, why it matters, or what would resolve the case.

A natural-language policy can guide a model. It does not, by itself, remove a clear-cut case from model discretion. For regulated decisions, the audit question is not "did the model mention the policy?" It is "which part of the system forced the decision boundary and left a record that a human can inspect?"

The Paper

arXiv lists "Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems" as arXiv:2605.14744v1 [cs.CL], submitted May 14, 2026. The authors are José Manuel de la Chica Rodríguez and Carlos Martí-González. The paper is identified as a Santander AI Lab conceptual report and lists Santander AI Lab, Grupo Santander, as the affiliation.

The study uses a synthetic banking-style decision domain rather than production customer data. Cases include risk score, information completeness, transaction amount, jurisdiction, and regulatory flags such as AML, KYC, SANCTIONS, INSIDER, and CONCENTRATION. The paper reports N = 300 cases per condition across two governance regimes and four stress conditions, for 2,400 regime-condition cases.

Two Governance Regimes

The first regime, R1, is text-only governance. The model receives a governance policy as a system prompt and self-interprets the policy while producing one of five structured decisions: APPROVE, CONDITIONAL, ESCALATE, DEFER, or DECLINE. This is the familiar arrangement in which policy exists inside the same interpretive loop as the decision.

The second regime, R2, keeps the policy but adds four mechanical primitives outside the model's interpretive loop. Hard gates enforce boundaries on structured variables. I6Q applies rationale-quality constraints. CEFL externalizes candidate generation before scoring. E3 commits the entropy seed before scoring through a commit-reveal pattern, so randomization cannot be observed and exploited during the decision.
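The commit-reveal idea behind E3 can be illustrated with a plain hash commitment. This is a minimal sketch, not the paper's implementation: the function names, the choice of SHA-256, and the random nonce are all assumptions made for illustration.

```python
import hashlib
import secrets


def commit_seed(seed: int) -> tuple[str, bytes]:
    """Return (commitment, nonce). The commitment is recorded before scoring."""
    # Assumption: a random nonce keeps a small seed from being guessed by brute force.
    nonce = secrets.token_bytes(16)
    digest = hashlib.sha256(nonce + str(seed).encode("utf-8")).hexdigest()
    return digest, nonce


def verify_reveal(commitment: str, seed: int, nonce: bytes) -> bool:
    """After scoring, check the revealed seed and nonce against the commitment."""
    digest = hashlib.sha256(nonce + str(seed).encode("utf-8")).hexdigest()
    return secrets.compare_digest(digest, commitment)
```

The point of the pattern is ordering: the commitment is fixed and logged first, the seed is revealed only after scoring, and a reviewer can later confirm that the seed was not swapped.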

This is a narrow architectural claim: where a decision can be resolved by structured governance rules, the model should not be the component that decides whether those rules bind.
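A hard gate of this kind might look like the sketch below. The case fields and the five decision labels follow the article; the specific rules, the 0.5 completeness threshold, and the names `Case`, `hard_gate`, and `decide` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

DECISIONS = ("APPROVE", "CONDITIONAL", "ESCALATE", "DEFER", "DECLINE")


@dataclass(frozen=True)
class Case:
    risk_score: float      # assumed to lie in [0, 1]
    completeness: float    # assumed fraction of required fields present
    amount: float
    jurisdiction: str
    flags: frozenset = field(default_factory=frozenset)


def hard_gate(case: Case) -> Optional[str]:
    """Return a forced decision, or None to leave the case to the model.

    Both rules below are illustrative assumptions.
    """
    if "SANCTIONS" in case.flags:
        return "DECLINE"
    if case.completeness < 0.5:
        return "DEFER"
    return None


def decide(case: Case, model_decide) -> tuple[str, str]:
    """Return (decision, decided_by) so the record shows who set the boundary."""
    forced = hard_gate(case)
    if forced is not None:
        return forced, "hard_gate"
    decision = model_decide(case)
    if decision not in DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")
    return decision, "model"
```

The gate runs before the model is consulted, so a gated case never reaches model discretion, and the second element of the result records which component decided.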

What Was Measured

The paper measures governance at the rationale level, not only at the task-output level. Cosmetic Deadlock Rate (CDL) measures deferrals that are formally present but informationally empty. Deferral Information Utilisation (DIU) measures how much decision-relevant information a deferral carries. Framing Success Rate (FSR) measures susceptibility to framing. Failure Visibility Score (FVS) measures whether failures remain visible. Entropy Sensitivity Differential (ESD) measures sensitivity to randomization choices.
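To make the CDL idea concrete, here is one way a rate of this kind could be computed. The article does not give the paper's scoring procedure, so the emptiness test below (a deferral that names no missing field and no resolution step) and the `Deferral` structure are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Deferral:
    missing_fields: list = field(default_factory=list)
    resolution_steps: list = field(default_factory=list)


def is_cosmetic(d: Deferral) -> bool:
    """Hypothetical emptiness test: nothing named as missing, nothing to resolve."""
    return not d.missing_fields and not d.resolution_steps


def cosmetic_deadlock_rate(deferrals: list) -> float:
    """Share of deferrals that are informationally empty (0.0 if there are none)."""
    if not deferrals:
        return 0.0
    return sum(is_cosmetic(d) for d in deferrals) / len(deferrals)
```

Under this reading, a CDL of 0.273 would mean that roughly 27 in every 100 deferrals gave a reviewer nothing to act on.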

The task side is measured separately with ordinary classification metrics, including Matthews correlation coefficient. That separation matters. A system can make more correct task decisions while still producing weak governance evidence, or it can preserve governance quality while choosing to defer rather than force a confident but poorly supported classification.
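For readers who have not met MCC with more than two classes, the sketch below computes the standard multiclass generalization from label counts, next to plain accuracy. Whether the paper uses exactly this formulation is an assumption; the function names are illustrative.

```python
import math
from collections import Counter


def multiclass_mcc(y_true: list, y_pred: list) -> float:
    """Multiclass Matthews correlation coefficient computed from label counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                  # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified
    t_k = Counter(y_true)                            # true count per class
    p_k = Counter(y_pred)                            # predicted count per class
    classes = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    denominator = math.sqrt(
        (s * s - sum(p_k[k] ** 2 for k in classes))
        * (s * s - sum(t_k[k] ** 2 for k in classes))
    )
    # Convention assumed here: report 0.0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


def accuracy(y_true: list, y_pred: list) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

A classifier that answers APPROVE for every case in a set of nine APPROVE and one DEFER scores 0.9 accuracy but an MCC of 0.0, which is why MCC is the more demanding task metric on imbalanced decisions.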

The Decoupling Result

In the baseline condition, the paper reports that text-only governance produced vacuous deferrals in 27 percent of deferral cases, with CDL = 0.273. Mechanical enforcement reduced CDL to 0.074, a 73 percent reduction, and raised DIU from 0.298 to 0.766. Task performance improved at the same time: MCC rose from 0.433 under R1 to 0.884 under R2, with reported accuracy rising from 0.422 to 0.909.

The more important result appears under the LowInfo structural stress condition. R2 kept strong governance scores there, CDL = 0.088 and DIU = 0.852, the latter above its baseline of 0.766, while its task performance dropped to its worst reported MCC, 0.285. The paper calls this governance-task decoupling. Governance quality and task performance are not the same axis. Under stress, a system may preserve a useful review trail precisely by deferring more carefully.

The ablation study supports the mechanism. Removing I6Q raised CDL by 47 percent, from 0.074 to 0.109. Disabling the DEFER channel produced the lowest Failure Visibility Score, 0.500. The paper also notes that the aggregate improvement came from the mechanical component that removed some cases from model control.
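The percentage changes quoted in this section follow directly from the reported CDL values. A short check, using only the figures cited above:

```python
def relative_change(before: float, after: float) -> float:
    """Signed change relative to the starting value."""
    return (after - before) / before


# Baseline CDL, R1 to R2: about a 73 percent reduction.
print(round(relative_change(0.273, 0.074) * 100))  # -73

# R2 CDL with I6Q removed: about a 47 percent increase.
print(round(relative_change(0.074, 0.109) * 100))  # 47
```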

Governance Reading

This page belongs beside adverse-action explanations, AI in finance, agentic model validation, and AI audit interfaces. The fresh contribution is the boundary between a policy that the model reads and a rule that the system enforces before or around the model.

For a regulated workflow, an accuracy table is not a governance file. The record should show which cases were intercepted by hard gates, which ones were left to model discretion, what rationale-quality checks ran, and how a human reviewer could resolve a deferral. A model-risk committee should be able to inspect the boundary without accepting the model's prose as proof that governance worked.

The practical rule is simple: keep natural-language policy for explanation, training, and human legibility; put enforceable thresholds, audit receipts, and deferral requirements in the surrounding system.

Limits

The paper is careful about scope. The domain is synthetic banking, not a deployed financial-decision system. The experiments use a single model family, identified as Llama 3.1 70B Instruct via AWS Bedrock with deterministic inference and seed 42. The 40/60 deterministic-to-ambiguous case split is a modelling choice. Generality would require cross-model validation and deployment-scale testing.

Mechanical gates also do not solve every governance problem. The paper reports that both regimes remained susceptible to framing, and that the majority of cases, which were still decided by the LLM, needed further work. The value of the paper is narrower and stronger: it shows why governance should be measured separately from task performance, and why some policy obligations should be enforced outside the model's interpretive loop.

Mechanical Gate Receipt

A mechanical-gate receipt should record: policy version, gate thresholds, structured input variables, missing fields, regulatory flags, model and prompt versions, candidate-generation record, entropy commitment, I6Q score, CDL and DIU measurements, deferral reason, human reviewer action, and override authority.
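As a data structure, the receipt above could be sketched as follows. The field names, types, and the JSON serialization are assumptions chosen for illustration; the article lists only what the receipt should contain, not a schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class GateReceipt:
    policy_version: str
    gate_thresholds: dict
    input_variables: dict
    model_version: str
    prompt_version: str
    entropy_commitment: str
    missing_fields: list = field(default_factory=list)
    regulatory_flags: list = field(default_factory=list)
    candidate_record: list = field(default_factory=list)
    i6q_score: Optional[float] = None
    cdl: Optional[float] = None
    diu: Optional[float] = None
    deferral_reason: Optional[str] = None
    reviewer_action: Optional[str] = None   # filled in after human review
    override_authority: Optional[str] = None

    def to_json(self) -> str:
        """Serialize with sorted keys so identical receipts produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)
```

Leaving the reviewer fields empty until a human acts keeps the receipt honest about what the system decided on its own and what was resolved later.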
