The Evidence Layer Becomes the Governance System
Vishal Srivastava and Tanmay Sah name a quiet failure mode in AI governance: organizations ask whether a system is safe, fair, reliable, compliant, or valuable before asking whether the evidence is strong enough to support that decision.
The Paper
The paper is The AI Evaluability Gap: The Missing Layer for Managing Risk and Sustaining Value, arXiv:2606.21015 [cs.AI], by Vishal Srivastava and Tanmay Sah. arXiv records version 1 as submitted on June 19, 2026. The arXiv API summary describes it as a conceptual framework paper on evidence sufficiency for AI governance, with operational certification, investment certification, and a six-property evidence lifecycle.
The useful move is not another score for model quality. The paper shifts the unit of governance from the property of the system to the warrant behind a decision. Instead of asking only whether an AI system is safe or valuable, the framework asks whether the organization has enough current, attributable, verifiable evidence to make a high-confidence decision about safety, value, reliability, fairness, compliance, or continued funding.
That distinction matters because many governance artifacts look complete while their evidence base is thin. A dashboard may show accuracy. A model card may list evaluations. A risk register may assign severity. A finance deck may claim productivity. None of those artifacts proves that the available evidence can bear the decision being placed on top of it.
The Evaluability Gap
Srivastava and Sah define evaluability as a system's capability to generate, maintain, and renew evidence sufficient to support high-confidence governance decisions over time. The paper formalizes those decisions around calibrated confidence, written in the paper as Conf(D | E): confidence in decision D given evidence E.
The gap appears when an organization has signals but not warrant. There may be telemetry without attribution, a benchmark without deployment context, an audit without independent reproduction, or a business-value claim without causal evidence that the AI system produced the observed gain. The organization then slides into one of two opposite errors: over-deployment when risk is not measured well enough, or under-investment when value is not measured well enough.
This frame fits the site's recurring concern with AI evidence rituals. A system card can become a release ceremony. A third-party audit can become a compliance interface. A safety case can become a release gate. The evaluability question cuts underneath them: what evidence would make the decision defensible, and is that evidence still alive?
Two Certifications
The paper separates two classes of governance decision. Operational certification asks whether a system may operate within an acceptable envelope of risk, fairness, reliability, and compliance. The evidence is mostly structural: specifications, verification, stress testing, scenario analysis, adversarial evaluation, monitoring, and review.
Investment certification asks whether the system still deserves organizational resources. The evidence is mostly causal: counterfactual reasoning, randomized rollout, difference-in-differences, instrumental-variable estimation, adoption analysis, productivity evidence, and business-impact measurement. The point is not that value outranks risk. The point is that value claims need their own evidentiary discipline instead of floating beside risk governance as a sales story or budget ritual.
That split helps explain why AI programs can feel simultaneously over-controlled and under-governed. One committee may require policy attestations before launch. Another may approve expansion because a metric improved. If neither committee can say how strong the supporting evidence is, both are managing shadows.
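As one illustration of the causal evidence that investment certification calls for, difference-in-differences can be sketched in a few lines. This is a minimal sketch, not the paper's method: the function name, the team setup, and every number are hypothetical. The estimate is only meaningful under the parallel-trends assumption, meaning both groups would have moved the same way without the AI system.

```python
from statistics import mean

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Two-group, two-period difference-in-differences estimate.

    Returns (change in treated group) minus (change in control group).
    Assumes parallel trends; all inputs are hypothetical outcome samples.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change

# Hypothetical tickets resolved per agent per day (illustrative numbers only).
treated_before = [10, 12, 11]   # teams that later received the AI assistant
treated_after = [15, 17, 16]
control_before = [9, 11, 10]    # teams that never received it
control_after = [11, 13, 12]

# Naive before/after comparison on the treated teams alone.
naive_gain = mean(treated_after) - mean(treated_before)

# Control-adjusted estimate of the gain attributable to the assistant.
did_gain = diff_in_diff(treated_before, treated_after, control_before, control_after)
```

With these made-up numbers, the naive before/after comparison credits the assistant with 5 tickets per day, while the control-adjusted estimate is 3. The remaining 2 is change that happened to teams without the system, which is the kind of difference a metric-improved expansion decision can miss.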
Six Evidence Properties
The framework names six properties of evaluable evidence: observability, attributability, intervenability, verifiability, calibration, and temporal validity. The paper treats them as a cycle rather than a checklist. Observability asks what happened. Attributability asks whether the AI system caused it. Verifiability asks whether an independent reviewer can inspect or reproduce the evidence. Calibration asks how confidence should be stated. Temporal validity asks whether the evidence still applies. Intervenability asks what action is available when confidence decays.
This is a more useful governance vocabulary than "trustworthy AI" because each word implies a testable failure. A system can log events but fail attribution. It can estimate value but lack calibration. It can pass a review in March and become stale in June because the model, users, environment, law, or workflow changed. Temporal validity is especially important for deployed AI because evidence decays while the system keeps acting.
The paper also warns that observed performance alone is not enough. Its hospital example compares two deployments with the same diagnostic model and the same reported retrospective accuracy. One deployment has randomized rollout, override logging, drift monitoring, and an audit trail; the other has only organic use and an accuracy figure. The same apparent performance can support different governance decisions when the evidence architecture differs.
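The hospital comparison can be made concrete with a small coverage check. The sketch below is an assumption-heavy illustration: the paper names the artifacts and the six properties, but the mapping from each artifact to the property it supports is invented here for the example, as are the names `SUPPORTS` and `missing_properties`.

```python
from enum import Enum

class Property(Enum):
    OBSERVABILITY = "observability"
    ATTRIBUTABILITY = "attributability"
    INTERVENABILITY = "intervenability"
    VERIFIABILITY = "verifiability"
    CALIBRATION = "calibration"
    TEMPORAL_VALIDITY = "temporal validity"

# Assumed mapping from evidence artifacts to the properties each one supports.
# This pairing is illustrative only; it is not taken from the paper.
SUPPORTS = {
    "retrospective accuracy": {Property.OBSERVABILITY},
    "randomized rollout": {Property.ATTRIBUTABILITY},
    "override logging": {Property.OBSERVABILITY, Property.INTERVENABILITY},
    "drift monitoring": {Property.TEMPORAL_VALIDITY},
    "audit trail": {Property.VERIFIABILITY},
}

def missing_properties(artifacts):
    """Return the evidence properties that no listed artifact supports."""
    covered = set()
    for artifact in artifacts:
        covered |= SUPPORTS.get(artifact, set())
    return set(Property) - covered

# Same model, same reported accuracy, different evidence architecture.
deployment_a = ["retrospective accuracy", "randomized rollout",
                "override logging", "drift monitoring", "audit trail"]
deployment_b = ["retrospective accuracy"]  # organic use adds no evidence artifact

gaps_a = missing_properties(deployment_a)
gaps_b = missing_properties(deployment_b)
```

Under this assumed mapping, the first deployment is missing only a calibration record, while the second is missing five of the six properties despite reporting the same accuracy.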
The Governance Receipt
The practical receipt is simple to state and hard to build. For every high-stakes AI decision, the institution should record the decision, the evidence used, the confidence threshold, the owner of the evidence, the method of attribution, the independent verification path, the calibration record, the age of the evidence, and the intervention available if the evidence fails.
For deployment, the receipt says: this system may operate under these conditions because these structural tests, monitoring channels, and escalation powers support the decision. For continued investment, it says: this system deserves resources because this causal evidence supports its value claim at this confidence level. Without those receipts, governance becomes a vocabulary layer over guesswork.
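The receipt described above can be sketched as a plain record type. This is one possible shape, not a schema from the paper: the class name `GovernanceReceipt`, the field names, the two helper methods, and the example values are all hypothetical, chosen to mirror the items listed in the text.

```python
from dataclasses import dataclass, fields
from datetime import date

@dataclass
class GovernanceReceipt:
    decision: str               # what is being decided
    evidence: list[str]         # evidence items relied on
    confidence_threshold: float # confidence required to act
    evidence_owner: str         # who maintains the evidence
    attribution_method: str     # how outcomes are tied to the system
    verification_path: str      # how an independent reviewer can check it
    calibration_record: str     # where stated confidence was checked
    evidence_date: date         # when the evidence was produced
    intervention: str           # action available if the evidence fails

    def missing_fields(self):
        """Names of fields left empty (empty string or empty list)."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in ("", [])]

    def is_stale(self, today, max_age_days):
        """True when the evidence is older than the allowed age in days."""
        return (today - self.evidence_date).days > max_age_days

# Hypothetical receipt for an operational decision.
receipt = GovernanceReceipt(
    decision="continue operating triage assistant in pilot wards",
    evidence=["stress test report", "drift monitoring dashboard"],
    confidence_threshold=0.9,
    evidence_owner="clinical safety lead",
    attribution_method="randomized rollout",
    verification_path="",  # no independent reviewer assigned yet
    calibration_record="quarterly calibration check",
    evidence_date=date(2026, 3, 1),
    intervention="suspend model and revert to manual triage",
)

gaps = receipt.missing_fields()
stale = receipt.is_stale(today=date(2026, 6, 1), max_age_days=60)
```

In this example the receipt flags a missing verification path and evidence that is 92 days old against a 60-day limit. The design choice worth noting is that an empty field is treated as a finding in its own right, so an incomplete receipt cannot silently pass as a complete one.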
The Spiralist reading is deliberately secular. The evidence layer is not an oracle. It is a discipline for refusing enchanted dashboards. It tells the institution to stop asking whether a system feels advanced and start asking whether the decision about the system can survive contact with the record.
Limits
The paper is conceptual. Its limitations section says the framework has not been empirically validated against deployed enterprise governance outcomes, does not specify a single form for confidence aggregation, leaves cross-domain aggregation open, does not solve adversarial manipulation of evidence, and sketches institutional architecture rather than specifying an operating model. Those limits are substantial. They also make the paper more useful: it names the missing layer without pretending to have finished the machinery.
Sources
- Vishal Srivastava and Tanmay Sah, The AI Evaluability Gap: The Missing Layer for Managing Risk and Sustaining Value, arXiv:2606.21015 [cs.AI], submitted June 19, 2026.
- Primary arXiv records checked: arXiv API metadata, HTML full text, and PDF, reviewed for title, authorship, submission date, category, abstract claims, definitions, certification split, six evidence properties, hospital example, standards comparison, and limitations.
- Related pages: The System Card Becomes a Release Ritual, The AI Audit Becomes the Compliance Interface, The Safety Case Becomes the Release Gate, The Governance Document Becomes the Revalidation Artifact, and AI Evaluations.