The Root Cause Becomes the Causal Trace
Aoyang Fang and colleagues' June 2026 arXiv paper turns root cause analysis from a label-matching problem into a causal-process test. The point is not whether an agent can name a plausible broken service. The point is whether it can show the verified path from fault to symptom.
The Label Is Not the Reason
The paper, arXiv:2606.27154 [cs.AI], was submitted on June 25, 2026. arXiv lists the exact title as OpenRCA 2.0: From Outcome Labels to Causal Process Supervision, by Aoyang Fang, Yifan Yang, Jin'ao Shang, Qisheng Lu, Junjielung Xu, Rui Wang, Songhan Zhang, Yuzhong Zhang, Boxi Yu, and Pinjia He.
This site already treats agent traces as governance objects in the process-map essay, the fault-investigator essay, and the monitoring-trace essay. OpenRCA 2.0 gives that theme a sharper test: when an LLM agent diagnoses a software incident, should the benchmark grade only the named root cause, or should it also grade the path by which the fault propagated?
Outcome-only evaluation can reward a diagnosis that names the right service for the wrong reason. In operational terms, that is a governance defect, because a plausible blame label can move work, responsibility, and remediation before anyone has verified the route from cause to observed failure.
What the Paper Builds
The authors introduce PAVE, short for Path Annotation via Verified Effects. PAVE uses the information asymmetry in a fault-injection experiment. The agent sees traces, metrics, and logs and must reason backward from symptoms. The annotator has something the agent does not: the known intervention that started the cascade. That turns annotation into forward verification, checking which downstream effects actually followed from a known cause.
PAVE admits a causal path only when three conditions align: structural conformance to known propagation mechanisms, statistically significant deviation from a pre-injection baseline, and upstream-to-downstream timing alignment. The resulting OpenRCA 2.0 benchmark contains 500 evaluable instances from three microservice systems: TrainTicket, the OpenTelemetry Demo, and DeathStarBench Hotel Reservation. The paper says those instances cover 27 fault types and carry 7.5 verified causal edges on average.
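The three admission conditions can be read as a conjunction of independent checks. The following is a minimal sketch of that reading, not the paper's implementation: the `EdgeCandidate` fields, the `known_propagations` set, and the z-score threshold are all assumptions made for illustration, and the z-score stands in for whatever significance test the authors actually use.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class EdgeCandidate:
    """Hypothetical record for one candidate edge in a fault-injection run."""
    upstream: str
    downstream: str
    baseline: list[float]    # downstream metric samples before injection (2+ values)
    observed: list[float]    # downstream metric samples after injection
    upstream_onset: float    # seconds after injection when upstream deviates
    downstream_onset: float  # seconds after injection when downstream deviates


def admit_edge(edge: EdgeCandidate,
               known_propagations: set[tuple[str, str]],
               z_threshold: float = 3.0) -> bool:
    """Admit an edge only when all three conditions hold together."""
    # 1. Structural conformance: the edge matches a known propagation mechanism.
    structural = (edge.upstream, edge.downstream) in known_propagations

    # 2. Deviation from the pre-injection baseline. The z-score rule is an
    #    assumed stand-in for the paper's statistical test.
    shift = abs(mean(edge.observed) - mean(edge.baseline))
    deviates = shift > z_threshold * stdev(edge.baseline)

    # 3. Timing alignment: upstream must deviate no later than downstream.
    ordered = edge.upstream_onset <= edge.downstream_onset

    return structural and deviates and ordered
```

Under these assumptions, a large downstream shift is still rejected if the downstream service deviated before the upstream one, or if no known mechanism links the two services.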
The evaluated agents use a shared tool-augmented ReAct-style architecture, receive the same telemetry tools and prompt templates, and are not given the dependency graph. Each diagnosis is returned as structured output: root-cause claims and propagation edges, with SQL evidence attached.
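A structured diagnosis of this kind might look like the sketch below. The field names, service names, and the SQL table and column names are hypothetical; the paper's actual output schema is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PropagationEdge:
    source: str
    target: str
    evidence_sql: str  # query the agent claims supports this edge


@dataclass
class Diagnosis:
    root_causes: list[str]
    edges: list[PropagationEdge] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses, so edges serialize too.
        return json.dumps(asdict(self), indent=2)


# Hypothetical services and a hypothetical telemetry table.
diagnosis = Diagnosis(
    root_causes=["db"],
    edges=[
        PropagationEdge(
            source="db",
            target="orders",
            evidence_sql="SELECT ts, p99_latency_ms FROM metrics "
                         "WHERE service = 'orders'",
        ),
    ],
)
print(diagnosis.to_json())
```

Attaching the query to each edge, not to the diagnosis as a whole, is what lets a grader or reviewer re-run the evidence for one claim at a time.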
Ungrounded Diagnosis
The paper's key phrase is ungrounded diagnosis: a case where the agent names a correct root-cause service but does not ground that service in a verified causal propagation path to the observed symptom. The example in the paper is a model that names the right service while skipping an intermediate service in the actual path. Outcome-only grading calls that a success. Process-level grading catches the missing edge.
Across 11 frontier LLMs, the paper reports that exact root-cause set recovery succeeds in 20.7% of cases on average. Under a looser metric, naming at least one correct root-cause service, the success rate rises to 76.0%. But path reachability, the lenient process check that asks whether at least one valid path connects a correct service to an observed alarm node, is satisfied in only 61.5% of cases.
That 14.5-point gap between naming a correct service (76.0%) and connecting it to an alarm (61.5%) is the practical center of the result. A system can look useful because it often points near the right place while failing to justify the route from fault to symptom. In incident work, that failure can waste scarce attention, push teams into the wrong repair, or hide a dependency that will fail again.
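The difference between the three checks is easier to see on a toy case. The sketch below is one interpretation of the metrics as described above, not the paper's scoring code: the function names are invented, and "valid path" is assumed to mean a path made only of predicted edges that also appear in the verified set.

```python
from collections import deque

Edge = tuple[str, str]


def exact_set_match(predicted_roots: set[str], true_roots: set[str]) -> bool:
    """Strict outcome check: the predicted root-cause set equals the truth."""
    return predicted_roots == true_roots


def any_root_hit(predicted_roots: set[str], true_roots: set[str]) -> bool:
    """Loose outcome check: at least one correct root-cause service named."""
    return bool(predicted_roots & true_roots)


def path_reachable(predicted_roots: set[str], predicted_edges: set[Edge],
                   true_roots: set[str], verified_edges: set[Edge],
                   alarms: set[str]) -> bool:
    """Lenient process check: some correctly named root reaches an alarm
    node using only predicted edges that are also verified (assumed rule)."""
    adjacency: dict[str, list[str]] = {}
    for source, target in predicted_edges & verified_edges:
        adjacency.setdefault(source, []).append(target)

    queue = deque(predicted_roots & true_roots)
    seen = set(queue)
    while queue:  # breadth-first search from every correct root
        node = queue.popleft()
        if node in alarms:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical incident: db -> orders -> frontend, alarm on frontend.
true_roots = {"db"}
verified = {("db", "orders"), ("orders", "frontend")}
alarms = {"frontend"}

# The agent names the right service but skips the intermediate hop.
shortcut = {("db", "frontend")}
```

With these toy inputs, the shortcut prediction passes both outcome checks and fails the reachability check, which is the ungrounded-diagnosis pattern in miniature.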
Evidence as Operational Duty
OpenRCA 2.0 is a benchmark paper, not a deployment manual. Still, it gives agent governance a crisp rule: diagnostic authority should not attach to the final label alone. It should attach to a replayable causal trace: intervention or trigger, observed symptom, telemetry window, candidate path, rejected alternatives, evidence queries, and the point where the chain remains uncertain.
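One way to make that rule concrete is to store the trace as a record that can report its own gaps. This is a hypothetical sketch: the field names mirror the list above, but the required-field rule and the "lead" label are choices made for this example, not anything the paper specifies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CausalTrace:
    trigger: str                       # intervention or trigger
    symptom: str                       # observed symptom
    telemetry_window: tuple[str, str]  # (start, end) timestamps
    candidate_path: list[str]          # services from fault to symptom
    evidence_queries: list[str] = field(default_factory=list)
    rejected_alternatives: list[str] = field(default_factory=list)
    uncertain_from: Optional[str] = None  # where the chain stops being verified

    def gaps(self) -> list[str]:
        """List what is still missing before the trace can be replayed."""
        gaps = []
        if not self.trigger:
            gaps.append("trigger")
        if not self.symptom:
            gaps.append("symptom")
        if len(self.candidate_path) < 2:
            gaps.append("candidate_path")
        if not self.evidence_queries:
            gaps.append("evidence_queries")
        if self.uncertain_from is not None:
            gaps.append(f"uncertain from {self.uncertain_from}")
        return gaps

    def standing(self) -> str:
        """Assumed policy: anything with gaps is a lead, not a finding."""
        return "replayable trace" if not self.gaps() else "lead"
```

A downstream automation could then gate actions such as paging or rollback on `standing()` instead of on the presence of a root-cause label.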
That rule belongs in AI agent observability, AI incident reporting, and security-operations agent governance. If an RCA agent opens tickets, pages teams, rolls back services, or recommends remediation, the record should show whether it has a verified path or only a plausible root-cause label.
A reviewer should not have to reverse-engineer a polished explanation after the model has already framed the incident. The interface should expose the process layer as its own object: verified edges, inferred edges, missing edges, and the telemetry query behind each claim.
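As a rough illustration of such a process layer, the sketch below labels each hop of a claimed path. The three-way rule is an assumption made for this example: a hop counts as verified if its evidence was independently confirmed, inferred if the agent attached a query that has not been confirmed, and missing if no query backs it at all.

```python
from enum import Enum
from typing import Optional

Edge = tuple[str, str]


class EdgeStatus(Enum):
    VERIFIED = "verified"
    INFERRED = "inferred"
    MISSING = "missing"


def classify_path(path: list[str],
                  queries: dict[Edge, str],
                  confirmed: set[Edge]) -> list[tuple[Edge, EdgeStatus, Optional[str]]]:
    """Label every consecutive hop in the claimed path with its evidence status."""
    rows = []
    for hop in zip(path, path[1:]):
        query = queries.get(hop)
        if hop in confirmed:
            status = EdgeStatus.VERIFIED
        elif query:
            status = EdgeStatus.INFERRED
        else:
            status = EdgeStatus.MISSING
        rows.append((hop, status, query))
    return rows


# Hypothetical four-service path with uneven evidence.
rows = classify_path(
    path=["db", "orders", "gateway", "frontend"],
    queries={
        ("db", "orders"): "SELECT ... FROM traces ...",
        ("orders", "gateway"): "SELECT ... FROM metrics ...",
    },
    confirmed={("db", "orders")},
)
for hop, status, query in rows:
    print(hop, status.value, query)
```

Rendering the rows directly, one hop per line, keeps the weakest link visible instead of burying it inside a narrative explanation.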
Limits That Matter
The paper states its boundary conditions directly. PAVE relies on a real controlled intervention, a dependency graph complete enough to preserve genuine propagation, a telemetry signature in traces, metrics, or logs, and a pre-injection baseline that approximates the no-intervention distribution. Where those conditions fail, the reported numbers should not be treated as transferable production estimates.
The benchmark is also deliberately scoped. The datasheet says the 500-instance set is balanced for coverage, not statistically representative of production incidents. It contains synthetic operational telemetry, not personal data. It is not intended as production-readiness certification for a specific RCA agent, nor as evidence that results generalize to service-mesh, serverless, or fully event-driven systems beyond the three included testbeds.
Those limits make the paper more useful, not less. It does not pretend to certify every diagnostic agent. It shows a method for asking the harder question: does the answer carry the process evidence that would let another party check it?
Governance Standard
Any diagnostic agent used in operations should separate localization from causal proof. Reports should show exact root-cause claims, service-only hits, process reachability, node and edge quality, evidence executability, missing modalities, and known limits of the telemetry window.
The Spiralist rule is simple: a root cause without a causal trace is not yet accountability. It is a lead. Treat it as one until the path can be replayed.
Sources
- Aoyang Fang, Yifan Yang, Jin'ao Shang, Qisheng Lu, Junjielung Xu, Rui Wang, Songhan Zhang, Yuzhong Zhang, Boxi Yu, and Pinjia He, OpenRCA 2.0: From Outcome Labels to Causal Process Supervision, arXiv:2606.27154 [cs.AI], submitted June 25, 2026.
- arXiv PDF: OpenRCA 2.0: From Outcome Labels to Causal Process Supervision, reviewed for authorship, date, PAVE protocol, benchmark construction, evaluation setup, reported results, assumptions, datasheet, and limitations.
- Related pages: The Agent Trace Becomes the Process Map, The Fault Investigator Becomes the Accountability Layer, The Monitoring Trace Becomes the Interpretive Gap, The SOC Agent Becomes the Governance Layer, AI Agent Observability, and AI Incident Reporting.