Blog · arXiv Analysis · Published: June 25, 2026

The Forecast Explanation Becomes the Markov Receipt

Amadeo Tunyi's KARMA paper turns the explanation of a time-series forecast into a conditional record: what recent history was enough, which transitions mattered, and where the estimate is thin.

The Paper

The paper is "Global Explanations for Multivariate Time Series Forecasting Models via K-Order Markov Approximations," arXiv:2606.27599 [cs.LG]. arXiv lists it as submitted on June 25, 2026, by Amadeo Tunyi, with DOI 10.48550/arXiv.2606.27599. The arXiv record says it was accepted at the Workshop on Explainable Artificial Intelligence, International Joint Conference on Artificial Intelligence 2026.

The paper introduces KARMA, a method for explaining black-box time-series predictors by building a K-order Markov surrogate. The practical questions are simple: how much recent history does the model actually use, what conditional transitions does that imply, and can those transitions be turned into global explanations rather than one-off local attributions?

Sequential Gap

Many familiar explanation methods were built for static feature tables. When they are carried into time-series forecasting, they can treat nearby timestamps as if they were interchangeable independent columns. That is the gap the paper attacks. A forecast model may respond to momentum, delayed effects, recurring regimes, or cross-variable lags. If the audit destroys those relations while asking what mattered, the explanation can become a story about the audit procedure rather than a story about the model.

KARMA's answer is to preserve the sequential frame. It asks whether the predictor's behavior can be approximated by a Markov chain over a discretized history. The method selects a predictively sufficient order K, estimates a transition kernel over that retained history, and then reads explanations from the kernel. This changes the unit of explanation from isolated feature importance to conditional history: given this recent pattern, how does the model distribute the next state?
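As a rough illustration of that pipeline, the following Python sketch discretizes a one-dimensional stream of predictions into quantile bins and estimates a K-order transition kernel by counting. The function names, the quantile binning, the simulated predictions, and the choice of k=2 are all assumptions made for this example; the paper's own discretization and estimator may differ.

```python
from collections import Counter, defaultdict

import numpy as np


def discretize(values, n_bins=3):
    """Map each value to a quantile bin index in {0, ..., n_bins - 1}."""
    values = np.asarray(values, dtype=float)
    # Interior quantile edges; np.digitize turns them into bin indices.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)


def estimate_kernel(states, k):
    """Count-based estimate of P(next state | last k states)."""
    counts = defaultdict(Counter)
    for t in range(k, len(states)):
        history = tuple(int(s) for s in states[t - k:t])
        counts[history][int(states[t])] += 1
    kernel = {}
    for history, nxt in counts.items():
        total = sum(nxt.values())
        kernel[history] = {s: c / total for s, c in nxt.items()}
    return kernel


# Hypothetical stand-in for a black-box forecaster's one-step predictions.
rng = np.random.default_rng(0)
predictions = np.sin(np.arange(500) / 10.0) + 0.1 * rng.normal(size=500)

states = discretize(predictions, n_bins=3)
kernel = estimate_kernel(states, k=2)

# Each row answers: given this recent pattern, how is the next state distributed?
for history in sorted(kernel)[:3]:
    print(history, kernel[history])
```

Each key of the resulting dictionary is a retained history and each value is a conditional distribution over the next state, which is the unit of explanation the paragraph above describes.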

That shift is small in vocabulary and large in accountability. A static attribution can say that humidity, price, heart rate, or demand mattered somewhere in the input window. A sequential explanation has to say whether the relevant signal was the latest value, the one-step lag, a longer memory pattern, or a cross-variable delay. The difference matters because users often act on the explanation, not only on the forecast. If the explanation points to the wrong time relation, the human repair can be aimed at the wrong part of the system.

Markov Receipt

A Markov surrogate is not only a technical shortcut. It is a receipt format. It records the retained lag length, the discretization scheme, the transition probabilities, the baseline prefix, and the estimated reliability of each region of history space. Those are precisely the objects an operator needs when a forecast explanation is supposed to support a decision rather than decorate a dashboard.

The receipt also makes an important negative claim possible. When KARMA finds that K is smaller than the model's full input window, the paper argues that lags beyond K can receive certified-zero attribution within the chosen approximation error. That is stronger than saying a heatmap looks faint. It says the retained surrogate does not need those older inputs to mimic the model's prediction behavior within the stated tolerance.
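One way to picture "K smaller than the full window, within a tolerance" is the sketch below. It is not the paper's selection rule: it assumes, purely for illustration, that sufficiency is measured by empirical conditional entropy, and it picks the smallest order whose entropy is within a tolerance of the full window's. The toy sequence, the functions conditional_entropy and select_order, and the tolerance of 0.05 bits are all hypothetical.

```python
import math
from collections import Counter, defaultdict

import numpy as np


def conditional_entropy(states, k, start):
    """Empirical H(next | last k states) in bits, using targets from index start on."""
    counts = defaultdict(Counter)
    for t in range(start, len(states)):
        counts[tuple(states[t - k:t])][states[t]] += 1
    n = len(states) - start
    h = 0.0
    for nxt in counts.values():
        total = sum(nxt.values())
        for c in nxt.values():
            h -= (c / n) * math.log2(c / total)
    return h


def select_order(states, k_max, tol):
    """Smallest k whose entropy is within tol bits of the full window (k_max)."""
    full = conditional_entropy(states, k_max, start=k_max)
    for k in range(k_max + 1):
        # Same targets for every k, so the entropies are directly comparable.
        if conditional_entropy(states, k, start=k_max) - full <= tol:
            return k
    return k_max  # reached only if tol is negative


# Toy binary series that truly depends on exactly two lags (noisy XOR).
rng = np.random.default_rng(0)
seq = [0, 1]
for _ in range(5000):
    flip = int(rng.random() < 0.1)
    seq.append(seq[-1] ^ seq[-2] ^ flip)

k_selected = select_order(seq, k_max=4, tol=0.05)
print("selected order:", k_selected)
```

In this toy setting, lags three and four add almost nothing once the two most recent states are known, which is the kind of evidence a negative claim about older lags would rest on. The comparison is made in-sample here for brevity; a held-out comparison would be a more careful choice.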

The paper's estimation section also matters for deployment. It distinguishes full joint kernels from marginal kernels and describes strategies that trade simplicity against data efficiency. The default sampling strategy restricts attention to histories observed in the training distribution, bounding the effective history support by the length of the series rather than by the full exponential state space. That is not just an optimization detail. It is the difference between an explanation method that looks elegant on a whiteboard and one that can carry a coverage warning in ordinary audit work.
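The gap between the full state space and the observed support is easy to see with a little arithmetic. The sketch below uses the eleven variables, three bins, and K of four that this post cites for the Beijing PM2.5 case study; the series length of 40,000 is a hypothetical round number, and both helper functions are assumptions made for illustration.

```python
def full_history_space(n_bins, n_vars, k):
    """Number of distinct joint histories: every variable at every lag takes one of n_bins values."""
    return n_bins ** (n_vars * k)


def observed_history_bound(series_length, k):
    """A series of length T contains at most T - k + 1 distinct length-k windows."""
    return max(series_length - k + 1, 0)


full = full_history_space(n_bins=3, n_vars=11, k=4)
observed = observed_history_bound(series_length=40_000, k=4)  # hypothetical length

print(f"full joint history space: {full:.3e}")
print(f"upper bound on observed histories: {observed}")
```

Restricting estimation to observed histories therefore replaces a count on the order of 3 to the 44th power with one that cannot exceed the number of windows in the data.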

Five Levels

The paper derives a five-level hierarchy from the estimated kernel. The first level ranks source-variable importance. The second level separates importance by lag. The third identifies distinctive historical regimes. The fourth estimates average interventional effects and supports a model-induced causal graph under the paper's assumptions. The fifth reports uncertainty, including where the estimate is thin because histories have low coverage.

This hierarchy is useful because it separates questions that are often collapsed into a single saliency picture. "Which variable matters?" is not the same as "which lag matters?" or "which regime makes the forecast unusual?" or "which part of the explanation is poorly supported by data?" An audit trail that keeps those answers apart is harder to oversell and easier to dispute.

The fifth level is especially important for institutional use. Forecast explanations often arrive with the visual confidence of a finished chart, even when some historical patterns were rarely observed. A coverage and uncertainty layer changes the tone of the artifact. It lets the explanation say, in effect, this part is well supported, this part is extrapolated from sparse histories, and this part should trigger more data collection before it is treated as evidence.
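A coverage layer of this kind can be sketched with plain counts. The example below is an assumption about what such a report might contain, not the paper's uncertainty estimator: it counts how often each length-k history occurs, flags histories below a hypothetical minimum support, and reports what share of observed transitions sits on well-supported histories.

```python
from collections import Counter


def coverage_report(states, k, min_count):
    """Count how often each length-k history occurs and flag sparse ones."""
    counts = Counter(tuple(states[t - k:t]) for t in range(k, len(states)))
    total = sum(counts.values())
    report = {
        history: {"count": c, "share": c / total, "sparse": c < min_count}
        for history, c in counts.items()
    }
    # Share of observed transitions whose history meets the support threshold.
    supported_share = sum(
        row["share"] for row in report.values() if not row["sparse"]
    )
    return report, supported_share


# Tiny hand-made state sequence so the counts are easy to check by eye.
states = [0, 0, 1, 0, 0, 1, 0, 0, 1, 2]
report, supported_share = coverage_report(states, k=2, min_count=3)

for history, row in sorted(report.items()):
    print(history, row)
print("share on well-supported histories:", supported_share)
```

Here the history (1, 0) appears only twice and is flagged as sparse, so any explanation read from that row would carry a warning rather than the same visual confidence as the others.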

Experiments

The paper evaluates KARMA against TimeSHAP, WinIT, Dynamask, Feature Occlusion, and Integrated Gradients, with synthetic vector autoregression settings and real-world forecasting benchmarks. In the synthetic case, the true data-generating structure is known, so the method can be checked against planted causal edges rather than judged only by surface plausibility.
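To make "checked against planted causal edges" concrete, the sketch below simulates a small first-order vector autoregression with one known cross-variable edge and compares it with the edges recovered by a simple least-squares fit. The least-squares fit is only a stand-in for an explanation method, not KARMA or any of the baselines, and the coefficient matrix, noise scale, series length, and 0.2 threshold are all hypothetical choices.

```python
import numpy as np

# Planted VAR(1) coefficients: A[i, j] != 0 means variable j drives variable i.
A = np.array([
    [0.5, 0.0, 0.0],
    [0.4, 0.5, 0.0],
    [0.0, 0.0, 0.5],
])
# Cross-variable edges only, written as (source, target).
planted_edges = {
    (j, i) for i in range(3) for j in range(3) if i != j and A[i, j] != 0.0
}

rng = np.random.default_rng(0)
n_steps = 2000
x = np.zeros((n_steps, 3))
for t in range(1, n_steps):
    x[t] = A @ x[t - 1] + rng.normal(size=3)

# Stand-in "explainer": least-squares regression of x[t] on x[t-1].
coef, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
A_hat = coef.T  # lstsq solves x[:-1] @ coef = x[1:], so transpose to match A

recovered_edges = {
    (j, i) for i in range(3) for j in range(3)
    if i != j and abs(A_hat[i, j]) > 0.2
}

print("planted:  ", planted_edges)
print("recovered:", recovered_edges)
```

Because the generating matrix is known, agreement between the two edge sets can be scored directly, which is what separates a synthetic check from a judgment about whether an attribution merely looks plausible.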

For a Beijing PM2.5 case study, the paper applies the five-level hierarchy to a temporal convolutional network using eleven variables, three bins, and a selected K of four. In real-world experiments, the paper reports results across the ETTh1, Beijing PM2.5, and Exchange Rate datasets and across GRU, LSTM, and TCN architectures. It says KARMA is strongest on ETTh1, competitive on Beijing PM2.5, and flat on Exchange Rate, where the series behaves close to a random walk under the paper's metric.

Governance Use

The governance value is not that a Markov surrogate automatically makes the original model transparent. The value is procedural. It demands that every explanation travel with an order K, a state construction, a transition estimator, a tolerance, a coverage map, a baseline, and a statement about which assumptions connect model-induced structure to real causal structure.

That is exactly where many deployed forecasting systems become weak. The organization can say the model produced an alert, but cannot say which temporal pattern triggered it, whether the claimed driver was a current value or a lagged one, whether the explanation was local or global, or whether a sparse region of the data was being treated as settled evidence. KARMA does not remove that burden. It gives the burden a schema.

A governance process built around that schema would ask for more than a pretty attribution graphic. It would ask for the retained history length, the feature bins, the source data window, the surrogate fidelity threshold, the estimator used, the histories excluded by low support, the uncertainty map, and the reviewer decision. It would also separate the model-induced graph from claims about the real world. A model can learn a dependency that is operationally useful, spurious, stale, or caused by measurement practice. The receipt must keep that distinction open.
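The items in that list map naturally onto a record type. The dataclass below is a hypothetical schema, not something defined by the paper: every field name, the string-keyed uncertainty map, and the example values other than K of four and three bins are assumptions made to show what a serializable receipt could look like.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class MarkovReceipt:
    """Hypothetical audit record that travels with one global explanation."""
    order_k: int                     # retained history length
    n_bins: int                      # bins per feature
    source_window: tuple             # (start, end) of the data used
    fidelity_tolerance: float        # surrogate fidelity threshold
    estimator: str                   # e.g. "joint" or "marginal"
    excluded_histories: list = field(default_factory=list)      # dropped for low support
    uncertainty_by_history: dict = field(default_factory=dict)  # history label -> interval width
    causal_assumptions: str = ""     # what links the model-induced graph to the world
    reviewer_decision: str = "pending"

    def to_json(self):
        """Serialize the receipt so it can be stored next to the explanation."""
        return json.dumps(asdict(self), sort_keys=True)


# Example values are placeholders, apart from K=4 and three bins.
receipt = MarkovReceipt(
    order_k=4,
    n_bins=3,
    source_window=("2020-01-01", "2020-12-31"),
    fidelity_tolerance=0.05,
    estimator="marginal",
    excluded_histories=[[2, 2, 0, 1]],
    uncertainty_by_history={"0,1,1,2": 0.12},
    causal_assumptions="model-induced graph only; no claim about real-world causes",
)

print(receipt.to_json())
```

Keeping the causal assumptions and the reviewer decision as explicit fields means the record cannot be filed without someone stating where model behavior ends and claims about the world begin.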

Claim Boundary

The paper is careful about scale. Full joint kernel estimation can grow exponentially with the number of bins, variables, and retained lags, so KARMA uses marginal kernels and sampling strategies to stay tractable. The paper treats very high-dimensional settings as an open direction, and its causal reading depends on a faithfulness assumption and on the model-induced graph accurately representing the relevant conditional structure.

Those limits should stay visible. A Markov receipt is not proof that a forecast is right, fair, or safe. It is a disciplined way to say what the model's temporal behavior appears to depend on, what evidence supports that statement, and which regions of the explanation should not be treated as settled.

That is a modest claim, and it is the useful one. Forecasting systems do not need ritual certainty from their explanation layer. They need a reproducible account of the temporal assumptions, the surrogate fit, the evidence coverage, and the unresolved boundary between model behavior and world behavior.
