Blog · arXiv Analysis · Last reviewed June 25, 2026

The Fairness Audit Becomes the Query Budget

Ioannis Pitsiorlas, Martha V. Sourla, and Marios Kountouris ask how an auditor should test a deployed classifier when the audit consists of queries, outputs, and a finite budget.

The Audit Is the Stopping Rule

A fairness audit is often described as if the auditor receives the model, training data, deployment logs, and a quiet room. Many real audits are less generous. The system is already deployed, and the auditor may be allowed only to submit cases and observe decisions, perhaps with scores or logits, without seeing parameters or training records.

In that world, the fairness audit becomes the query budget. The practical question is not only "is this model fair?" It is also: how many cases must be queried, what can be observed, what tolerance defines compliance, and when is the evidence strong enough to stop?

The Spiralist angle is that the audit is a procedure with a receipt: access regime, protected attribute, metric, tolerance, query source, stopping rule, and the option to say "inconclusive."

The Paper Frame

The source is Sequential Fairness Auditing with Limited Output Access, arXiv:2606.30338v1 [cs.AI], by Ioannis Pitsiorlas, Martha V. Sourla, and Marios Kountouris. The arXiv record lists submission on June 29, 2026.

The paper starts from a governance fact: external evaluations are becoming more important, while independent auditors often have limited access to deployed systems. This connects to broader topics such as AI audit interfaces, AI audits and assurance, and algorithmic impact assessments, but its pressure point is narrower: what can be concluded when the auditor asks the system questions one at a time?

What the Auditor Can See

The empirical setting is a deployed binary classifier. Each input carries a binary sensitive attribute that splits cases into two groups. For Equal Opportunity, a ground-truth label is also needed because the metric conditions on positively labeled cases. The auditor does not receive model parameters or training data.

The paper studies three observability regimes. In decision-only access, the auditor sees binary predictions and performs hard-decision Statistical Parity or Equal Opportunity audits. In score access, the auditor sees prediction scores and performs score-based proxy audits. In logit access, the auditor sees logits before the sigmoid layer and performs an analogous proxy audit. A score or logit can reveal more about confidence, but it is not the same object as a hard decision.
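The three regimes can be pictured as three views of the same underlying model output. The sketch below is illustrative only: the names AccessRegime and observe are hypothetical, and the 0.5 decision threshold is an assumption, not something the paper is quoted as using.

```python
import math
from enum import Enum


class AccessRegime(Enum):
    DECISION = "decision"  # hard 0/1 prediction only
    SCORE = "score"        # sigmoid output in [0, 1]
    LOGIT = "logit"        # value before the sigmoid layer


def observe(logit: float, regime: AccessRegime, threshold: float = 0.5) -> float:
    """Return what the auditor would see for one query under a given regime."""
    score = 1.0 / (1.0 + math.exp(-logit))
    if regime is AccessRegime.LOGIT:
        return logit
    if regime is AccessRegime.SCORE:
        return score
    # Assumed thresholding rule for the hard decision.
    return 1.0 if score >= threshold else 0.0


# Two cases with very different confidence collapse to the same hard decision.
for z in (0.1, 4.0):
    print(z, observe(z, AccessRegime.DECISION), round(observe(z, AccessRegime.SCORE), 3))
```

The loop shows why the regimes are different audit surfaces: the decision view cannot distinguish a marginal case from a confident one, while the score and logit views can.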

The Sequential Test

The method is a sequential generalized likelihood-ratio audit. Instead of drawing a fixed sample, computing a disparity once, and declaring a result, the auditor accumulates evidence from a finite audit pool. The statistic is compared with operational stopping boundaries, and the procedure can accept compliance, reject compliance, or continue until the query budget or finite pool is exhausted. If no boundary is crossed, the outcome is inconclusive.
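To make the accept, reject, or inconclusive loop concrete, here is a minimal sketch for decision-only Statistical Parity. It is not the paper's statistic or its stopping boundaries: it assumes a plain two-binomial likelihood, approximates the constrained maxima with a grid search, and uses a single uncalibrated threshold. Every name and parameter (sequential_audit, constrained_max, threshold, min_per_group) is hypothetical.

```python
import math

EPS = 1e-9  # clip probabilities so log() stays finite


def loglik(k, n, p):
    """Binomial log-likelihood (constant dropped) of k positives in n queries."""
    p = min(max(p, EPS), 1.0 - EPS)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def constrained_max(k0, n0, k1, n1, tol, inside, grid=200):
    """Grid-search max log-likelihood with |p0 - p1| <= tol (inside) or >= tol."""
    p1_hat, best = k1 / n1, -math.inf
    for i in range(grid + 1):
        p0 = i / grid
        lo, hi = max(0.0, p0 - tol), min(1.0, p0 + tol)
        if inside:
            # Concave in p1, so the best feasible p1 is the MLE clipped to the band.
            cands = [min(max(p1_hat, lo), hi)]
        else:
            # Outside the band: the MLE if feasible, otherwise a band edge.
            cands = [p1_hat] if abs(p1_hat - p0) >= tol else []
            cands += [c for c in (p0 - tol, p0 + tol) if 0.0 <= c <= 1.0]
        for p1 in cands:
            best = max(best, loglik(k0, n0, p0) + loglik(k1, n1, p1))
    return best


def sequential_audit(pool, query, tol, threshold, budget, min_per_group=10):
    """Query pool items in order and stop at the first boundary crossing."""
    counts = {0: [0, 0], 1: [0, 0]}  # group -> [positive decisions, queries]
    used = 0
    for x, g in pool[:budget]:
        used += 1
        counts[g][0] += query(x)
        counts[g][1] += 1
        (k0, n0), (k1, n1) = counts[0], counts[1]
        if min(n0, n1) < min_per_group:
            continue  # too few queries per group to form a stable statistic
        full = loglik(k0, n0, k0 / n0) + loglik(k1, n1, k1 / n1)
        if full - constrained_max(k0, n0, k1, n1, tol, True) >= threshold:
            return "reject", used  # evidence against |p0 - p1| <= tol
        if full - constrained_max(k0, n0, k1, n1, tol, False) >= threshold:
            return "accept", used  # evidence against |p0 - p1| >= tol
    return "inconclusive", used


# Toy pools: items are (decision, group); query() just reads the stored decision.
unfair = [(1 if i % 2 == 0 else 0, i % 2) for i in range(400)]
fair = [((i // 2) % 2, i % 2) for i in range(2000)]
print(sequential_audit(unfair, lambda x: x, tol=0.05, threshold=3.0, budget=400))
print(sequential_audit(fair, lambda x: x, tol=0.1, threshold=3.0, budget=2000))
print(sequential_audit(fair, lambda x: x, tol=0.1, threshold=3.0, budget=100))
```

The three calls exercise the three outcomes: a large disparity is rejected after few queries, equal rates are accepted only after many more, and the same fair pool under a small budget ends inconclusive. The threshold of 3.0 is arbitrary and carries no error guarantee.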

The fairness criteria are Statistical Parity and Equal Opportunity. Statistical Parity asks whether positive prediction rates are approximately equal across protected groups. Equal Opportunity asks whether true positive rates are approximately equal across groups, conditioning on the positive ground-truth class. The compliance question is tolerance-based, so a small disparity inside the configured tolerance can still count as compliant.
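As a fixed-sample illustration of the two criteria, the sketch below computes each gap from hard decisions and compares it with a tolerance. The function names and the toy data are hypothetical, and this is a one-shot point estimate, not the sequential procedure.

```python
def _rate(hits, total):
    if total == 0:
        raise ValueError("no cases available for this group")
    return hits / total


def statistical_parity_gap(y_pred, group):
    """Absolute gap in positive prediction rates between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        preds = [p for p, a in zip(y_pred, group) if a == g]
        rates.append(_rate(sum(preds), len(preds)))
    return abs(rates[0] - rates[1])


def equal_opportunity_gap(y_pred, y_true, group):
    """Absolute gap in true positive rates; only cases with y_true == 1 count."""
    rates = []
    for g in (0, 1):
        preds = [p for p, y, a in zip(y_pred, y_true, group) if a == g and y == 1]
        rates.append(_rate(sum(preds), len(preds)))
    return abs(rates[0] - rates[1])


def is_compliant(gap, tolerance):
    """Tolerance-based compliance: small gaps inside the tolerance still pass."""
    return gap <= tolerance


# Made-up toy data: eight cases, four per group.
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
sp = statistical_parity_gap(y_pred, group)
eo = equal_opportunity_gap(y_pred, y_true, group)
print(sp, is_compliant(sp, 0.05))
print(eo, is_compliant(eo, 0.05))
```

Note that the Equal Opportunity function discards every case with a negative label, so the same pool yields fewer usable observations for that metric than for Statistical Parity.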

What the Results Show

The experiments use the American Community Survey Income task from Folktables and the UCI Adult dataset. Both predict whether income exceeds $50,000 and use sex as the binary protected attribute. The audited models are multilayer perceptrons, including robust and unstable base models and fairness-constrained variants trained with an Exponentiated Gradient reduction approach. Configurations are evaluated over 20 runs with shuffled finite audit pools.

The reported pattern is not "more access always wins." For Statistical Parity, richer output access reduces query cost in the experiments. For Equal Opportunity, the task is more sample intensive because it uses the smaller pool of positively labeled cases, and Adult EO cases near the tolerance boundary often remain inconclusive within budget.

Sequential auditing can reduce query cost sharply; the paper notes Adult Statistical Parity decisions with fewer than 250 samples on average against a full budget of 4,000, which is under 6.25% of the cap. But challenging EO settings show an efficiency-accuracy tradeoff, especially when inconclusive outcomes count against accuracy.
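How inconclusive runs are scored changes the headline accuracy. The sketch below contrasts two scoring policies over a batch of audit runs; the function names are hypothetical and the 20 outcomes are invented for illustration, not taken from the paper's results.

```python
def audit_accuracy(outcomes, truths, inconclusive_is_error=True):
    """Fraction of correct verdicts, with or without inconclusive runs in the denominator."""
    decided = [(o, t) for o, t in zip(outcomes, truths) if o != "inconclusive"]
    correct = sum(o == t for o, t in decided)
    denom = len(outcomes) if inconclusive_is_error else len(decided)
    if denom == 0:
        raise ValueError("no runs to score")
    return correct / denom


def inconclusive_rate(outcomes):
    """Share of runs that ended without crossing a stopping boundary."""
    return sum(o == "inconclusive" for o in outcomes) / len(outcomes)


# Invented batch of 20 runs where the true status is non-compliant ("reject").
outcomes = ["reject"] * 12 + ["accept"] * 1 + ["inconclusive"] * 7
truths = ["reject"] * 20

print(audit_accuracy(outcomes, truths, inconclusive_is_error=True))
print(audit_accuracy(outcomes, truths, inconclusive_is_error=False))
print(inconclusive_rate(outcomes))
```

Under the strict policy the seven inconclusive runs count as misses; under the lenient policy they are dropped, which makes the same batch look much more accurate. Reporting the inconclusive rate alongside either number keeps the comparison honest.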

Why Governance Should Care

The page-level lesson is that an audit result should not travel alone. "Passed a fairness audit" is too thin unless the record includes the metric, protected attribute, finite audit pool, tolerance, query cap, output access level, stopping thresholds, observed outcome, and inconclusive rate.

For a regulator, procurement office, or civil-rights reviewer, this keeps black-box evaluation honest. Decision-only access is not the same audit surface as scores, logits, calibrated confidence, or richer records. A Statistical Parity result is not an Equal Opportunity result. A proxy audit is not an exact hard-decision compliance proof.

The method also gives inconclusive outcomes a proper place. Inconclusive is evidence about the audit itself: the query budget, access level, metric, and threshold did not support a stable conclusion. That is part of the public record.

Limits

The authors name several limits. The study focuses on binary-group fairness in binary classification, leaving multi-group and multiclass settings to future work. Score and logit audits are proxy audits and do not provide exact compliance guarantees for hard-decision Statistical Parity or Equal Opportunity. The sequential GLR thresholds are operational thresholds rather than exact finite-sample calibrations.

There is also a broader audit limit. Group-disparity testing cannot by itself prove that a system is explainable, lawful, contestable, privacy-preserving, or free of proxy discrimination. It can test a defined disparity under a defined observation regime.

Audit Receipt

The audit-grade sentence is: Pitsiorlas, Sourla, and Kountouris's arXiv:2606.30338 formulates limited-access fairness auditing as a tolerance-aware sequential generalized likelihood-ratio problem over finite audit pools, instantiates it for Statistical Parity and Equal Opportunity, and shows that audit efficiency depends on the metric, access regime, query budget, and distance from the tolerance boundary.

The practical receipt is: never cite a black-box fairness audit without the access regime, protected attribute, metric, tolerance, query source, model version, stopping rule, query cap, inconclusive outcome policy, and distinction between exact hard-decision audits and proxy score or logit audits.
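One way to enforce that receipt is to treat it as a record with required fields. The sketch below is a hypothetical data structure, not anything defined in the paper; the field names mirror the list above and the example values are invented.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AuditReceipt:
    access_regime: Optional[str] = None        # decision, score, or logit
    protected_attribute: Optional[str] = None
    metric: Optional[str] = None               # e.g. Statistical Parity
    tolerance: Optional[float] = None
    query_source: Optional[str] = None
    model_version: Optional[str] = None
    stopping_rule: Optional[str] = None
    query_cap: Optional[int] = None
    inconclusive_policy: Optional[str] = None
    exact_or_proxy: Optional[str] = None       # hard-decision audit vs proxy audit

    def missing(self) -> List[str]:
        """Names of fields that are still unset, in declaration order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Invented example: a receipt that is not yet fit to cite.
receipt = AuditReceipt(
    access_regime="decision",
    protected_attribute="sex",
    metric="Statistical Parity",
    tolerance=0.05,
    query_cap=4000,
)
print(receipt.missing())
```

A claim would be citable only when missing() returns an empty list, which makes the gaps in a thin "passed a fairness audit" statement visible at a glance.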
