The Judicial Discretion Model Becomes the Gate
Stanisław Sójka, Felix Steffek, and Matthias Grabmair's June 2026 arXiv paper treats judge identity as a conditional signal: not a verdict, not a profile, but a gate showing when adjudicative context changes a legal outcome model.
Not Legal Advice
The paper, arXiv:2606.27069 [cs.CL], was submitted on June 25, 2026. arXiv lists the exact title as Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning, by Stanisław Sójka, Felix Steffek, and Matthias Grabmair. The arXiv record also notes that the paper was accepted to the AI for Law Workshop at ICML 2026.
This page is not legal advice and not a claim that litigants, lawyers, employers, workers, or courts should use outcome prediction as a decision-maker. It is a governance reading of one machine-learning paper about legal outcome prediction. The important move is architectural: the authors ask when judge identity should influence a prediction, and when it should be suppressed.
The Paper Frame
The paper distinguishes merit-based determinations from non-merit-based disposals. A merit-based result depends on facts and legal merits. A non-merit disposal may turn on jurisdiction, deadlines, non-compliance, strike-out, or other procedural grounds. The authors' claim is that a standard legal classifier can blur those pathways, treating a technical termination and a substantive decision as if they came from the same kind of evidence.
That distinction matters for AI governance because "legal prediction" is usually too blunt a phrase. A model predicting whether a claimant wins may be detecting factual strength, procedural vulnerability, writing patterns in final judgments, tribunal practice, or a judge-specific case-management signal. The governance inference from the paper is that the system should expose which channel it is using.
The Tribunal Dataset
The study uses the CLC-UKETpred corpus of UK Employment Tribunal decisions from 2011 to 2023. After resolving judicial identities and excluding 673 cases with missing or ambiguous judge identifiers, the final experimental dataset contains 13,937 cases. The paper reports a 70 percent training, 15 percent validation, and 15 percent test split, with about 2,091 cases each in validation and test.
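The split sizes can be checked with simple arithmetic. The sketch below assumes the percentages are applied to the 13,937 retained cases, that validation and test are rounded to the nearest case, and that training takes the remainder. The paper's exact rounding rule is not restated on this page, so the training count is an illustration, not a reported figure.

```python
# Figures reported in the text above.
TOTAL_CASES = 13_937
SPLIT = {"train": 0.70, "validation": 0.15, "test": 0.15}

# Assumption: validation and test are rounded to the nearest case,
# and training takes whatever remains so the parts sum to the total.
validation = round(TOTAL_CASES * SPLIT["validation"])
test = round(TOTAL_CASES * SPLIT["test"])
train = TOTAL_CASES - validation - test

print(train, validation, test)
```

Under these assumptions, validation and test each come to 2,091 cases, which matches the "about 2,091" figure above.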
The coarse General Case Outcome task has four labels: Claimant Wins, Claimant Loses, Claimant Partly Wins, and Other. The authors also build an 11-class Detailed Case Outcome taxonomy with two UK labor-law scholars and an LLM-assisted extraction pipeline. They treat those detailed labels as noisy auxiliary supervision, not absolute ground truth. That caveat is doing real work: the taxonomy helps train the model, but it is not a universal legal ontology.
What the Gate Does
The proposed system combines a legal-text encoder, label-wise attention, multi-task learning, and a judge-aware gated fusion module. In the ModernBERT track, judge identity is not pasted into the text prompt. It enters through a learned embedding. The gate acts as a conditional switch: when the textual evidence appears discretionary or procedurally sensitive, the model can consult the judge embedding; when the outcome looks clearer, it can suppress that signal.
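One way to picture the conditional switch is a scalar gate computed from the text representation and applied to the judge embedding. The sketch below is a hypothetical illustration, not the authors' module: the function name `gated_fusion`, the additive fusion rule, the three-dimensional vectors, and the gate weights are all assumptions chosen to make the behavior visible.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, judge_vec, gate_weights, gate_bias):
    """Fuse a text vector with a judge embedding through a scalar gate.

    Assumed form (not taken from the paper):
        gate  = sigmoid(gate_weights . text_vec + gate_bias)
        fused = text_vec + gate * judge_vec
    """
    gate_logit = sum(w * t for w, t in zip(gate_weights, text_vec)) + gate_bias
    gate = sigmoid(gate_logit)
    fused = [t + gate * j for t, j in zip(text_vec, judge_vec)]
    return fused, gate

# Toy vectors: the first text dimension stands in for "how discretionary
# the text looks". All values are invented for illustration.
judge_vec = [1.0, 0.5, -0.5]
gate_weights = [10.0, 0.0, 0.0]
discretionary_text = [1.0, -0.1, 0.4]
clear_text = [-1.0, -0.1, 0.4]

fused_open, gate_open = gated_fusion(discretionary_text, judge_vec, gate_weights, 0.0)
fused_closed, gate_closed = gated_fusion(clear_text, judge_vec, gate_weights, 0.0)
print(round(gate_open, 4), round(gate_closed, 4))
```

When the gate is near zero, the fused vector stays close to the text vector and the judge signal is effectively suppressed. When the gate is near one, the judge embedding is added almost in full. The gate value itself is the inspectable quantity.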
The paper then compares this structured route with generative supervised fine-tuning of a Gemma-4 26B-A4B backbone, where judge identity and detailed-outcome supervision are supplied through prompt or output tokens. The key experiment isolates the conditioning interface. The same backbone can behave differently depending on whether institutional context is treated as prose in a prompt or as a structured signal exposed to gradients.
What the Results Show
The headline B2 model, a LoRA-adapted Gemma-4 encoder with the structured head, reaches 65.21 macro-F1 and 75.55 weighted-F1 on the UKET benchmark. The strongest prompt-style generative variant listed in the table reaches 60.12 macro-F1. The best ModernBERT discriminative model reaches 52.66 macro-F1. The authors report that the gains concentrate on rare and fuzzy classes, especially Partly Wins and Other.
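The gap between macro-F1 and weighted-F1 is what makes the rare-class claim legible: macro-F1 averages per-class scores equally, while weighted-F1 weights them by class support. The sketch below shows the two averages on invented counts. The numbers are illustrative only and are not taken from the paper's tables.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(per_class):
    """per_class maps label -> (tp, fp, fn); support is tp + fn."""
    scores = {label: f1(*counts) for label, counts in per_class.items()}
    support = {label: tp + fn for label, (tp, fp, fn) in per_class.items()}
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[l] * support[l] for l in scores) / sum(support.values())
    return macro, weighted

# Invented counts: two common classes scored well, two rare classes scored poorly.
toy_counts = {
    "Claimant Wins": (80, 20, 20),
    "Claimant Loses": (160, 40, 40),
    "Claimant Partly Wins": (10, 20, 20),
    "Other": (5, 10, 10),
}

macro, weighted = macro_and_weighted_f1(toy_counts)
print(round(macro, 4), round(weighted, 4))
```

In this toy case the weighted score sits well above the macro score because the weak classes are also the small ones. An improvement on Partly Wins or Other would therefore move macro-F1 much more than weighted-F1.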
The paper also performs counterfactual judge-swap analysis. In that test, the model's predicted outcome distribution is recalculated after substituting alternative judge identities. The point is not to declare what a real judge "is." It is to locate where adjudicative context changes the model. The authors explicitly state that learned judge representations are behavioral proxies from outcome data, not psychological profiles.
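A judge-swap check can be sketched as follows: hold the case text fixed, substitute each alternative judge identity, and measure how far the predicted distribution moves. Everything here is hypothetical: `toy_predict` stands in for a trained model, the judge offsets are invented, and total variation distance is one convenient choice of shift metric rather than the measure the paper necessarily uses.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def total_variation(p, q):
    """Half the L1 distance between two probability distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def judge_swap_shift(predict, case_text, factual_judge, alternative_judges):
    """Map each substituted judge to the shift it causes in the prediction."""
    base = predict(case_text, factual_judge)
    return {
        judge: total_variation(base, predict(case_text, judge))
        for judge in alternative_judges
        if judge != factual_judge
    }

# Hypothetical stand-in for a model: fixed text logits plus a judge offset.
JUDGE_OFFSETS = {
    "judge_a": [0.0, 0.0, 0.0, 0.0],
    "judge_b": [0.5, -0.5, 0.0, 0.0],
    "judge_c": [0.0, 0.0, 0.0, 0.0],
}

def toy_predict(case_text, judge_id):
    text_logits = [1.0, 0.5, 0.0, -0.5]  # the toy model ignores case_text
    return softmax([t + o for t, o in zip(text_logits, JUDGE_OFFSETS[judge_id])])

shifts = judge_swap_shift(toy_predict, "example case text", "judge_a", list(JUDGE_OFFSETS))
print(shifts)
```

A shift of zero means the substituted identity changed nothing for that case. A nonzero shift marks a case where adjudicative context moves the model, which is a statement about the model and not about the judge.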
Governance Reading
The Spiralist reading is that the gate is a better metaphor than the score. A score says the model is more accurate. A gate says where institutional context entered. For AI evaluations, that is the more useful artifact. It gives auditors a place to ask whether the system is measuring law, procedure, writing style, missing filings, judge identity, or a mixture of all five.
The same lesson applies to legal agents, synthetic evidence, and court-facing citation tools. Legal AI does not become legitimate because it uses a larger model. It becomes more inspectable when its conditioning interface, labels, training data, sensitive variables, and failure modes can be named before deployment.
Limits
The paper's own limits keep the claim bounded. The inputs come from final written judgments rather than original case filings at scale, so some performance may reflect retrospective factual summaries or writing cues rather than raw legal merit. The judge embeddings capture correlations in the available outcome data. They do not reveal judicial psychology or causal reasoning.
The 11-class detailed taxonomy is produced with LLM assistance and is treated as silver-standard supervision. The Gemma-4 comparisons use one model family and one seed. The empirical evaluation relies on one dataset in one jurisdiction, and the authors note that many jurisdictions redact judge identities, making similar analysis harder or inappropriate elsewhere. The page therefore treats the paper as a controlled design study, not a portable courthouse product.
Discretion Receipt
A legal-outcome model receipt should record: jurisdiction, date range, source corpus, filing availability, judgment-summary source, judge-identity policy, excluded-case count, label taxonomy, human expert role, automated-labeling pipeline, train/validation/test split, model backbone, conditioning interface, sensitive-variable treatment, per-class performance, counterfactual identity-swap behavior, calibration, stated limits, and prohibited uses. The audit-grade sentence is not "the model predicts justice." It is: under this dataset and this label scheme, this architecture shows where adjudicative context affects a prediction and where the evidence is still insufficient.
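The receipt can be sketched as a plain record whose unset fields are visible at a glance. The class name `DiscretionReceipt`, the field names, and the `missing_fields` helper are hypothetical. The filled-in values are only those stated on this page, and everything else is deliberately left empty rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class DiscretionReceipt:
    """One field per item in the receipt list above; None means not recorded."""
    jurisdiction: Optional[str] = None
    date_range: Optional[str] = None
    source_corpus: Optional[str] = None
    filing_availability: Optional[str] = None
    judgment_summary_source: Optional[str] = None
    judge_identity_policy: Optional[str] = None
    excluded_case_count: Optional[str] = None
    label_taxonomy: Optional[str] = None
    human_expert_role: Optional[str] = None
    automated_labeling_pipeline: Optional[str] = None
    data_split: Optional[str] = None
    model_backbone: Optional[str] = None
    conditioning_interface: Optional[str] = None
    sensitive_variable_treatment: Optional[str] = None
    per_class_performance: Optional[str] = None
    identity_swap_behavior: Optional[str] = None
    calibration: Optional[str] = None
    stated_limits: Optional[str] = None
    prohibited_uses: Optional[str] = None

    def missing_fields(self):
        """Names of receipt items that have not been filled in."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

# Partially filled example using only details stated on this page.
receipt = DiscretionReceipt(
    jurisdiction="UK Employment Tribunal",
    date_range="2011-2023",
    source_corpus="CLC-UKETpred",
    judgment_summary_source="final written judgments",
    excluded_case_count="673 cases with missing or ambiguous judge identifiers",
    label_taxonomy="4-class General Case Outcome; 11-class Detailed Case Outcome",
    data_split="70/15/15 train/validation/test",
)
print(receipt.missing_fields())
```

Keeping unrecorded items as explicit gaps is the point of the design: an auditor sees what has not been documented before reading any performance number.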
Sources
- Stanisław Sójka, Felix Steffek, and Matthias Grabmair, Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning, arXiv:2606.27069 [cs.CL], submitted June 25, 2026.
- arXiv PDF and experimental HTML versions, reviewed for title, authorship, submission date, workshop note, dataset size, label taxonomy, architecture, benchmark comparisons, counterfactual judge-swap method, impact statement, and stated limits.
- Related pages: AI Evaluations, The Legal Agent Becomes the Associate, The Synthetic Evidence Becomes the Court Record, and The Citation Machine Enters the Court.