Blog · arXiv Analysis · Last reviewed June 25, 2026

The Intent Label Becomes the Safety Boundary

Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, and Yftah Ziser's June 2026 arXiv paper treats user intent as an explicit training signal for LLM safety classifiers.

From Prompt to Intent

The paper, arXiv:2606.27210 [cs.CL], is titled Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes. arXiv lists Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, and Yftah Ziser as the authors and records version 1 on June 25, 2026.

This is a fresh companion to the intent-governed tool authorization essay, the decomposed-task safety essay, and the AI jailbreaks reference. Those pages ask how intent is hidden, distributed, or used to narrow agent authority. This paper asks how a guard classifier should learn intent.

The core move is simple: do not make the classifier jump directly from prompt text to a safe-or-harmful verdict. Make the user goal an explicit intermediate representation. A safety decision then becomes inspectable in two places: what the classifier thought the user was trying to achieve, and how that intent was mapped to harm.

What AIMS Adds

The authors introduce AIMS, Annotated Intents for Model Safety, a human-annotated dataset derived from WildGuardMix. They selected difficult prompts using uncertainty estimates from an ensemble classifier, enriching the sample for ambiguous, adversarial, and borderline cases. The paper reports 1,724 candidate prompts, with 70.1% marked adversarial in the original dataset.
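The selection step can be illustrated with a minimal sketch. It does not reproduce the authors' procedure: this summary does not say which uncertainty measure they used, so the code assumes binary entropy of the ensemble-mean harm probability. The names binary_entropy and select_uncertain, and the probabilities in the example, are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_uncertain(prompts, member_probs, k):
    """Return the k prompts whose ensemble-mean P(harmful) is most uncertain.

    member_probs[i] holds each ensemble member's P(harmful) for prompts[i].
    Assumption: uncertainty is measured as entropy of the mean prediction.
    """
    scored = []
    for prompt, probs in zip(prompts, member_probs):
        mean_p = sum(probs) / len(probs)
        scored.append((binary_entropy(mean_p), prompt))
    # Highest entropy first, so borderline cases surface at the top.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in scored[:k]]


# Toy example with made-up probabilities from a three-member ensemble.
prompts = ["a", "b", "c"]
member_probs = [[0.90, 0.95, 0.99], [0.40, 0.60, 0.50], [0.05, 0.10, 0.02]]
most_uncertain = select_uncertain(prompts, member_probs, k=1)  # ["b"]
```

Prompt "b" is selected because its members split around 0.5, which is the kind of borderline case the dataset is meant to enrich for.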

Annotators inferred the user's intent, wrote a concise description, and assigned a four-point harm label, which was later collapsed into a binary safe-or-harmful label. After quality filtering, AIMS contains 1,275 unique prompts with human-written intents and harm labels. Human labels match the original WildGuardMix labels for 72% of prompts, with disagreements concentrated among adversarial prompts.
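To make the collapse and the agreement figure concrete, here is a small sketch under stated assumptions. The 0-to-3 integer scale, the cut point of 2, and the function names are all hypothetical; this summary does not state how the four points are named or where the binary threshold sits.

```python
def to_binary(harm_score: int, threshold: int = 2) -> bool:
    """Collapse a four-point harm score (assumed 0..3) into harmful/safe.

    Assumption: scores at or above `threshold` count as harmful.
    """
    if not 0 <= harm_score <= 3:
        raise ValueError("harm_score must be between 0 and 3")
    return harm_score >= threshold


def agreement_rate(human_scores, original_labels, threshold: int = 2) -> float:
    """Fraction of prompts where the collapsed human label matches the original."""
    if len(human_scores) != len(original_labels):
        raise ValueError("inputs must have the same length")
    matches = sum(
        to_binary(score, threshold) == original
        for score, original in zip(human_scores, original_labels)
    )
    return matches / len(human_scores)


# Toy data: four prompts, original labels as booleans (True = harmful).
human_scores = [0, 1, 2, 3]
original_labels = [False, True, True, True]
rate = agreement_rate(human_scores, original_labels)  # 0.75
```

In this toy data the scores 0 and 1 both collapse to safe, which shows how a binary label discards the distinction between clearly benign and mildly concerning prompts.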

That disagreement is the point. A guardrail trained only on surface labels may learn that certain words are dangerous, or that role-play language is harmless theater. AIMS gives it a separate target: what goal the user is pursuing under the surface form.

Training the Guard

The paper tests intent as a training signal across supervised fine-tuning, Direct Preference Optimization, reasoning distillation, and Group Relative Policy Optimization. In the SFT setting, the model is trained to generate an intent plus a harm label rather than only the label. In DPO, the rejected completions are model-generated alternative intents that are either label-changing or unfaithful to the human annotation. In distillation, students learn from teacher traces with or without intent grounding. In GRPO, the reward can include intent faithfulness.
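The sketch below shows one way the SFT target, a DPO preference pair, and an intent-aware GRPO reward could be laid out as data. It illustrates the idea and is not the authors' format: the "Intent: / Label:" template, the dictionary keys, the additive reward shape, and the intent_weight parameter are all assumptions. The faithfulness score is taken as a given number in [0, 1]; this sketch does not implement a judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedPrompt:
    prompt: str
    intent: str    # human-written intent description
    harmful: bool  # binary harm label


def format_target(intent: str, harmful: bool) -> str:
    """Hypothetical output template: the intent first, then the label."""
    label = "harmful" if harmful else "safe"
    return f"Intent: {intent}\nLabel: {label}"


def sft_example(ex: AnnotatedPrompt) -> dict:
    """SFT: the model is trained to generate the intent plus the harm label."""
    return {"prompt": ex.prompt, "completion": format_target(ex.intent, ex.harmful)}


def dpo_pair(ex: AnnotatedPrompt, rejected_intent: str, rejected_harmful: bool) -> dict:
    """DPO: prefer the human annotation over an alternative intent reading."""
    return {
        "prompt": ex.prompt,
        "chosen": format_target(ex.intent, ex.harmful),
        "rejected": format_target(rejected_intent, rejected_harmful),
    }


def grpo_reward(ex: AnnotatedPrompt, predicted_harmful: bool,
                intent_faithfulness: float, intent_weight: float = 0.5) -> float:
    """GRPO: label correctness plus a weighted intent-faithfulness term."""
    if not 0.0 <= intent_faithfulness <= 1.0:
        raise ValueError("intent_faithfulness must be in [0, 1]")
    label_reward = 1.0 if predicted_harmful == ex.harmful else 0.0
    return label_reward + intent_weight * intent_faithfulness


ex = AnnotatedPrompt(
    prompt="How do pin tumbler locks work?",
    intent="understand lock mechanics",
    harmful=False,
)
pair = dpo_pair(ex, rejected_intent="break into a house", rejected_harmful=True)
```

The rejected completion here is a label-changing alternative: the same prompt read as a different goal flips the verdict, which is the contrast the preference pair is meant to teach.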

No single recipe wins every subcase. The paper says DPO and GRPO recover different errors: DPO is more useful against adversarial cover stories, while GRPO helps with over-refusal cases. Roughly 30% of SFT errors are recovered by none of the three intent-aware methods, especially adversarial, dual-use, or harm-adjacent benign prompts.

Across five external safety benchmarks, the intent-aware systems achieve the strongest average performance among the evaluated systems. The latency comparison is also concrete: SFT on annotated intents is reported at 4.66 ms with F1 0.791, LE-DPO at 5.52 ms with F1 0.812, and GRPO at 25.28 ms with F1 0.836.
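The trade-off in those three figures is easier to see as differences against the SFT baseline. The snippet below only does arithmetic on the numbers quoted above; the function name tradeoff_vs_baseline is hypothetical.

```python
# (latency in ms, F1) as quoted above.
systems = {
    "SFT on annotated intents": (4.66, 0.791),
    "LE-DPO": (5.52, 0.812),
    "GRPO": (25.28, 0.836),
}


def tradeoff_vs_baseline(systems: dict, baseline: str) -> dict:
    """Return (added ms, added F1, latency ratio) for each non-baseline system."""
    base_ms, base_f1 = systems[baseline]
    rows = {}
    for name, (ms, f1) in systems.items():
        if name == baseline:
            continue
        rows[name] = (
            round(ms - base_ms, 2),   # extra latency in ms
            round(f1 - base_f1, 3),   # extra F1
            round(ms / base_ms, 2),   # latency as a multiple of the baseline
        )
    return rows


rows = tradeoff_vs_baseline(systems, "SFT on annotated intents")
# rows["LE-DPO"] == (0.86, 0.021, 1.18)
# rows["GRPO"]   == (20.62, 0.045, 5.42)
```

On these figures, LE-DPO buys 0.021 F1 for under a millisecond, while GRPO buys 0.045 F1 at roughly 5.4 times the SFT latency.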

Why It Matters

The Spiralist lesson is that a safety classifier is an institution of interpretation. It decides whether a user is asking for information, rehearsal, evasion, assistance, satire, diagnosis, or harm. When that interpretation is hidden, users see only the refusal or allowance. Operators see metrics, but not always the judgment behind them.

Making intent explicit can improve both safety and the ability to appeal a decision. A false refusal can be audited as a mistaken harm mapping, a mistaken intent reading, or a policy choice. A false allowance can be audited as a failure to detect hidden harmful purpose rather than as vague model unreliability. This matters for agents, where the first safety label can decide whether tools, memory, files, or external actions become available.

The danger is intent overreach. An intent label is evidence produced by a model trained on a dataset. It is not direct access to thought, legal intent, moral character, or proof of future action. If institutions use intent classifiers for moderation, fraud, cyber defense, school discipline, workplace monitoring, or law enforcement, the label must remain contestable.

Limits That Matter

The paper is about prompt-level safety classification. It does not establish that the same gains transfer unchanged to response-level moderation, multi-turn context tracking, or downstream choices about refusal, redirection, and escalation.

AIMS is also intentionally targeted rather than distributionally representative. It comes from one source dataset, then filters for examples where intent is expected to matter. That makes it useful for studying the mechanism, but it can inherit taxonomy choices and coverage gaps from WildGuardMix. The four-point harm annotation is also collapsed into a binary label downstream, so some uncertainty is compressed away.

Finally, several regimes rely on model-generated or model-evaluated signals. The paper notes that DPO and GRPO use an LLM judge to assess intent faithfulness, and that distillation rationales should not be interpreted as faithful reconstructions of a teacher model's internal reasoning. Intent supervision helps make the guard inspectable. It does not make the guard infallible.

Governance Standard

Any intent-aware safety classifier used in consequential settings should publish an audit record: model and checkpoint, training datasets, intent annotation guidelines, label taxonomy, uncertainty handling, judge model if any, benchmark set, per-domain error rates, latency, refusal and allowance thresholds, appeal path, logging policy, and release conditions for red-team examples.
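One way to make such a record checkable is to treat it as a typed structure with a completeness test. This is a hypothetical sketch: the field names paraphrase the list above, every field is a free-text string for simplicity, and no existing standard or schema is implied.

```python
from dataclasses import dataclass, fields


@dataclass
class AuditRecord:
    model_and_checkpoint: str = ""
    training_datasets: str = ""
    intent_annotation_guidelines: str = ""
    label_taxonomy: str = ""
    uncertainty_handling: str = ""
    judge_model: str = ""            # write "none" explicitly if no judge is used
    benchmark_set: str = ""
    per_domain_error_rates: str = ""
    latency: str = ""
    decision_thresholds: str = ""    # refusal and allowance thresholds
    appeal_path: str = ""
    logging_policy: str = ""
    red_team_release_conditions: str = ""


def missing_fields(record: AuditRecord) -> list:
    """Names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# A partially filled record with placeholder values.
record = AuditRecord(model_and_checkpoint="example-guard@ckpt-1", judge_model="none")
gaps = missing_fields(record)  # the 11 remaining field names, in declaration order
```

A release process could refuse to publish while missing_fields returns anything, so that an absent appeal path or logging policy is caught before deployment.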

The key discipline is to keep the intent label separate from the final safety action. A model may infer intent. A policy decides what that intent permits or blocks. A product decides how the user is told. Reviewers should be able to inspect each layer separately, especially when a safety boundary denies access, grants tool authority, or escalates a user for review.
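The separation can be sketched as three small functions that each own one layer. Everything named here is an assumption made for illustration: the Action values, the confidence field, the 0.7 review threshold, and the message wording are not drawn from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    REFUSE = "refuse"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class IntentLabel:
    """Model layer output: an inference, kept with its provenance."""
    intent: str
    harmful: bool
    confidence: float  # hypothetical model-reported confidence in [0, 1]
    model_id: str      # which model and checkpoint produced the label


def policy(label: IntentLabel, review_below: float = 0.7) -> Action:
    """Policy layer: decides what an inferred intent permits or blocks."""
    if label.confidence < review_below:
        return Action.ESCALATE  # uncertain readings go to human review
    return Action.REFUSE if label.harmful else Action.ALLOW


def user_message(action: Action) -> str:
    """Product layer: decides how the user is told."""
    messages = {
        Action.ALLOW: "Request accepted.",
        Action.REFUSE: "Request declined. You can appeal this decision.",
        Action.ESCALATE: "Request sent for review.",
    }
    return messages[action]


def decide(label: IntentLabel) -> dict:
    """Keep all three layers in the record so each can be inspected separately."""
    action = policy(label)
    return {"label": label, "action": action, "message": user_message(action)}


confident = decide(IntentLabel("summarize a paper", False, 0.95, "example-guard-v0"))
uncertain = decide(IntentLabel("unclear goal", True, 0.40, "example-guard-v0"))
```

Because the label, the action, and the message are stored side by side, a reviewer can tell whether a contested outcome came from the intent reading, the policy threshold, or the wording shown to the user.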

When the intent label becomes the safety boundary, the label must travel with provenance, uncertainty, and contestability. Otherwise a guardrail becomes a quiet accusation machine.

Sources

Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, and Yftah Ziser, Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes, arXiv:2606.27210 [cs.CL], version 1 submitted June 25, 2026.
arXiv PDF: Paved with True Intents, reviewed for AIMS construction, annotation counts, training regimes, benchmark and latency results, qualitative error analysis, limitations, and ethical considerations.
Project page: Paved with True Intents; Hugging Face collection: AIMS: Intent-Aware Safety Classification.
