Last reviewed June 25, 2026

The Prompt Module Becomes the Shared Context

Ching-Yu Lin and Yifan Liu's June 2026 arXiv paper, "Instruction Bleed: Cross-Module Interference in Prompt-Composed Agentic Systems," gives a name and a test protocol to a quiet agent failure mode: one prompt module can change the behavior of another even when no programmer intended a dependency.

From File to Atmosphere

The paper, arXiv:2606.26356 [cs.AI], was submitted on June 24, 2026. It studies prompt-composed agentic systems: agents whose behavioral logic is assembled from natural-language modules rather than enforced by ordinary executable boundaries. A rubric, role file, skill instruction, workflow rule, or persona note may live in a separate file, but the model receives the combined text as one context.

That distinction is the whole problem. A filename, heading, markdown divider, Jinja include, or Python string constant may help humans organize instructions, but it is not a formal isolation boundary inside the model. The paper surveys OpenClaw, career-ops, OpenHands, and aider as examples of this spectrum. The authors' point is not that those projects are uniquely defective. Their point is that text-level composition became ordinary before the field had a standard test for cross-module behavior.
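To make the point concrete, here is a minimal sketch of text-level composition. The `PromptModule` class, the `render_prompt` helper, and the module names are hypothetical and are not taken from the paper or from any of the surveyed projects. The only thing the sketch shows is that separate modules collapse into a single string before the model sees them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptModule:
    name: str  # hypothetical label, e.g. a filename
    text: str  # natural-language instructions


def render_prompt(modules: list[PromptModule]) -> str:
    """Join modules under markdown-style headings into one context string."""
    return "\n\n".join(f"## {m.name}\n{m.text}" for m in modules)


modules = [
    PromptModule("scoring_rubric", "Score how well the CV matches the job."),
    PromptModule("shared_rules", "Be concise and cite evidence."),
]

rendered = render_prompt(modules)
print(rendered)
```

The headings survive as characters in `rendered`, but nothing in the resulting string enforces that one module's text cannot influence how the other is read.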

This is a fresh angle beside the site's earlier pages on prompt injection, stored prompt payloads, and agent skills. Those pages ask what happens when untrusted instructions, persistent memory, or packaged procedures enter an agent system. Instruction bleed asks what happens when trusted modules interfere with one another simply because they share attention.

What Instruction Bleed Means

Lin and Liu formalize the failure as compositional behavioral leakage, or CBL. In plain terms, CBL occurs when an edit to a non-focal prompt module produces a detectable paired shift in the behavior attached to a different focal module. There is no malicious payload required. There may be no tool call, no shared variable, and no changed code path. The shared context itself is the coupling mechanism.

The paper connects that mechanism to global transformer attention, accumulated context, weak coverage for novel module combinations, and formatting sensitivity. A section heading may suggest scope to a human maintainer, but it does not stop the model from statistically relating tokens across the whole prompt. The practical Spiralist translation is blunt: a prompt module is not a sealed part.

That matters because institutions increasingly treat prompt files as lightweight policy. Teams add a rule, skill, persona, safety note, or scoring rubric and assume the old behavior remains intact unless the new rule explicitly touches it. Instruction bleed says that assumption needs evidence.

The Job-Evaluation Test

The existence proof uses career-ops, a prompt-composed job-evaluation agent. The focal behavior is a cv_match score: how well a candidate background matches a job description. The experiment uses Claude Sonnet 4.6, 12 job descriptions, three independent runs per condition, and 144 total trials across the baseline and three perturbation conditions.
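The trial count follows directly from the design: four conditions (baseline plus three perturbations), 12 job descriptions, and three runs each. A small sketch of that grid, with hypothetical condition labels and job identifiers:

```python
from itertools import product

# Labels are illustrative; the paper names the perturbations C1, C2, and C3.
CONDITIONS = ["baseline", "C1_volume", "C2_content", "C3_form"]
JOB_IDS = [f"jd_{i:02d}" for i in range(1, 13)]  # 12 hypothetical job ids
RUNS = [1, 2, 3]

# Every (condition, job, run) combination is one trial.
trials = list(product(CONDITIONS, JOB_IDS, RUNS))
print(len(trials))  # 4 * 12 * 3 = 144
```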

The conditions are designed to separate channels. C1 adds an unrelated 200-line recipe-evaluation module, testing volume. C2 appends a semantically irrelevant archetype to a shared rules file, testing content. C3 changes heading levels, section order, and markers without changing meaning, testing form. C2 is the primary test because the edit is outside the focal scoring behavior but still semantically adjacent enough to plausibly interfere.
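The three channels can be sketched as three small transformations over a module set. This is an illustration under assumptions, not the paper's actual perturbation code: the module names, the filler text, and the heading-demotion rule are all made up to show the shape of a volume, content, and form edit.

```python
import re


def add_unrelated_module(modules: dict[str, str], lines: int = 200) -> dict[str, str]:
    """Volume channel: add a long, unrelated module (placeholder text)."""
    out = dict(modules)
    out["recipe_eval"] = "\n".join(
        f"Recipe rule {i}: placeholder." for i in range(1, lines + 1)
    )
    return out


def append_to_shared_rules(modules: dict[str, str], extra: str) -> dict[str, str]:
    """Content channel: append text to a shared (non-focal) rules module."""
    out = dict(modules)
    out["shared_rules"] = out["shared_rules"] + "\n" + extra
    return out


def demote_headings(modules: dict[str, str]) -> dict[str, str]:
    """Form channel: push every markdown heading down one level."""
    return {
        name: re.sub(r"^(#+) ", r"#\1 ", text, flags=re.MULTILINE)
        for name, text in modules.items()
    }


base = {
    "rubric": "# Rubric\nScore cv_match.",
    "shared_rules": "# Rules\nBe concise.",
}

c1 = add_unrelated_module(base)
c2 = append_to_shared_rules(base, "Archetype: an unrelated persona note.")
c3 = demote_headings(base)
```

In all three cases the focal `rubric` instructions keep their meaning; C3 changes only their heading level, which is what lets a measured shift be attributed to the edit's channel.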

The reported result is narrow but important. C2 shifts the cv_match score upward by 0.17 on average, with Cohen's d = 0.63 and a bootstrap 95% confidence interval on the shift of [+0.03, +0.31]. Eight of the 12 job descriptions move upward. C1 and C3 do not show confidence intervals excluding zero. No recommendation flips occur in any condition.
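For readers who want to see what a paired shift, an effect size, and a bootstrap interval look like in practice, here is a rough standard-library sketch. It assumes a paired-differences form of Cohen's d (mean difference over the standard deviation of differences) and a simple percentile bootstrap; the paper's exact estimators may differ. The scores below are invented for illustration and are not the paper's data.

```python
import random
import statistics


def paired_shift_stats(baseline, perturbed, n_boot=2000, seed=0):
    """Summarize the per-item shift between two paired score lists."""
    diffs = [p - b for b, p in zip(baseline, perturbed)]
    mean_shift = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    effect = mean_shift / sd if sd > 0 else float("nan")

    # Percentile bootstrap: resample the paired differences with replacement.
    rng = random.Random(seed)
    boot = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = boot[int(0.025 * n_boot)]
    hi = boot[int(0.975 * n_boot) - 1]

    return {
        "mean_shift": mean_shift,
        "cohens_d": effect,
        "ci95": (lo, hi),
        "n_up": sum(1 for x in diffs if x > 0),
    }


# Invented per-job mean scores (one value per job description).
baseline = [3.0, 3.5, 4.0, 2.5, 3.0, 4.5, 3.5, 2.0, 4.0, 3.0, 3.5, 2.5]
perturbed = [3.2, 3.6, 4.3, 2.5, 3.2, 4.4, 3.9, 2.0, 4.0, 3.3, 3.7, 2.4]

stats = paired_shift_stats(baseline, perturbed)
print(stats)
```

The decision rule the article describes corresponds to checking whether `ci95` excludes zero.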

That last detail makes the paper more useful, not less. The failure is not a dramatic wrong answer. It is a sub-threshold movement in a score that could feed ranking, prioritization, thresholds, or downstream aggregation. Standard QA that checks only the final recommendation would see nothing.
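A toy illustration of why a label-only check sees nothing: the threshold, the labels, and the scores here are hypothetical, chosen so that every score moves upward while no recommendation crosses the cutoff.

```python
from statistics import mean


def recommend(score: float, threshold: float = 4.0) -> str:
    """Hypothetical decision rule: recommend when the score clears a cutoff."""
    return "apply" if score >= threshold else "skip"


def compare(baseline: list[float], perturbed: list[float]) -> tuple[int, float]:
    """Return (number of recommendation flips, mean paired score shift)."""
    flips = sum(recommend(b) != recommend(p) for b, p in zip(baseline, perturbed))
    shift = mean(p - b for b, p in zip(baseline, perturbed))
    return flips, shift


baseline = [3.2, 4.3, 2.8, 4.6]
perturbed = [3.4, 4.5, 2.9, 4.8]

flips, shift = compare(baseline, perturbed)
print(flips, round(shift, 3))
```

A QA suite asserting only `flips == 0` would pass, even though the mean score rose by roughly 0.175.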

Why Normal QA Misses It

Agent governance often borrows a software instinct: test the unit, then compose the system. That instinct weakens when the runtime substrate is one model reading concatenated natural language. A module may pass alone and drift when surrounded by other modules. A harmless addition may move a score distribution without changing a visible pass/fail outcome.

The paper's regression-testing proposal is therefore the governance center of the work. Prompt modules should be tested for compositional consistency, module-interaction regression, format-perturbation robustness, and model-migration behavior. Adding a module should trigger tests of existing modules. Switching model families should rerun the behavioral suite instead of assuming transfer.
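One way a module-interaction regression check might look is sketched below. Everything here is an assumption made for illustration: the `interaction_regression` function, its tolerance, and especially `toy_score`, which is a deterministic stand-in for a real model call and deliberately simulates leakage so the check has something to catch.

```python
from statistics import mean
from typing import Callable, Sequence

# (rendered_prompt, task_input) -> focal score
ScoreFn = Callable[[str, str], float]


def interaction_regression(
    score_fn: ScoreFn,
    base_modules: Sequence[str],
    new_module: str,
    inputs: Sequence[str],
    tolerance: float = 0.1,
    runs: int = 3,
) -> dict:
    """Compare focal scores before and after adding a non-focal module."""
    before_prompt = "\n\n".join(base_modules)
    after_prompt = "\n\n".join([*base_modules, new_module])
    shifts = []
    for task in inputs:
        before = mean(score_fn(before_prompt, task) for _ in range(runs))
        after = mean(score_fn(after_prompt, task) for _ in range(runs))
        shifts.append(after - before)
    mean_shift = mean(shifts)
    return {"mean_shift": mean_shift, "passed": abs(mean_shift) <= tolerance}


def toy_score(prompt: str, task: str) -> float:
    # Stand-in for a model call: drifts upward when "archetype" is present.
    return 3.0 + (0.25 if "archetype" in prompt else 0.0)


base = ["Rubric: score cv_match.", "Rules: be concise."]
tasks = ["jd_a", "jd_b"]

leaky = interaction_regression(toy_score, base, "New archetype: wanderer.", tasks)
neutral = interaction_regression(toy_score, base, "Tone: stay polite.", tasks)
print(leaky, neutral)
```

With a real model the scores would be noisy, so a fixed tolerance on the mean is a crude gate; a confidence interval on the paired shift would be a more defensible criterion.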

This also changes how prompt repositories should be archived. A meaningful incident record needs the exact module set, order, separators, rendered prompt, model version, task inputs, score distributions, final decisions, and comparison baseline.

Limits That Matter

The paper is careful about scope. The case study uses one model and one agentic system. It is a proof of concept for a protocol, not a field-wide measurement of every agent platform. It does not prove that every prompt-composed system has the same leakage magnitude, that CBL always matters operationally, or that text-level composition is impossible to use responsibly.

The open question is isolation. Providers and agent frameworks may develop stronger primitives for cached prompt segments, restricted cross-segment attention, typed instruction channels, or policy engines outside the model. Until those mechanisms exist and are tested, module headings and file boundaries should be treated as organizational conveniences, not guarantees.

Governance Standard

Any organization using prompt-composed agents should treat prompt modules as interacting components. Each module should have an owner, purpose, activation condition, allowed scope, test set, version history, and known interference risks. A merged prompt change should also test adjacent and high-consequence modules that share the rendered context.
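Deciding which modules to retest after a change can be reduced to a lookup over which modules are rendered together. The sketch below assumes a hypothetical mapping from agent context names to the modules each one includes; none of these names come from the paper.

```python
def modules_to_retest(contexts: dict[str, list[str]], changed: str) -> set[str]:
    """Return every other module that shares a rendered context with `changed`."""
    affected: set[str] = set()
    for module_list in contexts.values():
        if changed in module_list:
            affected.update(module_list)
    affected.discard(changed)
    return affected


# Hypothetical agents and the modules each one concatenates.
contexts = {
    "job_eval": ["rubric", "shared_rules", "persona"],
    "email_draft": ["persona", "tone"],
    "recipe_eval": ["recipe_rubric"],
}

print(sorted(modules_to_retest(contexts, "shared_rules")))
print(sorted(modules_to_retest(contexts, "persona")))
```

A change to `persona` pulls in modules from two contexts, which is the kind of fan-out an ownership and scope record is meant to make visible.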

The minimum release record should include the full rendered prompt, module order, changed files, model version, evaluation cases, score distributions, decision flips, non-flip score shifts, and reviewer signoff. In consequential domains, sub-threshold score drift should be reviewed even when final labels do not change.
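The release record above maps naturally onto a small data structure. This is one possible shape, not a standard: the field names, the example values, and the choice to store a SHA-256 digest of the rendered prompt are all assumptions made for the sketch.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ReleaseRecord:
    rendered_prompt: str
    module_order: list[str]
    changed_files: list[str]
    model_version: str
    evaluation_cases: list[str]
    score_distributions: dict[str, list[float]]
    decision_flips: int
    non_flip_score_shift: float
    reviewer: str

    def prompt_sha256(self) -> str:
        """Digest of the exact rendered prompt, for later comparison."""
        return hashlib.sha256(self.rendered_prompt.encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        payload = asdict(self)
        payload["prompt_sha256"] = self.prompt_sha256()
        return json.dumps(payload, indent=2, sort_keys=True)


# All values below are placeholders.
record = ReleaseRecord(
    rendered_prompt="## rubric\nScore cv_match.\n\n## shared_rules\nBe concise.",
    module_order=["rubric", "shared_rules"],
    changed_files=["shared_rules.md"],
    model_version="example-model-version",
    evaluation_cases=["jd_01", "jd_02"],
    score_distributions={"baseline": [3.0, 3.5], "candidate": [3.2, 3.6]},
    decision_flips=0,
    non_flip_score_shift=0.15,
    reviewer="reviewer-name",
)

print(record.to_json())
```

Keeping `decision_flips` and `non_flip_score_shift` as separate fields means a release with zero flips still carries its score drift into review.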

The Spiralist rule is simple: if two instructions share a context window, they share a governance problem.
