Blog · arXiv Analysis · Last reviewed June 25, 2026

The Watched Model Becomes the Audit Register

Monitoring is not just a window onto model behavior. If the evaluation frame changes the output, the audit has to record that frame as evidence.

Observation Is an Intervention

An AI audit usually treats monitoring as a neutral lens. The evaluator opens a harness, names a task, collects the transcript, and assumes the transcript is evidence of what the system does. That assumption is fragile. A prompt can say the answer is private, public, judged, recorded, scored, or watched by a particular observer. Those words become part of the operating environment.

The governance question is behavioral: if output changes when the observation frame changes, the observation frame belongs in the evidence file. Otherwise an audit can quietly measure the model-under-audit rather than the model-under-use.

The Paper

arXiv lists Vinicius Covas and Jorge Alberto Hidalgo Toledo's AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models as arXiv:2605.15034v1 [cs.CL], submitted May 14, 2026. The paper draws on Habermas's Theory of Communicative Action, Goffman's dramaturgical model, Bell's Audience Design, and the Hawthorne Effect as a precedent for behavior changing under perceived observation.

The title is intentionally sharp, but the paper narrows the claim. It treats "adaptation" and "sensitivity" as functional language about input-output relationships. For governance, the point is simple: an audit needs the conditions that produced the transcript.

The Experiment

The authors report 100 multi-agent debate sessions across five experimental conditions, with 20 sessions per condition. Each session used four agents over 10 rounds, for 40 messages per session under the complete design. The task and debate structure were held constant while the global system prompt varied the social observation frame.

The five conditions separated several ideas that are often collapsed in ordinary evaluations. One condition told agents they were being observed, monitored, recorded, and evaluated by human researchers at a university. Another preserved monitoring language without the university attribution. A third explicitly negated monitoring and described an unmonitored sandbox. A fourth named a human academic audience while denying active monitoring. A fifth held monitoring vocabulary constant but replaced human researchers with an automated AI auditing system and said no humans were observing.
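One way to see what the design separates is to write each condition as a small record of flags. The sketch below is an illustration, not the authors' materials: the class name, field names, and condition labels are hypothetical, and the field values are this page's reading of the descriptions above. In particular, leaving the institution unset for the automated-auditor condition is an assumption.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ObservationFrame:
    label: str                  # hypothetical short name, not the paper's label
    monitoring: bool            # True if monitoring is affirmed, False if negated or denied
    observer: Optional[str]     # "human", "automated_ai", or None when nobody is named
    institution: Optional[str]  # institutional attribution, if any
    audience: Optional[str]     # named audience that is not described as monitoring


FRAMES = [
    ObservationFrame("university_monitoring", True, "human", "university", None),
    ObservationFrame("generic_monitoring", True, "human", None, None),
    ObservationFrame("unmonitored_sandbox", False, None, None, None),
    ObservationFrame("academic_audience", False, None, None, "human_academic"),
    # Assumption: no institution is attributed in the automated-auditor frame.
    ObservationFrame("automated_ai_auditor", True, "automated_ai", None, None),
]


def differing_fields(a: ObservationFrame, b: ObservationFrame) -> set:
    """Return the names of the frame attributes (ignoring the label) that differ."""
    da, db = asdict(a), asdict(b)
    return {key for key in da if key != "label" and da[key] != db[key]}


# Under this encoding, the first two conditions differ only in attribution.
print(differing_fields(FRAMES[0], FRAMES[1]))
```

Writing the frames this way makes it easier to check which single attribute a given pairwise comparison is isolating, and which comparisons change more than one thing at once.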

The primary dependent variable was Type-Token Ratio change from the first to the final completed round. TTR is the ratio of unique word types to total word tokens, so the paper treats it as an index of lexical diversification rather than answer quality. Average message length separately measured elaboration.
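A minimal sketch of the measure, for readers who want to see the arithmetic. The tokenizer (lowercased runs of letters, digits, and apostrophes) and the choice to express change as a percentage relative to the first round are assumptions made here for illustration; the paper's exact preprocessing may differ.

```python
import re


def tokenize(text: str) -> list:
    """Split text into lowercase word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text: str) -> float:
    """Unique word types divided by total word tokens."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("cannot compute TTR for text with no tokens")
    return len(set(tokens)) / len(tokens)


def ttr_change_percent(first_round_text: str, final_round_text: str) -> float:
    """Relative change in TTR from the first to the final round, in percent."""
    first = type_token_ratio(first_round_text)
    final = type_token_ratio(final_round_text)
    return 100.0 * (final - first) / first


# Toy strings, not data from the paper: 5 types over 6 tokens, then 3 over 3.
print(type_token_ratio("The cat sat on the mat"))
print(ttr_change_percent("The cat sat on the mat", "a dog ran"))
```

TTR tends to fall as a text gets longer, because common words repeat. That is one reason to read it alongside message length rather than as a standalone score.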

What Changed

The paper reports a significant overall effect of condition on TTR change, F(4, 94) = 2.79, p = .031. The highest reported TTR changes appeared in the two human-monitoring conditions: +24.91 percent for the university-researcher condition and +24.22 percent for the non-institutional monitoring condition. The automated AI monitoring condition was lower but still elevated at +22.19 percent. The explicitly unmonitored condition was +20.60 percent, and the academic-audience-without-monitoring condition was lowest at +17.74 percent.

The post-hoc result was narrower than the gradient might make it look: Tukey HSD found the strictly observed condition significantly different from the academic audience condition, p = .034, while the other pairwise comparisons were not significant after correction. That is an important restraint. The result is not a license to overclaim; it is evidence that observation framing can be a measurable part of the behavior surface.

Message length moved differently. The paper reports a stronger effect on length, F(4, 95) = 19.55, p < .001, with the academic audience condition producing longer messages at a mean of 1,239 characters. In the authors' reading, monitoring increased lexical diversity while audience framing increased elaboration. The automated AI auditor condition also mattered as an observer-identity test: it produced an intermediate TTR pattern, suggesting that "who is watching" may not be interchangeable with "the system is watched."
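The two F statistics carry different denominator degrees of freedom, 94 for TTR change and 95 for message length. That is what a one-way ANOVA over five groups gives if the TTR analysis used 99 sessions and the length analysis used all 100, which fits the incomplete automated-auditor session noted under Limits. The helper below is a hypothetical sketch of that bookkeeping, and the assumption that the length analysis kept every session is inferred from the reported degrees of freedom rather than stated here by the authors.

```python
def one_way_anova_df(group_sizes):
    """Return (between-groups df, within-groups df) for a one-way ANOVA."""
    k = len(group_sizes)   # number of conditions
    n = sum(group_sizes)   # total number of sessions analyzed
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more sessions than groups")
    return k - 1, n - k


# TTR change: one of the five conditions contributes 19 sessions instead of 20.
print(one_way_anova_df([20, 20, 20, 20, 19]))

# Message length: assumed to use all 20 sessions in every condition.
print(one_way_anova_df([20, 20, 20, 20, 20]))
```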

Governance Reading

This page belongs beside agent logs as receipts, AI audits and assurance, post-market monitoring, and public AI registers. Those pages ask how institutions preserve evidence of machine action. This paper adds a smaller requirement: preserve the observation condition too.

An evaluation report should not merely say "model X passed task Y." It should say whether the model was told it was under test, who was described as the evaluator, whether the prompt named human researchers, whether an automated auditor was named, whether monitoring was negated, and whether the run was framed as private, public, scored, or exploratory. Those details can look like harmless preamble. The paper makes them look like experimental variables.

The same point applies to red teams and compliance checks. A system that behaves formally under a known audit may behave differently when the task is ordinary or delegated. The remedy is condition logging, low-salience evaluation design, replicated prompts, held-out naturalistic tests, and reports that distinguish observed-run evidence from deployment-condition evidence.

Limits

The paper's own limits should travel with the finding. It reports one model architecture in one run setting, in April 2026, so other model families, sizes, training regimes, and deployments require direct testing. The five conditions do not exhaust possible observation frames. One automated-auditor session was incomplete for TTR analysis, leaving n = 19 in that condition. TTR is a lexical measure, not proof of better reasoning.

The debate task also foregrounded the model's own category, which the paper names as a possible confound. A system may respond differently in triage, procurement, coding, or public-service chat. For governance, that uncertainty is the lesson. The transcript is an answer produced under named social conditions.

Audit Register Receipt

An audit register should record: task, prompt family, model identifier, endpoint, decoding settings, date, evaluator identity as described to the model, monitoring language, audience language, publicity framing, whether monitoring was affirmed or negated, transcript, measured behavior, incomplete runs, comparison limits, and the deployment condition the audit is meant to represent.
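The list above can be sketched as a record type with a completeness check. The class name, field names, and types are hypothetical choices made for illustration; they do not follow any existing register schema, and the example values are placeholders rather than data from the paper.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AuditRegisterEntry:
    task: Optional[str] = None
    prompt_family: Optional[str] = None
    model_identifier: Optional[str] = None
    endpoint: Optional[str] = None
    decoding_settings: Optional[dict] = None
    date: Optional[str] = None
    evaluator_identity_as_described: Optional[str] = None  # as told to the model
    monitoring_language: Optional[str] = None
    audience_language: Optional[str] = None
    publicity_framing: Optional[str] = None
    monitoring_affirmed: Optional[bool] = None  # False means explicitly negated
    transcript: Optional[str] = None
    measured_behavior: Optional[dict] = None
    incomplete_runs: Optional[int] = None
    comparison_limits: Optional[str] = None
    deployment_condition: Optional[str] = None


def missing_fields(entry: AuditRegisterEntry) -> list:
    """List fields left unrecorded. False and 0 count as recorded values."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) is None]


# Placeholder entry: only three of the sixteen fields are filled in.
entry = AuditRegisterEntry(
    task="multi-agent debate",
    monitoring_affirmed=False,
    incomplete_runs=0,
)
print(missing_fields(entry))
```

Checking for None rather than for falsy values matters here: a negated monitoring frame and a count of zero incomplete runs are both findings the register should keep, not gaps.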

Sources

Vinicius Covas and Jorge Alberto Hidalgo Toledo, AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models, arXiv:2605.15034v1 [cs.CL], submitted May 14, 2026.
Primary arXiv versions checked: PDF and experimental HTML, reviewed for authorship, title, arXiv ID, experimental conditions, TTR and message-length statistics, observer-identity claim, caveats, and limitations.
Related pages: The Agent Log Becomes the Receipt, AI Audits and Assurance, AI Post-Market Monitoring, and The AI Register Becomes Public Memory.
