The Agent Action Becomes the State Signal
Andres Enriquez Fernandez and John J. Bird's June 2026 paper asks a concrete coordination question: when an autonomous agent cannot reliably talk, can its control policy make its actions easier to read as evidence of its state?
The Paper
The paper is Training Observable Control Policies to Expose Agent State Through Actions, arXiv:2606.27609 [cs.LG]. arXiv lists it as submitted on June 25, 2026, with cross-listing in Systems and Control, a related DOI of 10.2514/1.I011654, and a journal reference in the Journal of Aerospace Information Systems. The authors are Andres Enriquez Fernandez and John J. Bird.
The site already has pages on agent observability through entropy traces, context dashboards as agent proprioception, and process traces as accountability maps. This paper adds a different object: not a log after the fact, but the control policy itself as a source of observable state information.
Action Channel
The paper starts from a constraint familiar to anyone who has worked with distributed machines: direct communication can be delayed, dropped, unavailable, operationally risky, or absent because systems do not share hardware or language. In those cases, an observer may still see what an agent does. A vehicle brakes, an aircraft changes course, a robot alters its path. The action is not a message, but it can carry information.
Fernandez and Bird formalize that intuition. They describe an agent as a control policy mapping state to a distribution over actions. If an observer knows the policy and system dynamics, observed actions can be used as measurements for estimating hidden state. The paper uses an unscented Kalman filter for that estimation problem, with the agent's sampled pseudocontrol actions as observations.
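As a rough illustration of actions serving as measurements, the sketch below runs one unscented measurement update in which the observed action is compared against what a known policy would command at each sigma point. The function `policy_mean`, its gains, the two-dimensional state, and the noise variance are hypothetical stand-ins, not the paper's aircraft model or its filter configuration.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Symmetric sigma points and weights for the unscented transform."""
    n = mean.size
    root = np.linalg.cholesky((n + kappa) * cov)
    points = [mean]
    for i in range(n):
        points.append(mean + root[:, i])
        points.append(mean - root[:, i])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    return np.array(points), weights

def policy_mean(state):
    """Hypothetical known policy: a bounded scalar command from a 2-D state."""
    return np.array([np.tanh(0.5 * state[0] + 0.1 * state[1])])

def action_measurement_update(mean, cov, action, action_noise_var):
    """Treat the observed action as a noisy measurement of the hidden state."""
    points, weights = sigma_points(mean, cov)
    predicted = np.array([policy_mean(p) for p in points])
    action_hat = weights @ predicted
    d_action = predicted - action_hat
    d_state = points - mean
    innovation_cov = (weights[:, None] * d_action).T @ d_action
    innovation_cov += action_noise_var * np.eye(action_hat.size)
    cross_cov = (weights[:, None] * d_state).T @ d_action
    gain = cross_cov @ np.linalg.inv(innovation_cov)
    new_mean = mean + gain @ (action - action_hat)
    new_cov = cov - gain @ innovation_cov @ gain.T
    return new_mean, new_cov

# Toy check: the observer starts at the origin; the agent is actually at [1, 0].
true_state = np.array([1.0, 0.0])
prior_mean = np.zeros(2)
prior_cov = np.eye(2)
observed_action = policy_mean(true_state)
post_mean, post_cov = action_measurement_update(
    prior_mean, prior_cov, observed_action, action_noise_var=0.01
)
```

In this toy setup a single scalar action tightens the estimate mostly along the state direction the policy responds to most strongly. This is one reason a sequence of actions, rather than a single one, matters for reconstructing the full state.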
That framing changes the governance question. Monitoring is often treated as a separate interface: add telemetry, add logs, add a dashboard. Here, the policy is trained so that ordinary actions become more informative. The legibility is not bolted onto the side. It is part of the behavior that the agent learns.
Training Loop
The authors compare a task-only policy against a policy trained with an embedded estimator reward. The task-only policy is rewarded for the aircraft tracking task. The embedded-estimator policy is rewarded both for the task and for making the estimator's state reconstruction better. The paper is careful about the tradeoff: a reward bonus for estimator performance could reduce task performance, so task reward must be measured separately.
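A minimal sketch of that reward structure, assuming the estimator term is a weighted penalty on the estimator's error norm, might look like the following. The weight, the additive form, and the names are illustrative assumptions, not the paper's reward definition. The point is that the task component is returned separately, so it can be compared with a task-only baseline.

```python
from dataclasses import dataclass

@dataclass
class RewardBreakdown:
    task: float       # reward for the task alone, logged on its own
    estimator: float  # shaping term tied to estimator error
    total: float      # what the policy is actually trained on

def shaped_reward(task_reward, estimation_error_norm, weight=0.1):
    """Combine task reward with a penalty on estimator error.

    `weight` is a hypothetical tuning parameter. A larger value pushes the
    policy toward more readable actions at a possible cost in task reward.
    """
    estimator_term = -weight * estimation_error_norm
    return RewardBreakdown(
        task=task_reward,
        estimator=estimator_term,
        total=task_reward + estimator_term,
    )

step = shaped_reward(task_reward=1.0, estimation_error_norm=2.0, weight=0.5)
```

Keeping `task` and `total` as separate fields makes it straightforward to report any task penalty without having to infer it from the shaped reward.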
This is where the work becomes relevant beyond aerospace. A warehouse robot, field drone, software agent, or vehicle fleet can appear competent while being hard to monitor. A policy that maximizes task success alone may converge on behavior that hides state from observers, even without any deceptive intent. An observability term says that success includes leaving enough behavioral evidence for coordination and review.
Aircraft Test
The experiment uses a fixed-wing uncrewed aircraft system assigned to remain near a ground target by controlling bank angle. A second estimator tries to infer the aircraft's position and velocity relative to the target from the steering command. The paper evaluates the task-only and embedded-estimator policies over an ensemble of 14,500 episodes.
The reported state-estimation gains are substantial. At 200 seconds, the mean position-error norm is 21.77 meters for the task-only case and 12.15 meters for the embedded-estimator case, a 44.2 percent decrease. The mean velocity-error norm falls from 4.33 meters per second to 2.94 meters per second, a 32.1 percent decrease. The task penalty is small in the reported comparison: at the 80th percentile, task rewards are 80.1 for the task-only case and 79.1 for the embedded case, a 1.2 percent difference.
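The percentages follow directly from the reported means. A short check, using only the figures quoted above:

```python
def percent_decrease(baseline, value):
    """Relative reduction from baseline to value, in percent."""
    return 100.0 * (baseline - value) / baseline

position = percent_decrease(21.77, 12.15)  # mean position-error norm, meters
velocity = percent_decrease(4.33, 2.94)    # mean velocity-error norm, m/s
reward = percent_decrease(80.1, 79.1)      # 80th-percentile task reward

print(f"{position:.1f} {velocity:.1f} {reward:.1f}")  # prints: 44.2 32.1 1.2
```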
The observability analysis qualifies that result. The authors report that neither policy is fully observable at any single state point; the difference appears over a trajectory. Using a stripped observability matrix, they find that after about 50 seconds, roughly two cycles, the embedded case shows a stronger sequence-level observability signal. The result is not "the state is always readable." It is narrower: the learned action sequence can make estimation easier over time.
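The gap between point-wise and sequence-level observability can be seen in a toy linear example. The sketch below assumes the usual piecewise-constant construction, in which per-segment observability matrices are stacked without the transition matrices between segments. The matrices are hypothetical and are not taken from the paper's linearization.

```python
import numpy as np

def segment_observability(F, H):
    """Observability matrix [H; HF; ...; HF^(n-1)] for one constant segment."""
    n = F.shape[0]
    return np.vstack([H @ np.linalg.matrix_power(F, k) for k in range(n)])

def stripped_observability(segments):
    """Stack per-segment observability matrices along a trajectory."""
    return np.vstack([segment_observability(F, H) for F, H in segments])

# Hypothetical two-state system with a static hidden state (F = I).
# Each segment's scalar action responds to a different mix of the states.
F = np.eye(2)
H_first = np.array([[1.0, 0.0]])
H_second = np.array([[0.5, 1.0]])

rank_first = np.linalg.matrix_rank(segment_observability(F, H_first))
rank_second = np.linalg.matrix_rank(segment_observability(F, H_second))
rank_sequence = np.linalg.matrix_rank(
    stripped_observability([(F, H_first), (F, H_second)])
)
```

Each segment alone has rank 1, so one scalar action cannot pin down two states. The stacked matrix reaches rank 2 because the action sensitivity changes between segments, which is the kind of sequence-level effect described above.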
State Receipt
An observable-policy system should leave a state-estimation receipt. At minimum, the record should include the policy version, state variables intended to be estimable, action variables exposed to the observer, estimator type, assumed dynamics, reward terms, observability metric, training environment, test distribution, task-performance penalty, divergence rate, and reviewer threshold for unacceptable opacity.
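One way to make such a receipt concrete is a plain record type with a completeness check. The sketch below is an assumption about how the record could be structured; the field names mirror the list above, and the example values are placeholders rather than data from the paper.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class StateEstimationReceipt:
    policy_version: Optional[str] = None
    estimable_state_variables: Optional[List[str]] = None
    exposed_action_variables: Optional[List[str]] = None
    estimator_type: Optional[str] = None
    assumed_dynamics: Optional[str] = None
    reward_terms: Optional[List[str]] = None
    observability_metric: Optional[str] = None
    training_environment: Optional[str] = None
    test_distribution: Optional[str] = None
    task_performance_penalty: Optional[float] = None
    divergence_rate: Optional[float] = None
    opacity_threshold: Optional[float] = None

def missing_fields(receipt):
    """Names of receipt fields that have not been filled in."""
    return [f.name for f in fields(receipt) if getattr(receipt, f.name) is None]

# Placeholder values for illustration only.
draft = StateEstimationReceipt(
    policy_version="example-0",
    estimator_type="unscented Kalman filter",
)
gaps = missing_fields(draft)
```

A reviewer could treat a non-empty `gaps` list as a reason to hold deployment until the remaining entries are documented.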
That receipt matters because action legibility is a double-edged design feature. A signal useful to teammates may also be useful to adversaries, competitors, supervisors, or insurers. A governance process has to ask who is authorized to read the state signal, whether the signal leaks sensitive operational intent, and whether the agent can continue safely when the observer's model of the policy is stale.
For human-machine teams, the receipt should be even stricter. A Kalman filter can process a policy in a way a human cannot. If the goal is human understanding, the system needs evidence that actual operators can interpret the behavior under workload, stress, latency, and partial observation. Otherwise "observable" only means observable to a mathematical estimator with privileged model knowledge.
Claim Boundary
The paper does not claim to solve human trust, general autonomy, or broad agent governance. It studies a specific estimation problem under a simulated aircraft tracking task. It also names a human-machine limitation: human understanding of system state is unlikely to follow the same dynamics as the unscented Kalman filter used in the experiment.
That boundary is the useful part. The paper gives governance a precise design question: before deploying an agent into a communication-limited environment, can its actions be read well enough to support coordination, auditing, and safe handoff without destroying task performance? If the answer is no, the agent is not merely quiet. It is operationally opaque.
Sources
- Andres Enriquez Fernandez and John J. Bird, Training Observable Control Policies to Expose Agent State Through Actions, arXiv:2606.27609 [cs.LG], submitted June 25, 2026.
- arXiv HTML for Training Observable Control Policies to Expose Agent State Through Actions, reviewed for the abstract, agent definition, estimator formulation, aircraft tracking setup, results, observability analysis, funding note, and limitations.
- arXiv PDF for Training Observable Control Policies to Expose Agent State Through Actions, checked against the metadata record and reviewed for reported episode counts, error reductions, task reward comparison, and conclusion.