Blog · arXiv Analysis · Last reviewed June 24, 2026

The Hidden Automaton Becomes the Agent Test

The June 2026 arXiv paper "Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning," by Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, and Gabriel Stanovsky, turns a hidden deterministic finite automaton into a clean test of agent discovery.

Not Just a Benchmark

The paper, arXiv:2606.16576 [cs.CL], was submitted on June 15, 2026. It proposes "agentic automata learning" as a controlled way to test whether tool-calling LLM agents can uncover hidden structure through interaction. The target environment is a hidden deterministic finite automaton, or DFA. The agent cannot see the machine. It can only query an oracle and use the replies to build a hypothesis.

That sounds narrow, but the narrowness is the virtue. Existing agent benchmarks often entangle web navigation, tools, memory, interface quirks, hidden state, and task scoring. Here the hidden world is known exactly, its complexity is tunable by the number of minimal DFA states, and the agent's discovery process can be inspected step by step.

Oracle and Hypothesis

The setup gives the agent two tool calls. A membership query asks whether a word belongs to the target language. An equivalence query submits a proposed DFA and asks whether it matches the hidden target; if it does not, the oracle returns a counterexample. This makes the test about adaptive planning, memory organization, evidence use, hypothesis construction, and tool discipline.
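To make the two tool calls concrete, here is a minimal sketch of such an oracle in Python. It is an illustration, not the paper's code: the names DFA, Oracle, membership, and equivalence are hypothetical, and returning a shortest counterexample is an assumption, since the post does not say how the paper's oracle picks one.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: tuple   # input symbols, each a one-character string
    start: int        # start state
    accepting: set    # accepting states
    delta: dict       # total transition map: (state, symbol) -> state

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Hypothetical oracle wrapper; the agent only ever sees these two methods."""

    def __init__(self, target):
        self._target = target

    def membership(self, word):
        # Membership query: is this word in the target language?
        return self._target.accepts(word)

    def equivalence(self, hypothesis):
        # Equivalence query: breadth-first search over pairs of states.
        # Returns None if the hypothesis matches the target, otherwise a
        # shortest word on which the two machines disagree (an assumption).
        t, h = self._target, hypothesis
        start = (t.start, h.start)
        seen = {start}
        queue = deque([(start, "")])
        while queue:
            (ts, hs), word = queue.popleft()
            if (ts in t.accepting) != (hs in h.accepting):
                return word
            for symbol in t.alphabet:
                nxt = (t.delta[(ts, symbol)], h.delta[(hs, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Hidden target: words over {a, b} with an even number of 'a's.
target = DFA(("a", "b"), 0, {0},
             {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})
oracle = Oracle(target)

# A deliberately wrong hypothesis that accepts every word.
accept_all = DFA(("a", "b"), 0, {0}, {(0, "a"): 0, (0, "b"): 0})

print(oracle.membership("aba"))        # True
print(oracle.equivalence(accept_all))  # a
print(oracle.equivalence(target))      # None
```

The agent never touches the target object directly. Everything it learns has to come through those two methods, which is what makes the trace auditable.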

The authors sampled 80 task instances across four complexity bands: 2-3, 4-5, 6-7, and 8-9 minimal DFA states. They evaluated six models: DeepSeek-V4-Pro, Gemini 3.1 Pro Preview, Gemini-3-Flash-Preview with thinking, Gemini 3.1 Flash Lite Preview, GPT-5.4 without thinking, and Llama-3.3-70B-Instruct-Turbo. They also compared the agents against classic active automata learning algorithms, including L* and TTT, which provide strong baselines rather than vibes.

What Failed

The headline result is not that agents fail absolutely. The paper reports that advanced reasoning models can perform non-trivial interactive discovery. The harder result is that performance falls sharply as the hidden automaton grows. For automata with 8-9 states, no evaluated model exceeds a 25% success rate, while the classic algorithms solve every task instance. Even among successful runs in that hardest band, Gemini 3.1 Pro, the best-performing model, uses about 45.8% more tool calls than TTT.

The paper also separates planning failures from reasoning failures. A planning failure means the model did not collect enough information for passive automata learners to infer the hidden DFA from the accumulated observations. A reasoning failure means the information was already present, but the model still failed to infer the correct DFA. This distinction matters because "more context" only helps one of those problems.
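One way to picture the split is a brute-force check on a failed run: do the collected observations already rule out every wrong candidate of the target's size? The sketch below is an assumption-heavy stand-in, not the paper's method. The paper uses passive automata learners; this version simply enumerates every DFA with a given number of states, which is only feasible for tiny machines. The function names and the tuple encoding (start, accepting, delta) are hypothetical.

```python
from collections import deque
from itertools import product


def accepts(dfa, word):
    start, accepting, delta = dfa
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


def equivalent(d1, d2, alphabet):
    """Breadth-first search over state pairs for one that disagrees on acceptance."""
    start = (d1[0], d2[0])
    seen, queue = {start}, deque([start])
    while queue:
        s1, s2 = queue.popleft()
        if (s1 in d1[1]) != (s2 in d2[1]):
            return False
        for symbol in alphabet:
            nxt = (d1[2][(s1, symbol)], d2[2][(s2, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def consistent_dfas(observations, n_states, alphabet):
    """Yield every n_states-state DFA (start state 0) that agrees with all observations."""
    keys = [(s, a) for s in range(n_states) for a in alphabet]
    for targets in product(range(n_states), repeat=len(keys)):
        delta = dict(zip(keys, targets))
        for mask in product([False, True], repeat=n_states):
            accepting = frozenset(s for s in range(n_states) if mask[s])
            dfa = (0, accepting, delta)
            if all(accepts(dfa, w) == label for w, label in observations.items()):
                yield dfa


def classify_failure(observations, target, n_states, alphabet):
    """Label a failed run: 'planning' if the evidence still admits a wrong DFA."""
    for candidate in consistent_dfas(observations, n_states, alphabet):
        if not equivalent(candidate, target, alphabet):
            return "planning"
    return "reasoning"


alphabet = ("a", "b")
# Target: even number of 'a's, encoded as (start, accepting, delta).
target = (0, frozenset({0}),
          {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})

# Two observations leave many wrong two-state machines standing.
sparse = {"": True, "b": True}

# Labels for every word of length 0 to 3 pin down a two-state target.
words = ["".join(w) for k in range(4) for w in product(alphabet, repeat=k)]
dense = {w: accepts(target, w) for w in words}

print(classify_failure(sparse, target, 2, alphabet))  # planning
print(classify_failure(dense, target, 2, alphabet))   # reasoning
```

Note that "uniquely determined among same-size DFAs" is a weaker bar than what a practical passive learner needs, so this sketch would label some runs as reasoning failures that a real learner would count as planning failures.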

The World Model Claim

The title asks whether LLM agents infer world models. The cautious answer is: sometimes, in small formal worlds, but not with the robustness or efficiency of specialized algorithms. That is a useful corrective to loose talk about agent cognition. A model can act as if it is exploring, and still repeat non-informative queries, ignore accumulated evidence, propose hypotheses already contradicted by prior observations, or overuse equivalence queries.

The authors report that non-informative queries become more common as interactions get longer. After roughly 60 steps, even the most consistent model in their analysis, DeepSeek-V4-Pro, issues non-informative queries about 20% of the time. Classic active-learning algorithms maintain 0% by construction. The gap is not only answer quality. It is process integrity.
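A trace check along these lines is easy to sketch. The version below is illustrative only: the trace format, the function name flag_non_informative, and the working definition of "non-informative" (a membership query whose answer is already known, or an equivalence query whose hypothesis contradicts a known label) are assumptions, not the paper's exact criteria.

```python
def flag_non_informative(trace):
    """Return one boolean per step: True if the step could not add information.

    Each step is a dict (hypothetical format):
      {"kind": "mq", "word": str, "answer": bool}
      {"kind": "eq", "hypothesis": callable word -> bool,
       "counterexample": str or None}
    """
    known = {}   # word -> label learned so far
    flags = []
    for step in trace:
        if step["kind"] == "mq":
            # Asking about a word whose label is already known is redundant.
            flags.append(step["word"] in known)
            known[step["word"]] = step["answer"]
        else:
            hyp = step["hypothesis"]
            # A hypothesis that disagrees with known labels was refutable offline.
            flags.append(any(hyp(w) != label for w, label in known.items()))
            cex = step["counterexample"]
            if cex is not None:
                # The target's label on a counterexample is the opposite
                # of what the hypothesis says.
                known[cex] = not hyp(cex)
    return flags


# Toy trace against a target that accepts words with an even number of 'a's.
trace = [
    {"kind": "mq", "word": "a", "answer": False},
    {"kind": "mq", "word": "a", "answer": False},          # repeated query
    {"kind": "eq", "hypothesis": lambda w: True,           # contradicts "a"
     "counterexample": "a"},
    {"kind": "mq", "word": "aa", "answer": True},
]

flags = flag_non_informative(trace)
rate = sum(flags) / len(flags)
print(flags)  # [False, True, True, False]
print(rate)   # 0.5
```

Computing the same flags over a sliding window of steps would show whether the rate climbs as a run gets longer, which is the pattern the authors describe.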

Why It Matters

This result connects directly to agent-society benchmarks, context-window failure, and world-model claims. A deployed agent may need to infer the rules of a workflow, software interface, market, database, lab protocol, game, classroom, or institution from interaction. If it mistakes local evidence for global structure, it may complete easy tasks while building the wrong operational picture.

The automaton test gives a cleaner version of that institutional problem. The agent must decide what to ask next, remember what it already learned, avoid redundant work, update a hypothesis, and stop when the evidence supports a correct model. Those are exactly the skills hidden behind phrases like "autonomous research," "computer use," "AI scientist," and "workflow agent."

Limits That Matter

The paper's authors are careful about cost and scope. The full evaluation suite cost about $1,200 for 480 runs, or roughly $2.50 per datapoint, and costs rose with task complexity because long agent runs accumulate tokens, runtime, and API spend. They also identify future extensions: non-deterministic or stochastic environments, noisy or partial feedback, delayed or incorrect signals, weaker alternatives to equivalence queries, and use as a training environment.
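The cost figures are internally consistent, as a quick check shows. The sketch assumes each of the six models ran each of the 80 task instances exactly once, which matches the 480 runs quoted above; the variable names are illustrative.

```python
models = 6              # models evaluated
task_instances = 80     # sampled hidden DFAs
total_cost_usd = 1200   # approximate cost of the full suite

runs = models * task_instances
cost_per_run = total_cost_usd / runs

print(runs)          # 480
print(cost_per_run)  # 2.5
```

The $2.50 figure is an average. Because costs rose with task complexity, runs in the 8-9 state band would sit above it and runs in the 2-3 state band below it.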

That means the paper is not proof that agents cannot learn world structure. It is evidence that a formal, controlled version of that skill remains hard, expensive to evaluate, and process-sensitive. The governance lesson is to inspect the trace, not only the final answer.

Governance Standard

A serious agent evaluation should record the hidden-state class, observable query history, tool calls, counterexamples, rejected hypotheses, repeated queries, contradiction events, stopping reason, and comparison to a strong non-LLM baseline where one exists. If the agent claims to have learned a workflow, the record should show how its hypothesis changed and which observations would falsify it.
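As a rough illustration, that checklist could be captured in a single run record. The schema below is a hypothetical sketch, not a standard or the paper's logging format; every field name is an assumption chosen to mirror the list above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class AgentRunRecord:
    """Hypothetical per-run audit record for an interactive discovery task."""
    hidden_state_class: str                                   # e.g. "DFA, 4-5 minimal states"
    stopping_reason: str                                      # e.g. "hypothesis accepted"
    query_history: list = field(default_factory=list)         # ordered tool calls and replies
    counterexamples: list = field(default_factory=list)       # words returned by the oracle
    rejected_hypotheses: list = field(default_factory=list)   # hypotheses the oracle refuted
    repeated_queries: list = field(default_factory=list)      # indices of redundant steps
    contradiction_events: list = field(default_factory=list)  # steps ignoring known evidence
    baseline_tool_calls: Optional[int] = None                 # strong non-LLM baseline, if any

    @property
    def tool_calls(self):
        return len(self.query_history)


record = AgentRunRecord(
    hidden_state_class="DFA, 2-3 minimal states",
    stopping_reason="hypothesis accepted",
    query_history=[
        {"tool": "membership", "word": "a", "reply": False},
        {"tool": "membership", "word": "a", "reply": False},
        {"tool": "equivalence", "hypothesis_id": 1, "reply": None},
    ],
    repeated_queries=[1],
    baseline_tool_calls=2,
)

serialized = json.dumps(asdict(record), indent=2)
print(record.tool_calls)  # 3
```

Keeping the baseline count in the same record makes the efficiency gap visible per run rather than only in aggregate.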

This belongs beside AI agents, world models, reasoning models, and agent trace analysis. The Spiralist rule is simple: an agent that only succeeds at the end may still have failed as a learner. Inspect the discovery path.
