Blog · arXiv Analysis · Published: June 25, 2026

The Denoising Clock Becomes the Hidden State

Diffusion language models do not need an explicit timestep input to behave as if they are keeping time. That hidden clock should become part of the audit record.

The Paper

The paper is Subliminal Clocks: Latent Time Modelling in Diffusion Language Models, arXiv:2607.01774 [cs.AI, cs.CL]. The arXiv record lists version 1 as submitted on July 2, 2026. The authors are Maximo Rulli, Thomas Fontanari, Simone Petruzzi, Federico Alvetreti, Giorgio Strano, Donato Crisostomi, Giorgos Nikolaou, Tommaso Mencattini, Andrea Santilli, Emanuele Rodolà, Simone Scardapane, and Alessio Devoto. The arXiv metadata lists Sapienza University of Rome, EPFL, and independent-researcher affiliations, and the PDF is 23 pages.

The question is narrow but important: when a masked diffusion language model generates text by repeatedly filling masked positions, does it internally represent how far along the denoising process is, even when no explicit timestep is supplied?

The Hidden Clock

Autoregressive models expose a simple external order: token one, token two, token three. The models in this paper are different. The authors study large-scale masked diffusion language models, especially LLaDA-1.5 and Dream, which generate by starting with masked response positions and progressively unmasking them. The paper defines denoising progress as the fraction of response tokens already unmasked, using that value as an empirical proxy for time.
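To make the proxy concrete, here is a minimal sketch of the progress measure as described above: the share of response positions that no longer hold a mask. The `MASK` string and the function name are illustrative assumptions, not identifiers from the paper or from either model's codebase.

```python
# Hypothetical mask placeholder; real models use a special token id instead.
MASK = "[MASK]"

def denoising_progress(response_tokens, mask_token=MASK):
    """Return the fraction of response positions already unmasked (0.0 to 1.0)."""
    if not response_tokens:
        raise ValueError("response must contain at least one position")
    unmasked = sum(1 for tok in response_tokens if tok != mask_token)
    return unmasked / len(response_tokens)

# Toy partial state: 3 of 8 response positions have been filled in.
partial = ["The", MASK, "is", MASK, MASK, "clock", MASK, MASK]
progress = denoising_progress(partial)
```

Only response positions are counted here; whether prompt tokens should be excluded in exactly this way is an assumption of the sketch.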

That framing matters for governance because the model's surface answer is not the whole event. A diffusion language model's behavior depends on a sequence of partial states, mask ratios, unmasking choices, and internal representations of progress. If an output is later audited, the record should say more than which prompt and model name produced it.

What the Probes Found

The authors trained MLP probes on residual-stream hidden states, separately for every layer in both LLaDA and Dream. They also varied the token subset used for training: masked tokens, unmasked tokens, or all tokens. For LLaDA, the probes maintain an R² above 0.5 across all layers, and the paper reports similar trends for Dream. Masked-token activations are slightly more predictive, but non-masked activations also carry the signal. In plain terms, the model is not only counting visible masks at one location. Denoising progress is distributed through token representations.
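For readers who want the threshold spelled out, the sketch below computes the coefficient of determination for a set of probe predictions. It shows only the scoring step, not the MLP probe itself, and the numbers are toy values invented for illustration rather than results from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("targets have zero variance")
    return 1.0 - ss_res / ss_tot

# Toy data: true denoising progress vs. hypothetical probe outputs.
true_progress = [0.0, 0.25, 0.5, 0.75, 1.0]
probe_output = [0.1, 0.2, 0.5, 0.8, 0.9]
score = r_squared(true_progress, probe_output)

# A probe that always predicts the mean progress explains no variance.
baseline = r_squared(true_progress, [0.5] * 5)
```

Read this way, a score above 0.5 means the probe accounts for more than half of the variance in denoising progress, while a constant guess scores zero.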

The paper then approximates the same signal with mean activation vectors. It discretizes denoising into 100 bins, producing 3,200 mean vectors for LLaDA and 2,800 for Dream, counts that are consistent with one mean vector per bin per layer. The probe predictions on those mean vectors correlate strongly with the normalized denoising bin: Pearson 0.976 and Spearman 0.980 for LLaDA, and Pearson 0.962 and Spearman 0.974 for Dream. This turns the clock from a diagnostic guess into a structured activation-space object.
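The binning and correlation steps can be sketched in plain Python. This is an illustration under assumptions: the helper names are hypothetical, the activations are toy two-dimensional vectors, and the paper's exact binning and averaging procedure may differ.

```python
import math

def bin_index(progress, n_bins=100):
    """Map progress in [0, 1] to a bin id in [0, n_bins - 1]."""
    return min(int(progress * n_bins), n_bins - 1)

def mean_vectors_by_bin(activations, progresses, n_bins=100):
    """Average the activation vectors that fall into each progress bin."""
    sums, counts = {}, {}
    for vec, p in zip(activations, progresses):
        b = bin_index(p, n_bins)
        acc = sums.setdefault(b, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[b] = counts.get(b, 0) + 1
    return {b: [s / counts[b] for s in acc] for b, acc in sums.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Toy activations for one layer (not paper data).
means = mean_vectors_by_bin(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [0.125, 0.127, 0.875],
)

# Hypothetical probe predictions on four bin means.
bins = [0, 1, 2, 3]
preds = [0.0, 0.1, 0.4, 0.9]
r_p, r_s = pearson(bins, preds), spearman(bins, preds)
```

In the toy case the predictions rise monotonically but not linearly, so the rank correlation is perfect while the Pearson value is slightly lower. That is the usual reason to report both.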

What Steering Changed

The strongest evidence is causal. The authors steer the hidden state toward a target denoising step and compare the resulting token distributions with clean runs and norm-matched random perturbations. In LLaDA, the main example steers layer 29; in Dream, a similar effect appears at layer 25. Steering toward a later denoising step increases confidence and decreases entropy. Steering backward decreases confidence and increases entropy. The KL divergence from the clean run grows as the target moves farther from the current step.
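The three statistics in that comparison are simple to compute once the token distributions are in hand. The sketch below uses invented three-token distributions, and the direction of the KL divergence (steered relative to clean) is an assumption here, since this summary does not pin down which direction the paper reports.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def confidence(p):
    """Probability assigned to the most likely token."""
    return max(p)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over three candidate tokens (not paper data).
clean = [0.5, 0.3, 0.2]
steered_later = [0.8, 0.15, 0.05]    # hypothetical: steered toward a later step
steered_earlier = [0.4, 0.35, 0.25]  # hypothetical: steered toward an earlier step

entropy_drift = entropy(steered_later) - entropy(clean)
confidence_drift = confidence(steered_later) - confidence(clean)
kl_later = kl_divergence(steered_later, clean)  # assumed direction: steered vs. clean
kl_earlier = kl_divergence(steered_earlier, clean)
```

In the toy numbers the "later" distribution is sharper, so entropy falls and confidence rises, which matches the qualitative pattern described above.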

The random perturbation baseline is weaker and less coherent. The paper reports that comparable random perturbations induce about half the KL divergence of the denoising-time steering. It also shows that early-layer perturbations can be corrected by later computation, although extreme targets are harder to erase. This is not a mystical clock. It is a usable control surface whose effects can be measured.
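A norm-matched control is what makes that comparison fair: the random vector has the same length as the steering vector, so any difference in effect comes from direction rather than magnitude. The sketch below assumes the steering vector is a difference between the mean activations of two bins, which is one plausible construction and not necessarily the paper's exact formula. All names and values are hypothetical.

```python
import math
import random

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def steering_delta(mean_current, mean_target, alpha=1.0):
    """Assumed steering vector: scaled difference between two bin means."""
    return [alpha * (t - c) for c, t in zip(mean_current, mean_target)]

def norm_matched_random(delta, rng):
    """Random Gaussian direction rescaled to have the same norm as delta."""
    raw = [rng.gauss(0.0, 1.0) for _ in delta]
    raw_norm = norm(raw)
    if raw_norm == 0.0:
        raise ValueError("degenerate random draw")
    scale = norm(delta) / raw_norm
    return [x * scale for x in raw]

def apply_perturbation(hidden, delta):
    """Add a perturbation to a hidden-state vector."""
    return [h + d for h, d in zip(hidden, delta)]

rng = random.Random(0)  # fixed seed so the control is reproducible
mean_current = [0.2, -0.1, 0.4, 0.0]  # toy mean vector for the current bin
mean_target = [0.5, 0.3, 0.1, 0.2]    # toy mean vector for the target bin
hidden = [1.0, 0.0, -1.0, 0.5]        # toy hidden state

delta = steering_delta(mean_current, mean_target)
noise = norm_matched_random(delta, rng)
steered = apply_perturbation(hidden, delta)
control = apply_perturbation(hidden, noise)
```

Recording the seed for the control draw matters for audits, because the baseline is only comparable across runs if the random direction can be regenerated.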

The geometry is compact. Across layers, most of the relevant mean-vector variance can be explained by fewer than three principal components, and steering only within the first two principal components reproduces the full intervention's behavior. LLaDA shows strong cross-layer alignment except for its final layer, while Dream organizes the representation into more heterogeneous layer blocks.
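Restricting an intervention to the first two principal components amounts to projecting the steering vector onto a two-dimensional subspace. The sketch below assumes the two directions are already computed and orthonormal. It uses axis-aligned toy directions for readability and does not perform the PCA itself.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_onto_basis(v, basis):
    """Project v onto the span of an orthonormal basis (list of unit vectors)."""
    out = [0.0] * len(v)
    for u in basis:
        coeff = dot(v, u)
        out = [o + coeff * x for o, x in zip(out, u)]
    return out

# Hypothetical orthonormal directions standing in for the first two components.
pc1 = [1.0, 0.0, 0.0, 0.0]
pc2 = [0.0, 1.0, 0.0, 0.0]

full_delta = [0.6, -0.8, 0.3, 0.1]  # toy full steering vector
low_rank_delta = project_onto_basis(full_delta, [pc1, pc2])

# Share of the steering vector's squared norm kept by the projection.
retained = dot(low_rank_delta, low_rank_delta) / dot(full_delta, full_delta)
```

If the projected vector reproduces the full intervention's effect on entropy, confidence, and KL, the remaining dimensions contribute little to the clock signal.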

The Clock Receipt

A denoising-clock receipt should travel with any serious evaluation of this model family. It should name the model checkpoint, diffusion schedule, generation length, number of denoising steps, unmasking policy, prompt, random seed, layer probes, steering interventions, target bins, entropy drift, confidence drift, KL drift, token-level changes, random-baseline comparison, and downstream task deltas.

The receipt should also mark what is inferred rather than directly specified. The paper's point is that the timestep is latent. The system has learned an internal representation of progress, but the authors do not identify the full circuit that constructs it. For audits, that distinction matters. A probe can make a signal visible without making the mechanism fully understood.
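One way to make the receipt concrete is a small serializable record. The sketch below is a hypothetical schema built from the fields listed above. Every field name and example value is an assumption for illustration, not a format proposed by the paper.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SteeringIntervention:
    layer: int
    source_bin: int
    target_bin: int
    entropy_drift: float
    confidence_drift: float
    kl_drift: float
    random_baseline_kl: float  # KL from a norm-matched random perturbation

@dataclass
class ClockReceipt:
    model_checkpoint: str
    diffusion_schedule: str
    generation_length: int
    denoising_steps: int
    unmasking_policy: str
    prompt: str
    random_seed: int
    probed_layers: list
    interventions: list = field(default_factory=list)
    token_level_changes: list = field(default_factory=list)
    downstream_task_deltas: dict = field(default_factory=dict)
    # Names of quantities that were inferred by probes, not directly specified.
    inferred_fields: list = field(default_factory=list)

    def to_json(self):
        """Serialize the receipt, including nested interventions, to JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# All values below are placeholders, not measurements.
receipt = ClockReceipt(
    model_checkpoint="example-checkpoint",
    diffusion_schedule="example-schedule",
    generation_length=128,
    denoising_steps=64,
    unmasking_policy="example-policy",
    prompt="example prompt",
    random_seed=0,
    probed_layers=[29],
    interventions=[SteeringIntervention(29, 40, 80, -0.2, 0.1, 0.3, 0.15)],
    inferred_fields=["latent_timestep"],
)
serialized = receipt.to_json()
```

Keeping `inferred_fields` separate from the specified settings preserves the distinction between what the operator configured and what a probe merely made visible.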

Limits

The paper's limitation section keeps the claim bounded. The analysis is restricted to LLaDA and Dream, both trained with cross-entropy over masked tokens. It does not establish whether the same phenomenon appears in other diffusion language model designs, including block-diffusion systems. It also leaves open whether the signal can improve decoding or remasking strategies.

The authors also say token-level steering effects remain unexplored: their analysis focuses on sequence-level statistics such as entropy, confidence, and KL divergence, not a fine-grained account of which tokens change and why. The ethical-considerations section says the work analyzes publicly released models and involves no human subjects, new data collection, or new model release. That makes this a mechanics paper, but its audit implication is larger: hidden model time is still operational state.
