Blog · arXiv Analysis · Last reviewed June 24, 2026

The Support Frequency Becomes the Rule Survival Filter

The June 2026 arXiv paper "Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining," by Juliana Li and Diya Sreedhar, studies a small language model that learns a grammatical rule during pretraining and then stops applying it before training ends.

Capability Is Not a Checkpoint

The paper, arXiv:2606.26050v1 [cs.LG], was submitted on June 24, 2026. Its most useful governance lesson is modest and severe: seeing a model perform a capability at one checkpoint does not prove the final model will keep it. Li and Sreedhar call the within-run reversal natural ungrokking. A rule appears, generalizes to held-out probes, and later vanishes from behavior even though the relevant evidence remains in the stationary training stream.

That matters because AI evaluation often treats a checkpoint as a durable fact about the system. The paper asks for a different object: a trajectory. Did the rule emerge, stabilize, collapse, or get displaced by a more frequent competing pattern? For institutions that use benchmarks, model cards, or release gates, the difference is not merely semantic. A passing mid-training snapshot can become a false memory of the system.

What Natural Ungrokking Measures

The focal experiment uses an 11.5M-parameter, four-layer decoder-only transformer with an 8192-symbol BPE vocabulary, trained for 4,400 steps. The rule under study is pronoun-gender resolution. Given a prefix such as "Sue cried because", the rule-compatible continuation is "she", while a corpus-wide surface prior can favor "he". The model reaches 0.94 held-out conflict accuracy by step 925, then falls near 0.00 by steps 3,500 to 4,400, while agree controls remain solved.
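
Reading a run as a trajectory rather than a snapshot can be sketched as a small classifier over checkpointed accuracy. This is a minimal illustration, not the authors' procedure: the function name, the 0.9 and 0.1 thresholds, and the exact late-run values (0.0 standing in for "near 0.00") are assumptions.

```python
from typing import Sequence, Tuple


def classify_trajectory(
    points: Sequence[Tuple[int, float]],
    emerge_at: float = 0.9,    # assumed threshold, not from the paper
    collapse_at: float = 0.1,  # assumed threshold, not from the paper
) -> str:
    """Label a series of (step, held-out conflict accuracy) pairs."""
    ordered = sorted(points)
    # First checkpoint at which the rule counts as having emerged, if any.
    emerged_step = next((s for s, acc in ordered if acc >= emerge_at), None)
    if emerged_step is None:
        return "never_emerged"
    _, final_acc = ordered[-1]
    if final_acc >= emerge_at:
        return "survived"
    if final_acc <= collapse_at:
        return "emerged_then_collapsed"
    return "emerged_then_degraded"


# 0.94 at step 925 is from the article; the 0.0 values are stand-ins
# for the reported "near 0.00" at steps 3,500 to 4,400.
run = [(925, 0.94), (3500, 0.0), (4400, 0.0)]
print(classify_trajectory(run))  # emerged_then_collapsed
```

A release gate that only looked at step 925 would record a pass; the label above depends on the final checkpoint as well.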

The authors compare TinyStories, where the focal rule is densely supported, with a filtered web corpus derived from ClimbMix, where support falls below the paper's registered counting floor. Across un-intervened runs spanning two corpora, three data budgets, and three seeds, the focal rule survives in all nine TinyStories cells and in no web cell. Data-to-parameter ratio changes how deep the failure becomes, but it does not flip the survival verdict in that grid.

Support Frequency Decides Fate

The paper's central statistic is support frequency: how often the training stream shows the rule winning under a frozen counting procedure. The result is not that the model forgets a construction wholesale. The authors describe a displacement: the construction remains available, but a competing surface pattern wins when the rule and prior conflict.
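
To make the idea of a support counter concrete, here is a toy version for the pronoun rule. It is a sketch under stated assumptions: the name list, the regular expression, and the function names are hypothetical, and the paper's frozen counting procedure is not reproduced here.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical cue list; a real counter would use a fixed, documented one.
GIRL_NAMES = ("Sue", "Anna", "Lily")

# Girl-name cue, then "because", then the pronoun that follows it,
# all within a single sentence.
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, GIRL_NAMES)) + r")\b"
    r"[^.!?]*?\bbecause\s+(she|he)\b"
)


def count_support(lines: Iterable[str]) -> Counter:
    """Count contexts where the rule ('she') or the prior ('he') wins."""
    counts: Counter = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            counts["rule" if match.group(1) == "she" else "prior"] += 1
    return counts


def rule_to_prior_ratio(counts: Counter) -> float:
    """Ratio of rule-supporting to prior-supporting contexts."""
    if counts["prior"] == 0:
        return float("inf")
    return counts["rule"] / counts["prior"]


# Made-up sentences for illustration only.
toy_corpus = [
    "Sue cried because she fell.",
    "Anna smiled because he waved.",
    "Tom ran because he was late.",
    "Lily sang because she was happy.",
]
counts = count_support(toy_corpus)
print(counts["rule"], counts["prior"], rule_to_prior_ratio(counts))
```

Freezing the pattern and the cue list before looking at results is what makes a count like this comparable across corpora.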

They track that displacement with a contrast margin, the log-probability margin between the rule-conforming continuation and the prior-conforming continuation on a frozen prompt set. In base runs, the final margin's sign separates recovered and displaced cells. In valid web seeds, the margin crosses zero within one 100-step checkpoint of the behavioral collapse. Public Pythia checkpoints from 70M to 1.4B parameters and OLMo-1B show a related emerge-then-collapse signature, with end-of-training survival ordered by Pythia scale; the authors report Spearman ρ = 0.894 for that scale ordering.
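
The contrast margin and its zero crossing can be sketched in a few lines. This assumes the margin is averaged over prompts and that per-prompt log-probabilities are already available as dictionaries; the aggregation choice, the function names, and all numeric values below are illustrative assumptions, not figures from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

LogProbs = Dict[str, float]  # next-token log-probabilities for one prompt


def contrast_margin(
    per_prompt: List[LogProbs], rule_tok: str = "she", prior_tok: str = "he"
) -> float:
    """Mean of log p(rule token) - log p(prior token) over a frozen prompt set."""
    return mean(lp[rule_tok] - lp[prior_tok] for lp in per_prompt)


def first_zero_crossing(margins: List[Tuple[int, float]]) -> Optional[int]:
    """First checkpoint step at which the margin goes from positive to <= 0."""
    ordered = sorted(margins)
    for (_, prev), (step, cur) in zip(ordered, ordered[1:]):
        if prev > 0 >= cur:
            return step
    return None


# Made-up log-probabilities for two prompts at one checkpoint.
checkpoint = [{"she": -0.5, "he": -2.5}, {"she": -1.0, "he": -2.0}]
print(contrast_margin(checkpoint))

# Made-up margin series at 100-step checkpoint spacing.
series = [(3300, 0.8), (3400, 0.2), (3500, -0.4)]
print(first_zero_crossing(series))
```

A positive margin means the rule-conforming continuation is preferred; the crossing step is what gets compared against the step of behavioral collapse.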

Killing Is Easier Than Restoring

The causal section is the sharp part. In a surviving TinyStories cell, the authors flip rule-supporting pronoun tokens after girl-name cues at pre-registered rates. Full flipping kills the rule in three of three seeds and moves the contrast margin from +3.68 in the base condition to -2.99. A second rule, a/an allomorphy, is tested inside one TinyStories corpus through a five-dose ladder; final held-out conflict accuracy falls monotonically from 0.96 to 0.00 while unrelated families stay close to baseline.
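
The flipping intervention can be pictured as a token-level rewrite applied at a chosen rate. The sketch below is an assumption-heavy toy: it works on whitespace tokens rather than BPE, and the name set, the six-token window, and the function name are hypothetical rather than taken from the paper.

```python
import random
from typing import List, Sequence

GIRL_NAMES = {"Sue", "Anna", "Lily"}   # hypothetical cue set
FLIP = {"she": "he", "She": "He"}      # rule-supporting -> prior-supporting


def flip_rule_tokens(
    tokens: Sequence[str], rate: float, rng: random.Random, window: int = 6
) -> List[str]:
    """Flip rule-supporting pronouns that follow a girl-name cue.

    Each eligible pronoun within `window` tokens of the most recent cue
    is flipped independently with probability `rate`.
    """
    out = list(tokens)
    since_cue = None  # tokens elapsed since the last cue, or None
    for i, tok in enumerate(tokens):
        if tok in GIRL_NAMES:
            since_cue = 0
            continue
        if since_cue is None:
            continue
        since_cue += 1
        if since_cue > window:
            since_cue = None
        elif tok in FLIP and rng.random() < rate:
            out[i] = FLIP[tok]
    return out


rng = random.Random(0)
sentence = "Sue cried because she fell".split()
print(flip_rule_tokens(sentence, rate=1.0, rng=rng))  # full flipping
print(flip_rule_tokens(sentence, rate=0.0, rng=rng))  # untouched baseline
```

Sweeping `rate` between 0 and 1 gives a dose ladder of the kind described for the a/an experiment, with pronouns outside a cue window left alone.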

The reverse intervention does not mirror the kill. The authors inject matching evidence into the collapsed web corpus up to three times the TinyStories support density, with post-injection rule-to-prior ratios above the surviving TinyStories cell. The margin moves, and one head-level signal partially re-forms, but no seed produces a control-valid behavioral recovery. The paper's practical claim is therefore not just that data matters. It is that evidence can be easier to remove than to reconstitute after training dynamics have consolidated another pattern.

The Governance Problem

Natural ungrokking turns training data governance into capability governance. A data filter that thins a rare rule's support may leave examples in place and still alter the final model's behavior. Continual pretraining on a shifted mix may silently undo a capability a base model once had. Ordinary loss curves may not flag the change because the model can keep fluent prediction while abandoning a contested rule.

This page belongs near earlier Spiralist notes on training-set feedback, data-sheet supply chains, training opt-out interfaces, correction layers, and context compaction as policy deletion. The common theme is that information does not merely enter a model. It competes for survival inside a finite system.

Limits That Matter

The paper is careful about scope. The causal interventions are established at 11.5M parameters. The public checkpoint evidence shows a related phenomenon and a scale ordering up to 1.4B parameters, but the intervention has not been run at larger scale. The primary verdicts use templated probes. An out-of-distribution battery supports the main pattern, but naturalistic evaluation is left for future work.

The asymmetry claim also has narrow mechanics. The behavioral timing test covers three seeds and one rule in the early-window variant, and the valid mechanism comparison is thinner still. The authors also say the remaining quarantined seed set is unread by design. Those limits do not erase the governance lesson; they tell auditors how much weight to put on it.

Governance Standard

A serious pretraining report should separate loss curves from rule trajectories. It should identify the rule families tested, support-frequency counters, data filters, data-to-parameter ratios, checkpoint spacing, frozen probes, agree controls, contrast margins, and failed registered predictions. A model card should not simply say a capability was observed. It should say when it appeared, whether it survived, and what training evidence supported it.
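
One way to picture such a report entry is as a structured record per rule family. The schema below is a hypothetical sketch, not a format proposed by the paper; field names are assumptions, and values the article does not state are left as None.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RuleTrajectoryEntry:
    rule_family: str
    support_counter: str                 # name of the frozen counting procedure
    data_filters: List[str]
    checkpoint_spacing: int              # steps between evaluated checkpoints
    emerged_step: Optional[int]          # None if the rule never emerged
    survived: bool                       # verdict at the final checkpoint
    agree_controls_solved: bool
    data_to_param_ratio: Optional[float] = None
    final_contrast_margin: Optional[float] = None
    failed_registered_predictions: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line statement of when the rule appeared and whether it survived."""
        if self.emerged_step is None:
            return f"{self.rule_family}: never emerged"
        fate = "survived" if self.survived else "did not survive"
        return f"{self.rule_family}: emerged at step {self.emerged_step}, {fate}"


# Step 925, the 100-step spacing, and the solved agree controls follow the
# article; the counter and filter names are placeholders.
entry = RuleTrajectoryEntry(
    rule_family="pronoun-gender resolution",
    support_counter="placeholder-counter-v0",
    data_filters=["placeholder-web-filter"],
    checkpoint_spacing=100,
    emerged_step=925,
    survived=False,
    agree_controls_solved=True,
)
print(entry.summary())
```

The point of the record is that "observed" is never stored on its own: emergence step and final verdict travel together.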

The governance demand is not mystical. It is bookkeeping with consequences. If a model's behavior depends on which rules survive a corpus contest, then dataset decisions, filtering policies, and continual-pretraining mixes are part of the control surface. Natural ungrokking gives that control surface a name and a warning label: a capability can be real at step 925 and gone by the end.
