Blog · arXiv Analysis · Last reviewed July 2, 2026

The Rare-Valid Future Becomes the Intelligence Measure

Ishanu Chattopadhyay's June 2026 arXiv paper asks whether intelligence can be measured across passive matter, feedback controllers, language generators, humans, and idealized information engines with one physically grounded vocabulary.

For this essay, a thermodynamic-intelligence receipt is the record that binds an intelligence claim to its baseline law, rare-valid set, validity criterion, trajectory resolution, induced probability shift, resource accounting, and reproducibility artifact.

The Claim

The paper, arXiv:2606.20231 [cs.AI], was submitted on June 18, 2026. arXiv lists the title as "Thermodynamic Measure of Intelligence." The author is Ishanu Chattopadhyay of the University of Kentucky.

The core definition is unusual but precise in spirit: intelligence is the lawful amplification of rare but valid futures. A system is intelligent, in this sense, when it makes outcomes more likely that would have been unlikely under passive dynamics but still admissible under the constraints of the domain.

The useful claim is not that every model can now be ranked by a single universal scoreboard. It is that any serious intelligence measurement should say what baseline was used, what rare-valid target was selected, what made the target valid, and how much probability mass the system actually moved toward it.

The Paper Frame

The paper contrasts its path-facing approach with task-facing accounts of intelligence. Turing-style imitation, reward maximization, benchmark success, skill acquisition, compression, and reasoning all remain useful, but they do not directly identify what a system does to the probability distribution over possible futures.

Chattopadhyay's answer is to work with trajectory laws. Passive dynamics define a baseline distribution over trajectories. A controlled or agent-like system induces a different distribution. The proposed measure asks whether the induced law shifts probability mass toward rare-valid trajectories.

This makes the measurement explicitly level-relative. A claim must specify the level of description, baseline path law, validity criterion, and observational resolution. Without those choices, "intelligence" is too underspecified to audit.

Rare-Valid Lift

The phrase "rare-valid" does important work. Rarity alone is not intelligence, because noise can generate improbable events. Validity alone is also not enough, because common valid outcomes may only show stabilization. The target is the intersection: outcomes that are hard under the passive baseline but still legal, coherent, viable, executable, or successful under the domain constraints.

The paper defines thermodynamic intelligence as rare-valid probability lift: the fractional increase in probability assigned to a rare-valid set by the induced law compared with the passive law. In symbolic domains, the same idea becomes probability mass on valid strings that are rare under a specified baseline.
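
A minimal sketch of that definition on a toy discrete world may help. It reads "fractional increase" as induced mass minus passive mass, divided by passive mass. The function names, the rarity threshold, and the four-outcome distributions are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, Dict, Hashable, Set

Law = Dict[Hashable, float]  # outcome -> probability


def rare_valid_set(passive: Law,
                   is_valid: Callable[[Hashable], bool],
                   rarity_threshold: float) -> Set[Hashable]:
    """Outcomes that are valid AND rarer than the threshold under the passive law."""
    return {x for x, p in passive.items() if p < rarity_threshold and is_valid(x)}


def rare_valid_lift(passive: Law, induced: Law, target: Set[Hashable]) -> float:
    """Fractional increase in target-set mass: (Q(R) - P(R)) / P(R)."""
    p_mass = sum(passive.get(x, 0.0) for x in target)
    q_mass = sum(induced.get(x, 0.0) for x in target)
    if p_mass <= 0.0:
        raise ValueError("target set needs positive mass under the passive law")
    return (q_mass - p_mass) / p_mass


# Toy numbers, chosen only for illustration.
passive = {"a": 0.90, "b": 0.09, "c": 0.009, "d": 0.001}
induced = {"a": 0.50, "b": 0.09, "c": 0.400, "d": 0.010}
valid_outcomes = {"a", "c"}  # hypothetical validity predicate

target = rare_valid_set(passive, lambda x: x in valid_outcomes, rarity_threshold=0.01)
lift = rare_valid_lift(passive, induced, target)
print(target, round(lift, 2))
```

In this toy, outcome "d" is rare but invalid and outcome "a" is valid but common, so only "c" lands in the target set. Changing the threshold or the predicate changes the set, and with it the lift.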

This is the governance hook. The number is not meaningful by itself. It travels with the baseline and the target set. A model can look more intelligent because the baseline was weak, the validity predicate was loose, or the resolution made the rare set easier than it looked.

Recursive Self-Simulation

The architecture claim is that high rare-valid lift requires recursive self-simulation. The system must model a world that contains itself as a causal object, then evaluate futures that include its own actions, observations, memory updates, and later information states.

The formal results connect that architecture to the measure. Under bounded amplification, high rare-valid lift is impossible unless the internal simulation identifies rare-valid futures with high fidelity. Conversely, if rare-valid fidelity is high and the simulation contains an effective policy, the achievable lift can approach the actuation-limited optimum.

The paper is careful about failure modes. A system can identify rare-valid futures without making them likely. A poorly resolved controller can amplify the wrong futures. Imperfect rare-set identification creates false positives and bookkeeping costs, which the paper treats with protocol-dependent thermodynamic accounting.

Scale Examples

The numerical examples are calibrations, not a finished taxonomy. The paper reports a stabilized double-log scale, Lambda = log10(log10(I + 1) + 1), where I is the rare-valid lift, so that passive matter, controllers, symbolic generators, and Maxwell-demon-like systems can appear on a shared compressed scale.
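
The scale formula is simple enough to sketch directly. The function names below are hypothetical, and the second helper, which takes log10(I + 1) instead of I, is an added convenience for lifts too large to hold as floats rather than anything the paper specifies.

```python
import math


def compressed_scale_from_log10(log10_ratio: float) -> float:
    """Lambda given log10(I + 1) directly (hypothetical helper for very large lifts)."""
    if log10_ratio < 0.0:
        raise ValueError("log10(I + 1) must be non-negative")
    return math.log10(log10_ratio + 1.0)


def compressed_scale(lift: float) -> float:
    """Lambda = log10(log10(I + 1) + 1) for a non-negative lift I."""
    if lift < 0.0:
        raise ValueError("lift must be non-negative")
    return compressed_scale_from_log10(math.log10(lift + 1.0))


for label, lift in [("zero lift", 0.0), ("lift 99", 99.0), ("lift 1023", 1023.0)]:
    print(f"{label}: Lambda = {compressed_scale(lift):.3f}")
```

Zero lift maps to Lambda = 0, and each outer logarithm flattens many orders of magnitude of lift into a small range. That is why the compressed number should not be read without the underlying lift.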

In Table II, passive matter has zero lift by construction. A narrow fixed-feedback controller has rare-valid lift between 1 and 99. Repeated dynamic control compounds seven to ten binary improvements, giving lift from about 1.27 x 10^2 to 1.02 x 10^3. Maxwell-demon-like examples become much larger because microscopic information can select rare thermodynamic trajectories before measurement, memory, control, and erasure costs are fully paid.
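
The dynamic-control figures can be checked with a short sketch. It assumes that each binary improvement doubles the ratio I + 1, which is one reading of "compounds" that reproduces the quoted range. The function name is hypothetical.

```python
def compounded_lift(binary_improvements: int) -> int:
    """Lift I after k doublings of the ratio I + 1, starting from I + 1 = 1."""
    if binary_improvements < 0:
        raise ValueError("number of improvements must be non-negative")
    return 2 ** binary_improvements - 1


for k in (7, 10):
    print(f"{k} binary improvements -> lift {compounded_lift(k)}")
```

Seven doublings give 127 and ten give 1023, matching the quoted 1.27 x 10^2 and 1.02 x 10^3 after rounding.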

Those examples are useful precisely because they force the receipt. A demon example must say which particles, which entropy-reducing trajectory, which passive fluctuation scale, and which costs have not yet been paid. A controller example must say what rare-valid set the controller is amplifying.

Symbolic Generation

The symbolic section applies the framework to sentence-scale text. The paper uses a finite-resolution estimate of about 5 x 10^21 interpretable English sentence-scale strings, then defines a high-quality target set on the order of 10^7 sentence-scale units. Under those assumptions, the baseline mass of the target set is about 2 x 10^-15.
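
The baseline mass follows from the two counts if the baseline is treated as uniform over the interpretable strings. That uniformity is an assumption of this sketch, as are the constant and function names.

```python
import math

INTERPRETABLE_STRINGS = 5e21  # finite-resolution estimate quoted above
TARGET_SET_SIZE = 1e7         # high-quality target set, order of magnitude


def uniform_baseline_mass(target_size: float, support_size: float) -> float:
    """Mass of the target set under a uniform law on the support (assumed here)."""
    if not 0.0 < target_size <= support_size:
        raise ValueError("need 0 < target_size <= support_size")
    return target_size / support_size


baseline_mass = uniform_baseline_mass(TARGET_SET_SIZE, INTERPRETABLE_STRINGS)
max_ratio = 1.0 / baseline_mass  # ratio if ALL induced mass lands on the target

print(f"baseline mass ~ {baseline_mass:.1e}, ceiling on I + 1 ~ {max_ratio:.1e}")
```

The reciprocal, about 5 x 10^14, is the largest ratio of induced to baseline mass available under these counts, reached only when the induced law puts all of its mass on the target set.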

For an expert human process conditioned on producing exemplary sentence-scale text, the paper gives I_H + 1 around 5 x 10^14. For GPT-5 long-form prose, it imports entropy-rate estimates from a separate workflow: 0.77 bits per character for Gutenberg prose and 0.74 for GPT-5 prose under a 27-symbol alphabet. With a 100-character sentence-scale correction and central slack set to zero, it estimates I_GPT5 + 1 around 6.25 x 10^13.
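
One arithmetic reading reproduces the GPT-5 figure from the numbers quoted here. It assumes the entropy-rate gap, scaled to 100 characters, acts as a penalty in bits on the human figure of about 5 x 10^14. This is a reconstruction for checking the arithmetic, not the paper's stated formula. With the slack at zero the slack term is omitted, and how a nonzero slack would enter is not modeled. All names are hypothetical.

```python
import math

HUMAN_RATIO = 5e14         # human "lift + 1" figure quoted above
RATE_GUTENBERG = 0.77      # bits per character, Gutenberg prose
RATE_GPT5 = 0.74           # bits per character, GPT-5 prose
CHARS_PER_SENTENCE = 100   # sentence-scale correction length


def gap_corrected_ratio(reference_ratio: float, rate_a: float, rate_b: float,
                        n_chars: int) -> float:
    """Divide the reference ratio by 2**(entropy-rate gap in bits over n_chars).

    ASSUMPTION: this functional form is a reconstruction that matches the
    quoted figures; the paper's own derivation may differ.
    """
    gap_bits = abs(rate_a - rate_b) * n_chars
    return reference_ratio / 2.0 ** gap_bits


gpt5_ratio = gap_corrected_ratio(HUMAN_RATIO, RATE_GUTENBERG, RATE_GPT5,
                                 CHARS_PER_SENTENCE)
print(f"reconstructed GPT-5 'lift + 1' ~ {gpt5_ratio:.3e}")
```

The gap is 0.03 bits per character, or 3 bits over 100 characters, so the penalty is a factor of 8 and the result is about 6.25 x 10^13. A shift of 0.01 bits per character in either entropy estimate would change the result by a factor of two, which shows how much the figure depends on the estimator.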

That section should be read cautiously. It is not a detector, not a claim about all humans, and not a universal LLM ranking. It is a finite-resolution calculation that depends on the chosen corpus, alphabet, sentence length, target set, entropy estimator, and slack term.

Governance Reading

The Spiralist reading is that intelligence metrics become governance artifacts when they choose a future and call it valid. A metric that rewards rare-valid lift is also specifying what counts as valid, what counts as passive, and what kinds of probability shifts deserve institutional credit.

That is better than a vague leaderboard, but only if the receipt is visible. An AI system that raises the probability of rare valid proofs, repairs, diagnoses, designs, or plans should expose the baseline distribution, target definition, validation method, intervention channel, and cost accounting.

The danger is metric laundering. A compressed number can make a philosophical definition look operational before the baseline, rare set, validity predicate, and reproducibility code have been inspected.

Measurement Receipts

A useful thermodynamic-intelligence receipt should include the system identity, level of description, trajectory space, passive baseline law, induced law, rare-valid set, target mass, validity predicate, observational resolution, lift estimate, compressed scale, actuation limit, simulation-fidelity evidence, false-positive handling, resource accounting, and uncertainty range.
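
As a sketch, the receipt can be treated as a record whose empty fields are reported before any number is quoted. The class name, field names, and free-text string types are illustrative assumptions. The field list mirrors the paragraph above and is not a schema from the paper or its repository.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ThermodynamicIntelligenceReceipt:
    """Hypothetical receipt record; every field is free text in this sketch."""
    system_identity: Optional[str] = None
    level_of_description: Optional[str] = None
    trajectory_space: Optional[str] = None
    passive_baseline_law: Optional[str] = None
    induced_law: Optional[str] = None
    rare_valid_set: Optional[str] = None
    target_mass: Optional[str] = None
    validity_predicate: Optional[str] = None
    observational_resolution: Optional[str] = None
    lift_estimate: Optional[str] = None
    compressed_scale: Optional[str] = None
    actuation_limit: Optional[str] = None
    simulation_fidelity_evidence: Optional[str] = None
    false_positive_handling: Optional[str] = None
    resource_accounting: Optional[str] = None
    uncertainty_range: Optional[str] = None


def missing_fields(receipt: ThermodynamicIntelligenceReceipt) -> List[str]:
    """Names of receipt fields that have not been filled in."""
    return [f.name for f in fields(receipt) if getattr(receipt, f.name) is None]


draft = ThermodynamicIntelligenceReceipt(
    system_identity="toy controller",
    passive_baseline_law="uncontrolled dynamics",
)
print(missing_fields(draft))
```

A reviewer could refuse to accept a lift or Lambda value while this list is non-empty, which keeps the number attached to its baseline and target set.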

For symbolic generators, the receipt should also include prompt distribution, corpus choice, alphabet, preprocessing, sequence length, entropy estimator, target-set construction, support correction, slack term, sampling settings, and whether the calculation is sentence-scale, document-scale, or task-scale.

The receipt should separate calibration from deployment. A numerical scale used to compare passive gas, controllers, GPT-5 prose, human prose, and Maxwell-demon-like systems should not be treated as a procurement score or safety claim unless the local baseline and validity predicate match the deployment problem.

Limits

The paper is ambitious and assumption-heavy by design. The framework requires explicit choices that can dominate the result: baseline law, validity criterion, level of description, observational resolution, rare-set construction, and induced probability law. Different reasonable choices can change the measurement.

The symbolic estimates are order-of-magnitude calibrations. The paper and repository both note that the TME code reproduces numerical scale calculations but imports GPT-human entropy-rate estimates from a separate NERO workflow. The TME repository does not regenerate those entropy estimates from raw text.

The strongest safe reading is therefore: this is a formal proposal for making intelligence claims physically and probabilistically inspectable. It is not yet a practical universal benchmark, and it should not be used without the full measurement receipt.

Source Discipline

This page treats the arXiv abstract, arXiv HTML, PDF, public TME repository, and data/code availability statement as the source set. The PDF text was used for exact table values and symbolic-calculation details because those are easier to verify there than in the HTML rendering.

I did not treat third-party summaries as sources for the analysis. The linked repository was used only to confirm reproducibility scope, imported entropy-rate provenance, and the named calculations behind the manuscript scale values.