Blog · arXiv Analysis · Last reviewed June 25, 2026

The Historical Text Becomes the Tokenization Tax

A June 2026 arXiv paper by Maria Levchenko studies why historical Italian is difficult for language models. The useful lesson is not that old text is simply broken input. It is that historical language raises several separable questions: how much tokenization overhead it adds, how surprising it is to predict, how robust its semantic representation remains, and how sensitive the model is to a small amount of temporal context.

Fresh Angle

The paper is How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation, arXiv:2606.27275, submitted June 25, 2026. It fits the site's concerns about AI, libraries, memory, and machine reading, but it is not a general lament about old books. Levchenko asks a more useful operational question: when a model struggles with historical text, which part of the pipeline is paying the price?

This is distinct from nearby pages on machine interpretation, translation cascades, and source identity. Those pages look at language crossing institutions and prompts. This one looks at historical language as a measurement problem inside digital-library workflows.

Four Costs

The paper separates historical-language difficulty into four dimensions: tokenization cost, predictive uncertainty or surprisal, semantic robustness, and context sensitivity. That decomposition matters because a single "hard text" label hides different risks. A tokenization tax means the model spends more tokens to encode the same passage. A surprisal tax means the model assigns lower probability to the tokens that actually come next. A semantic failure would mean the representation of meaning degrades. A context problem means the model can improve when it receives a small, explicit cue about the period.
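To make the surprisal dimension concrete, here is a minimal sketch using the standard information-theoretic definition: the surprisal of a token is the negative log of the probability the model gave it. The function names and the probability lists are hypothetical illustrations, not values or code from the paper, and the paper's exact metric may be computed differently.

```python
import math

def token_surprisal_bits(prob: float) -> float:
    """Surprisal of one observed token: -log2 of the probability assigned to it."""
    return -math.log2(prob)

def mean_surprisal_bits(token_probs: list[float]) -> float:
    """Average surprisal per token over a passage."""
    return sum(token_surprisal_bits(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities a model gave to the tokens that
# actually occurred. These are invented for illustration only.
familiar_passage = [0.5, 0.25, 0.5, 0.25]
unfamiliar_passage = [0.125, 0.0625, 0.125, 0.0625]

familiar_mean = mean_surprisal_bits(familiar_passage)
unfamiliar_mean = mean_surprisal_bits(unfamiliar_passage)
print(familiar_mean, unfamiliar_mean)
```

Lower probabilities on the observed tokens translate directly into a higher mean surprisal, which is the quantity a surprisal tax would raise.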

For governance, the categories point to different remedies. Tokenization overhead affects cost and latency. Surprisal affects generation, paraphrase, and completion. Semantic robustness affects retrieval and clustering. Context sensitivity affects prompt templates and metadata design.

Three Corpora

Levchenko evaluates three datasets across three centuries. The first is a newly curated 17th-century Italian corpus from 1610-1689, digitized from original page images and including genres from religious treatises to academic prose. The second is Alessandro Manzoni's 19th-century I Promessi Sposi, used as a high-exposure Italian control. The third is 18th-century Russian civil print, used as a contrastive orthographic stress test.

The contrast is careful. The Russian material creates severe orthographic disruption, but the paper treats it as different from deep lexical and syntactic distance. The Manzoni control helps test whether a canonical text behaves like ordinary historical language or like a familiar pretraining object. The 17th-century Italian corpus is the central case because it combines old spelling, older vocabulary, Latin-influenced syntax, and uneven genre effects.

Encoding and Comprehension

The headline result is a dissociation. Early modern Italian and the Russian stress test impose comparable tokenization penalties, reported at roughly 25 to 30 percent token inflation. But their predictive difficulty differs. The 17th-century Italian text is on average 2.4 times more surprising than its modern equivalent, and academic prose reaches 3.2 times. Russian shows a more modest increase despite heavy orthographic friction.
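The dissociation is easier to see as two separate calculations. The sketch below assumes each historical passage is compared against a modernized version of the same passage; the helper names and every number in the two dictionaries are placeholders chosen for illustration, not data from the paper.

```python
def inflation_pct(historical_tokens: int, modern_tokens: int) -> float:
    """Percentage of extra tokens the historical version needs."""
    return 100.0 * (historical_tokens - modern_tokens) / modern_tokens

def surprisal_ratio(historical_mean: float, modern_mean: float) -> float:
    """How many times as surprising the historical version is."""
    return historical_mean / modern_mean

# Placeholder measurements for two hypothetical corpora (not the paper's data).
corpus_a = {"hist_tokens": 1280, "mod_tokens": 1000,
            "hist_surprisal": 7.2, "mod_surprisal": 3.0}
corpus_b = {"hist_tokens": 1270, "mod_tokens": 1000,
            "hist_surprisal": 3.9, "mod_surprisal": 3.0}

for name, c in (("corpus_a", corpus_a), ("corpus_b", corpus_b)):
    inflation = inflation_pct(c["hist_tokens"], c["mod_tokens"])
    ratio = surprisal_ratio(c["hist_surprisal"], c["mod_surprisal"])
    print(f"{name}: inflation {inflation:.1f}%, surprisal ratio {ratio:.2f}")
```

Two corpora can sit within a point of each other on token inflation and still diverge sharply on surprisal ratio, which is why an audit would record both fields.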

That makes tokenization cost a necessary audit field, not a sufficient diagnosis. The paper also reports that embedding similarity remains robust, above 0.85 across the evaluated datasets. In plain terms, high surprisal does not automatically mean semantic representation has collapsed. A model can find historical text hard to continue while still retaining enough representational stability for semantic retrieval.
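A semantic-robustness check can be sketched as a similarity score between the embedding of a historical passage and the embedding of its modern counterpart. This example assumes cosine similarity and treats 0.85 as an acceptance threshold; both are assumptions for illustration, since the summary above reports only that similarity stayed above 0.85. The vectors are toy values, not real embeddings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_semantically_stable(historical_vec: list[float],
                           modern_vec: list[float],
                           threshold: float = 0.85) -> bool:
    """True when the two embeddings stay at or above the (assumed) threshold."""
    return cosine_similarity(historical_vec, modern_vec) >= threshold

# Toy two-dimensional "embeddings", invented for illustration.
historical_vec = [1.0, 0.0]
modern_vec = [0.9, 0.1]
print(is_semantically_stable(historical_vec, modern_vec))
```

In a real workflow the vectors would come from whatever embedding model the library uses, and the threshold would be validated against human relevance judgments rather than borrowed from a single study.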

A smaller result sharpens the point. Text that has been conservatively and only partially normalized can be more surprising than the original historical text when both are measured against a fully modernized baseline. That suggests the problem is not only strange characters or old spelling. Vocabulary, syntax, genre, and pretraining exposure all matter.

Context Mitigation

The mitigation is deliberately simple: provide minimal temporal context. The paper reports that a small prompt identifying the historical period reduces historical surprisal by approximately 60 percent, with large reductions across several genres. This is not a claim that prompts solve historical language. It is evidence that some model uncertainty comes from missing situational framing rather than from an irreparable inability to represent the text.
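As a rough sketch of what such a mitigation could look like in practice, the helper below prepends one sentence of period information to a passage, and a second helper expresses a before-and-after measurement as a percentage reduction. The prompt wording, function names, and example numbers are hypothetical; the paper's actual prompt and its exact definition of the reduced quantity are not reproduced here.

```python
def with_temporal_context(passage: str, period: str, language: str) -> str:
    """Prepend a one-sentence period cue to the passage (hypothetical wording)."""
    header = f"The following text is {language} from {period}."
    return f"{header}\n\n{passage}"

def relative_reduction_pct(before: float, after: float) -> float:
    """Percentage drop from a measurement taken without context to one with it."""
    return 100.0 * (before - after) / before

prompt = with_temporal_context("Testo", "the 17th century", "Italian")
print(prompt)

# Invented before/after values, chosen only to show the arithmetic.
print(relative_reduction_pct(before=5.0, after=2.0))
```

The point of keeping the header to a single sentence is that the measured effect comes from minimal framing, so an evaluation can compare the same passage with and without it.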

The site implication is mundane and important. A library interface should not silently strip temporal metadata before asking a model to summarize, search, classify, or translate. Date, genre, orthographic note, edition, and source-image provenance are not decorative catalog fields. They are part of the model's operating conditions.
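One way an interface could enforce this is to check which context fields are present before any model call. The sketch below is an assumption about how a library might model the record; the class and field names are hypothetical and simply mirror the fields listed above.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SourceContext:
    """Catalog fields that travel with a passage into a model call."""
    date: Optional[str] = None
    genre: Optional[str] = None
    orthographic_note: Optional[str] = None
    edition: Optional[str] = None
    image_provenance: Optional[str] = None

def missing_context_fields(ctx: SourceContext) -> list[str]:
    """Names of fields that are empty, in declaration order."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name)]

# Hypothetical record with only two fields filled in.
ctx = SourceContext(date="1650", genre="religious treatise")
print(missing_context_fields(ctx))
```

A pipeline could log the missing fields alongside the model output, so that a poor result can be traced to absent context rather than blamed on the text.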

Library Governance

The paper's practical conclusion is cautious. Digital libraries may be able to use language models for semantic retrieval over historical collections when retrieval is validated separately from generation. That is not a license to let a model modernize, summarize, or interpret historical text without review. The same passage can be semantically retrievable and still unstable as generated prose.

A Spiralist rule follows: audit historical AI systems by task, not by collection. Search, deduplication, topic clustering, OCR correction, translation, summarization, and public answer generation should each have their own error log. A single model score on "historical text" is too coarse. The institution needs to know whether it is paying for extra tokens, accepting higher uncertainty, losing meaning, or forgetting to provide context.
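A per-task error log can be very small. The sketch below is one possible shape, assuming four failure-mode labels that mirror the four costs named above; the class, the labels, and the sample entries are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed labels, one per cost discussed in this post.
FAILURE_MODES = {"token_overhead", "high_surprisal", "meaning_loss", "missing_context"}

@dataclass(frozen=True)
class AuditEntry:
    task: str          # e.g. "search", "summarization", "translation"
    item_id: str       # identifier of the passage or record
    failure_mode: str  # must be one of FAILURE_MODES

    def __post_init__(self) -> None:
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode}")

def failures_by_task(entries: list[AuditEntry]) -> dict[str, Counter]:
    """Group failure counts by task so each task keeps its own ledger."""
    counts: dict[str, Counter] = {}
    for entry in entries:
        counts.setdefault(entry.task, Counter())[entry.failure_mode] += 1
    return counts

# Invented sample entries.
log = [
    AuditEntry("summarization", "item-1", "high_surprisal"),
    AuditEntry("summarization", "item-2", "high_surprisal"),
    AuditEntry("search", "item-1", "token_overhead"),
]
report = failures_by_task(log)
print(report)
```

Rejecting unlabeled failures at write time keeps the log from drifting back into a single undifferentiated "historical text" score.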

That rule also protects readers. Historical text is already mediated by scanning, OCR, metadata, cataloging, editions, and classroom habit. Adding an LLM without recording the specific failure mode creates a new anonymous editor. The tax should be visible.

Limits

This is a preprint and a bounded empirical study, not a universal law of historical language. The datasets are purposeful rather than exhaustive, and the paper does not certify every model, period, language, or library workflow. Its value is the audit frame: historical difficulty should be decomposed before it is treated as either harmless noise or evidence that a model cannot assist at all.

The best reading is neither hype nor rejection. Historical collections need model evaluation that preserves dates, editions, scripts, genres, and source provenance. Old text should not become a mystery category. It should become a visible ledger of costs.

Sources

Maria Levchenko, How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation, arXiv:2606.27275 [cs.CL], submitted June 25, 2026.
arXiv PDF: How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation, reviewed for the abstract, datasets, tokenization and surprisal results, semantic-similarity findings, context-prompt mitigation, discussion, and limitations.
arXiv HTML: 2606.27275v1, checked for corpus descriptions, genre details, normalization discussion, and the digital-library framing.
Related pages: The Machine Interpreter Becomes the Language Gate, The Translation Cascade Becomes the Context Receipt, and The Source ID Becomes the Factuality Test.
