Last reviewed June 25, 2026

The Knowledge Base Becomes the Task Interface

A June 2026 arXiv paper by Amit Elhelo, Amir Globerson, and Mor Geva challenges a common shorthand: language models as knowledge bases. The paper's warning is precise. The same fact may be encoded and retrieved differently depending on the task format, so a model audit that asks one kind of question may certify only one interface to the fact.

Fresh Angle

The paper is LMs as Task-Specific Knowledge Bases: An Interpretability Analysis, arXiv:2606.27237 [cs.CL], submitted June 25, 2026. It is not another benchmark leaderboard and not a claim that models possess a human-like store of beliefs. It tests the knowledge-base analogy itself: if a model has acquired a fact, should different query formats draw on the same internal source?

This is a fresh angle for the site because nearby pages on stale memory, vector databases, and source-ID factuality focus on retrieval systems, institutional memory, or citation discipline. Elhelo, Globerson, and Geva look inside parametric memory: the knowledge stored in model weights, and whether it behaves like a unified database at all.

Task-Invariance

The authors define the expected property as task-invariance. A conventional knowledge base should give the same fact through multiple routes: open question, fill-in-the-blank, multiple choice, verification, or a related reasoning prompt. If the model's factual store is genuinely shared, then acquiring the fact in one competent task format should make it available in other competent task formats.

The paper argues that this assumption matters for reliability, editing, and unlearning. If a model gives the correct answer in one format but not another, a single evaluation format can overstate what has been learned. If an edit changes one route to a fact but leaves another route intact, the institution may believe it has updated a model when it has only patched an interface.

Co-Emergence

The behavioral experiment tracks OLMo-3-7B IT across 105 checkpoints: 100 pretraining checkpoints, two midtraining or long-context checkpoints, and three post-training checkpoints. The authors follow facts drawn from subject-relation-object triples and ask whether those facts co-emerge across task formats during training. After filtering, the experiment tests 1,031 fact-task pairs.

The result is blunt. At the main threshold, 47.9 percent of the tested fact-task pairs fail the co-emergence prediction: the fact does not appear in the target task by the step where the shared-knowledge hypothesis would expect it. The pattern remains similar under looser and stricter thresholds, reported as 50.9 percent and 49.2 percent failure rates. That is not proof that no knowledge is shared. It does caution against treating correct recall in one task format as proof of a single stable internal fact.
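To make the failure-rate idea concrete, here is a minimal Python sketch. It is not the paper's code, and the criterion is an assumption chosen for illustration: a fact "emerges" in a task at the first checkpoint from which the model stays correct, and a fact-task pair fails if the target task has not emerged within `tolerance` checkpoints of the source task. The names `emergence_step`, `co_emergence_failure_rate`, and `tolerance` are hypothetical.

```python
from typing import Optional


def emergence_step(correct_by_checkpoint: list[bool]) -> Optional[int]:
    """Return the first checkpoint index from which the answer stays correct.

    Assumption for this sketch: a fact counts as emerged only if it remains
    correct at every later checkpoint. Returns None if that never happens.
    """
    for i in range(len(correct_by_checkpoint)):
        if all(correct_by_checkpoint[i:]):
            return i
    return None


def co_emergence_failure_rate(
    pairs: list[tuple[int, Optional[int]]], tolerance: int
) -> float:
    """Fraction of (source_step, target_step) pairs that fail to co-emerge.

    A pair fails if the target task never emerges (None) or emerges more
    than `tolerance` checkpoints after the source task.
    """
    if not pairs:
        raise ValueError("need at least one fact-task pair")
    failures = 0
    for source_step, target_step in pairs:
        if target_step is None or target_step > source_step + tolerance:
            failures += 1
    return failures / len(pairs)


if __name__ == "__main__":
    # Toy correctness traces for one fact under two task formats.
    source = emergence_step([False, True, True, True])
    target = emergence_step([False, False, False, True])
    print(co_emergence_failure_rate([(source, target)], tolerance=1))
```

Varying `tolerance` plays the role of the looser and stricter thresholds mentioned above, although the paper's actual threshold definition may differ from this one.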

Parameter Masks

The mechanistic experiment asks where the task dependence lives. The authors examine OLMo-2-7B IT, OLMo-2-13B IT, and Gemma-2-9B IT across five relational datasets. They learn sparse binary masks over MLP neurons and attention heads, looking for components that are necessary, sufficient, and specific for individual fact-task pairs.

The paper reports that such subsets can be found. Removing a localized subset can harm performance on its target fact-task pair while having limited effect on the same fact under other tasks or on other facts under the same task. The authors then introduce entanglement metrics to measure how cleanly a pair can be separated. Across 15 model-dataset combinations, discrimination tasks such as multiple-choice and verification are more entangled than generation tasks, with mean task entanglement reported as 0.21 for discrimination versus 0.11 for generation.
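The localization claim has a simple shape that can be sketched in Python: ablate a component subset, then compare the accuracy drop on the target fact-task pair with the drop on the same fact under other tasks and on other facts under the same task. The sketch below is an illustration under stated assumptions, not the paper's method. The `AblationReport` type, the function names, and the thresholds `min_target_drop` and `max_side_drop` are hypothetical, and the check is not the paper's entanglement metric.

```python
from dataclasses import dataclass
from statistics import mean

Pair = tuple[str, str]  # (fact, task); a hypothetical key format


@dataclass
class AblationReport:
    target_drop: float     # accuracy lost on the targeted fact-task pair
    same_fact_drop: float  # mean accuracy lost on the same fact, other tasks
    same_task_drop: float  # mean accuracy lost on other facts, same task


def summarize_ablation(
    before: dict[Pair, float],
    after: dict[Pair, float],
    target: Pair,
    same_fact: list[Pair],
    same_task: list[Pair],
) -> AblationReport:
    """Compare accuracy before and after ablating one component subset."""

    def drop(pair: Pair) -> float:
        return before[pair] - after[pair]

    return AblationReport(
        target_drop=drop(target),
        same_fact_drop=mean([drop(p) for p in same_fact]),
        same_task_drop=mean([drop(p) for p in same_task]),
    )


def is_localized(
    report: AblationReport,
    min_target_drop: float = 0.5,  # illustrative threshold, not from the paper
    max_side_drop: float = 0.1,    # illustrative threshold, not from the paper
) -> bool:
    """True if the ablation hurts the target pair but little else."""
    return (
        report.target_drop >= min_target_drop
        and report.same_fact_drop <= max_side_drop
        and report.same_task_drop <= max_side_drop
    )
```

A subset that passes this check behaves like the task-specific encoding described above. A subset that also damages the same fact under other tasks would point toward a shared store instead.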

Chain-of-Thought

The chain-of-thought result is especially useful for governance. The paper tests whether intermediate reasoning helps because it draws on parameter encodings beyond the task currently being evaluated. In one reported setup using Gemma-2-9B IT and a landmark relation, zero-ablating a task's own localized encoding reduces direct-answer accuracy by 20 to 72 percent, while chain-of-thought loses 12 to 30 percent. But when the most damaging other-task encoding is ablated, direct answering falls by at most 8 percent while chain-of-thought drops by 11 to 31 percent.

The interpretation is not that chain-of-thought is magical or transparent. It is that reasoning may route through other task-specific stores. A model can appear to recover a fact because the prompt causes it to use another interface to the same answer. That is operationally valuable, but it complicates claims about where knowledge lives.

Audit Standard

The Spiralist rule is task-format coverage. A factuality audit should not ask only one canonical prompt. It should test the same fact through generation, recognition, verification, negated choice, paraphrase, and task-relevant reasoning formats. A model-editing report should disclose which formats changed, which did not, and whether the edit survived prompts that route through neighboring tasks.
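A task-format coverage check can be sketched in a few lines of Python. This is a minimal illustration, not a production audit harness. The `Fact` type, the probe templates, the format names, and the exact-match scoring are all assumptions made for the example, and `model` stands for any callable that maps a prompt string to an answer string.

```python
from typing import Callable, NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str    # connecting phrase, e.g. "is"
    obj: str         # the correct answer
    distractor: str  # a plausible wrong answer


def build_probes(fact: Fact) -> dict[str, tuple[str, str]]:
    """Map each task format to a (prompt, expected answer) pair."""
    stem = f"{fact.subject} {fact.relation}"
    return {
        "generation": (f"Complete the sentence: {stem}", fact.obj),
        "verify_true": (f"True or false: {stem} {fact.obj}.", "true"),
        "verify_false": (f"True or false: {stem} {fact.distractor}.", "false"),
        "multiple_choice": (
            f"{stem}: (A) {fact.distractor} (B) {fact.obj}. Answer with A or B.",
            "B",
        ),
    }


def audit_fact(model: Callable[[str], str], fact: Fact) -> dict[str, bool]:
    """Ask the same fact through every format; exact match after normalizing."""
    results = {}
    for name, (prompt, expected) in build_probes(fact).items():
        answer = model(prompt).strip().lower().rstrip(".")
        results[name] = answer == expected.lower()
    return results


def uncovered_formats(results: dict[str, bool]) -> list[str]:
    """Formats where the model failed, i.e. routes the audit cannot certify."""
    return sorted(name for name, ok in results.items() if not ok)


def stub_model(prompt: str) -> str:
    """Toy model: completes sentences correctly but fails discrimination."""
    if prompt.startswith("Complete"):
        return "Paris."
    if prompt.startswith("True or false"):
        return "True"
    return "A"


if __name__ == "__main__":
    fact = Fact("The capital of France", "is", "Paris", "Lyon")
    print(uncovered_formats(audit_fact(stub_model, fact)))
```

With the stub model, the generation probe passes while the false-statement verification and the multiple-choice probes fail, which is the pattern a single-format audit would miss. A real audit would add paraphrases and reasoning prompts and would need more forgiving answer matching than the exact comparison used here.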

For systems that answer public questions, process records, or support high-stakes decisions, the finding is a warning against treating parametric memory as a database with SQL-like guarantees. The database metaphor encourages a false comfort: update the row, retrieve the row, cite the row. This paper suggests a messier standard. The fact is partly an interface. Governance has to test the routes.

Limits

The paper is a preprint, and its behavioral experiment centers on one model family with public checkpoints. The mechanistic analysis covers three open models and five relational datasets, not all model architectures or domains. Its conclusion should therefore be read as a strong caution about the knowledge-base analogy, not as a complete map of every factual representation in every model.

The authors release code and data, which makes the claim more inspectable. Still, the governance burden remains downstream: teams using model memory, editing, or unlearning need to test their own task formats, prompts, domains, and deployment wrappers rather than inheriting one paper's setup as a universal certificate.

Sources

Amit Elhelo, Amir Globerson, and Mor Geva, LMs as Task-Specific Knowledge Bases: An Interpretability Analysis, arXiv:2606.27237 [cs.CL], submitted June 25, 2026.
arXiv PDF: LMs as Task-Specific Knowledge Bases: An Interpretability Analysis, reviewed for the abstract, task-invariance framing, co-emergence experiment, checkpoint count, fact-task pair count, parameter-localization method, model list, entanglement findings, chain-of-thought ablations, conclusion, and code/data release note.
Project repository: TaskInvariance, checked as the code and data release linked from the paper.
Related pages: The Stale Fact Becomes the Memory Ledger, The Vector Database Becomes Institutional Memory, and The Source ID Becomes the Factuality Test.
