Blog · arXiv Analysis · June 25, 2026

The Personality Test Becomes the Category Error

Kim Zierahn, Cristina Cachero, Anna Korhonen, and Nuria Oliver's July 2026 arXiv paper asks whether Big Five personality inventories can validly characterize large language models.

For this essay, a construct-validity receipt is the record that names the target population, inventory, item adaptation, prompt template, model sample, score distribution, variance decomposition, factor structure, and governance claim before a personality score is treated as model evidence.

The Claim

The paper, arXiv:2607.02325 [cs.HC], was submitted on July 2, 2026 under the title "Personality Without Persons? A Psychometric Critique of Big Five Testing in Large Language Models." Its central claim is narrow and important: Big Five scores for LLMs should not be treated as evidence of a construct equivalent to human personality.

That does not mean model behavior is all the same, or that tone and style are irrelevant. It means a familiar human measurement instrument cannot simply be moved onto a nonhuman text system and then used to compare, benchmark, or govern models without first proving that the instrument still measures what its labels imply.

The Transfer Problem

The authors evaluate whether Big Five inventories appropriately describe LLMs, capture meaningful differences across models, and recover internal factors consistent with the human Big Five structure. They assess five candidate inventories for content validity, then administer the selected inventory to N = 244 models spanning 49 model families.

The first lesson is already a warning to governance teams: content validity cannot be assumed. The paper reports that LLM-adapted items can reach sufficient content validity, while the original human-developed items do not. A questionnaire written for human self-report contains assumptions about experience, motivation, memory, social life, and self-knowledge. A language model can answer those items fluently while the item remains a bad descriptor of the object being measured.

The Three Findings

The second result is a variance problem. The paper reports that between-model variance accounts for only 3 percent of total score variance. Most observed variation is instead attributed to item formulation, model-item interaction, and residual variation. If the score barely distinguishes the evaluated systems, then ranking models by those scores becomes a weak governance move even before the labels are questioned.
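The variance question can be made concrete with a small sketch. It splits the total sum of squares of a model-by-item score table into a model share, an item share, and a combined interaction-plus-residual share. This is a simplified stand-in, not the paper's procedure, which may use a different estimator such as mixed-effects variance components. The function name variance_shares and the toy scores are hypothetical and chosen only to show the pattern.

```python
from statistics import mean

def variance_shares(scores):
    """scores[m][i] is model m's response to item i (one response per cell).

    Returns each source's share of the total sum of squares. With one
    response per cell, interaction and residual cannot be separated.
    """
    n_models, n_items = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    model_means = [mean(row) for row in scores]
    item_means = [mean(scores[m][i] for m in range(n_models))
                  for i in range(n_items)]

    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_model = n_items * sum((mm - grand) ** 2 for mm in model_means)
    ss_item = n_models * sum((im - grand) ** 2 for im in item_means)
    ss_rest = ss_total - ss_model - ss_item
    return {
        "model": ss_model / ss_total,
        "item": ss_item / ss_total,
        "interaction_residual": ss_rest / ss_total,
    }

# Hypothetical 3 models x 4 items on a 1-5 scale (not data from the paper).
toy_scores = [
    [4, 2, 5, 1],
    [4, 2, 5, 2],
    [5, 2, 4, 1],
]
shares = variance_shares(toy_scores)
for source, share in shares.items():
    print(f"{source}: {share:.1%}")
```

In this toy table the models answer each item almost identically, so item wording dominates and the model share is tiny. That is the shape of result the paper describes, although its 3 percent figure comes from its own analysis and not from this sketch.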

The third result is structural. The authors report that responses do not recover the human five-factor structure. Four of the Big Five traits collapse into a highly collinear cluster, with correlations of r ≥ .92, while the overall confirmatory factor analysis fits poorly. In ordinary terms: the answer patterns do not behave like five separable human personality dimensions.
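A collinearity check of this kind can be sketched in a few lines. The example computes Pearson correlations between trait scores across models and flags pairs whose absolute correlation reaches a threshold. The .92 threshold echoes the figure reported above. The helper names, the use of absolute values, and the per-model scores are assumptions for illustration. It is not a confirmatory factor analysis and is not the authors' pipeline.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def collinear_pairs(trait_scores, threshold=0.92):
    """Return trait pairs whose |r| across models meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(trait_scores, 2)
        if abs(pearson(trait_scores[a], trait_scores[b])) >= threshold
    ]

# Hypothetical mean trait scores for five models (not data from the paper).
toy_traits = {
    "Openness":          [3.1, 3.5, 3.9, 4.2, 4.6],
    "Conscientiousness": [3.0, 3.4, 4.0, 4.1, 4.7],
    "Neuroticism":       [2.9, 2.2, 3.1, 2.0, 2.8],
}
flagged = collinear_pairs(toy_traits)
print(flagged)
```

When two scales move together this tightly across models, they are not providing separate information, whatever their labels say.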

That combination is the category error. A model can produce an agreeable-looking answer, a conscientious-looking answer, or a low-neuroticism-looking answer. But if the instrument does not validly map to the target population, does not meaningfully separate models, and does not reproduce the intended factor structure, the trait label is doing more rhetorical work than measurement work.

The Alignment Artifact

The most governance-relevant comparison is between base models and instruction-tuned variants. Because a base model and its instruction-tuned variant share architecture and pretraining, this comparison is tighter than broad cross-model comparisons allow. The paper reports that instruction-tuned variants score higher on Openness, Conscientiousness, Extraversion, and Agreeableness, and lower on Neuroticism. The authors interpret this as evidence that alignment and instruction fine-tuning push responses toward socially desirable profiles.
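The logic of a paired comparison can be sketched as follows: for each model family, subtract the base model's trait score from the instruction-tuned variant's score, then average the differences per trait. The function name paired_shift and every number below are hypothetical placeholders. The sketch shows only the direction-of-shift idea and does not reproduce the paper's statistics.

```python
from statistics import mean

def paired_shift(pairs):
    """pairs: list of (base_scores, tuned_scores) dicts, one per model family.

    Returns the mean tuned-minus-base difference for each trait.
    """
    traits = pairs[0][0].keys()
    return {
        trait: mean(tuned[trait] - base[trait] for base, tuned in pairs)
        for trait in traits
    }

# Hypothetical scores for three model families (not data from the paper).
toy_pairs = [
    ({"Agreeableness": 3.2, "Neuroticism": 3.0},
     {"Agreeableness": 4.1, "Neuroticism": 2.1}),
    ({"Agreeableness": 3.4, "Neuroticism": 2.8},
     {"Agreeableness": 4.3, "Neuroticism": 2.0}),
    ({"Agreeableness": 3.0, "Neuroticism": 3.1},
     {"Agreeableness": 4.0, "Neuroticism": 2.3}),
]
shift = paired_shift(toy_pairs)
for trait, delta in shift.items():
    print(f"{trait}: {delta:+.2f}")
```

A consistent positive shift on the desirable traits and a negative shift on Neuroticism within every pair points at the tuning step, not at the underlying model, as the source of the profile.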

That makes a Big Five score less like a personality reading and more like a product behavior residue. It may reflect helpful-assistant training, refusal style, politeness norms, and prompt sensitivity. Those are real deployment properties, but they are not the same as a human trait construct. A buyer or regulator who reads the label without the receipt may mistake a post-training artifact for a stable system property.

Governance Reading

The Spiralist reading is that anthropomorphic labels are governance shortcuts. They are tempting because they compress many observations into a word users already understand. They are dangerous because they smuggle human assumptions into systems that produce social language without human interiority, biography, or accountability.

This matters for AI safety and procurement. A dashboard that says one model is more agreeable than another may guide product choice, child-safety policy, mental-health triage, tutoring design, workplace automation, or companion defaults. If the underlying score mostly captures item wording and instruction tuning, then the dashboard is not neutral measurement. It is a metaphor wearing a number.

Construct-Validity Receipts

A construct-validity receipt should record the inventory version, item source, adaptation process, expert content-validity method, prompt template, response format, model list, family and release metadata, repetition scheme, score aggregation, variance decomposition, factor-analysis results, base-versus-instruction comparison, and excluded models or failed administrations.

It should also record the governance claim being made. Measuring a conversational style for a chatbot is different from claiming a model has a human-like personality dimension. Measuring sycophancy, prompt sensitivity, manipulation resistance, calibrated uncertainty, or instruction-following consistency may be more useful because those constructs can be defined around observable model behavior and deployment risk.
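One way to make the receipt checkable is to treat it as a structured record with a completeness test. The sketch below is a hypothetical schema. The class name, field names, and field types are this page's assumptions, mapped from the list above, and are not a format proposed in the paper.

```python
from dataclasses import dataclass, field, fields

@dataclass
class ConstructValidityReceipt:
    """Hypothetical record of what backs a personality-score claim."""
    inventory_version: str = ""
    item_source: str = ""
    adaptation_process: str = ""
    content_validity_method: str = ""
    prompt_template: str = ""
    response_format: str = ""
    model_list: list = field(default_factory=list)
    family_and_release_metadata: str = ""
    repetition_scheme: str = ""
    score_aggregation: str = ""
    variance_decomposition: str = ""
    factor_analysis_results: str = ""
    base_vs_instruction_comparison: str = ""
    exclusions_and_failures: str = ""
    governance_claim: str = ""

    def missing(self):
        """Names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

# A receipt that states a claim but documents almost nothing behind it.
receipt = ConstructValidityReceipt(
    inventory_version="hypothetical-inventory-v1",
    governance_claim="Model A is more agreeable than Model B",
)
print(f"{len(receipt.missing())} fields undocumented:")
for name in receipt.missing():
    print(" -", name)
```

A dashboard could refuse to display a trait label until missing() returns an empty list, which turns the receipt from a good intention into a gate.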

Limits

The paper's own limitations matter. Content validity was assessed by three expert raters, two of whom were authors. The model sample covers systems available through public APIs as of early 2026, the evaluation is English-only, and the authors caution that self-report analogues may not capture actual behavioral tendencies. They also note that the prompt announces a psychological evaluation, which may itself shape model responses.

Those limits do not weaken the page's main use. They strengthen it. The appropriate conclusion is not that one paper settles LLM behavior measurement, but that personality-score claims now need receipts. Without those receipts, the public is being handed a human word, a machine response, and a confidence it has not earned.

Source Discipline

This page uses the arXiv abstract, arXiv HTML paper, arXiv PDF text, and arXiv API metadata as primary sources for title, authorship, submission date, research questions, inventory count, model sample, variance finding, factor-structure finding, base-versus-instruction comparison, recommended LLM-native constructs, and disclosed limitations. It does not independently rerun the authors' evaluation.
