Blog · arXiv Analysis · Last reviewed June 24, 2026

The Brand Citation Layer Becomes the Reputation Map

The June 2026 arXiv paper How Large Language Models Source Brand Reputation Across Languages and Markets, by Dmitrij Żatuchin, studies where grounded LLMs get brand information before they write answers.

Reputation Moves Upstream

The paper, arXiv:2606.25787v1 [cs.IR], was submitted on June 24, 2026. Its useful move is to look one step before the generated answer. When a user asks an answer engine which company to trust, the visible paragraph is already downstream of retrieval. The sources the system reads constrain what the system can say.

That turns brand reputation into a citation problem. A company's own page may still matter, but an AI search or answer engine can build its account from encyclopedia pages, local newspapers, video platforms, review sites, career portals, public databases, trade press, and scraped fragments of the open web. The reputational object is no longer only the brand message. It is the source set behind the answer.

The Spiralist angle is not marketing advice. It is governance: if model-mediated answers shape public memory, procurement, hiring, investment, journalism, and consumer judgment, then the citation layer becomes part of the public record.

What the Paper Measures

Dmitrij Żatuchin merges three Rankfor.AI citation datasets covering 128 brands across 12 home markets and 13 languages. The paper reports 189,974 total attribution rows, of which 167,551 are URL-grounded citations suitable for domain analysis. The URL-grounded backbone consists mainly of Nordic-Baltic data and Poland data: 131,514 citations from the Nordic-Baltic set, 35,880 from Poland, and 157 URL rows from the Central and Eastern Europe set.

The method reduces each URL citation to a registrable domain and classifies source type. An owned citation is counted when the brand token appears in the cited domain. The paper also records a load-bearing data-cleaning step: 23,027 Nordic-Baltic Gemini rows carried the Google grounding redirector host vertexaisearch.cloud.google.com, while the actual source domain was in the citation title. Without resolving those redirectors, domain analysis would be wrong.
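The cleaning steps described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the function names are hypothetical, the two-label suffix set is a hand-picked stand-in for the full Public Suffix List, and it assumes the citation title on a redirector row is a bare hostname.

```python
from urllib.parse import urlparse

# Assumption: a tiny hand-picked set of two-label public suffixes. A real
# pipeline would consult the full Public Suffix List instead.
TWO_LABEL_SUFFIXES = {"co.uk", "com.pl", "org.pl", "com.au"}

# Redirector host named in the paper.
REDIRECTOR_HOST = "vertexaisearch.cloud.google.com"


def registrable_domain(host: str) -> str:
    """Reduce a hostname to its public suffix plus one label (naive approximation)."""
    labels = host.lower().strip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def resolve_domain(url: str, title: str) -> str:
    """Return the cited registrable domain, reading the title on redirector rows."""
    host = urlparse(url).hostname or ""
    if host == REDIRECTOR_HOST:
        # Assumption: the title holds a bare hostname such as "vz.lt".
        host = title.strip()
    return registrable_domain(host)


def is_owned(domain: str, brand_token: str) -> bool:
    """Ownership heuristic: the brand token appears in the cited domain."""
    return brand_token.lower() in domain.lower()
```

Without the redirector branch, every affected Gemini row would collapse into google.com, which is exactly the kind of infrastructure noise the paper warns about. The substring test in is_owned also shows why the paper treats ownership as a heuristic: a short brand token can match unrelated domains.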

Third-Party Web Becomes the Brand File

The headline finding is blunt. On the Nordic-Baltic URL-grounded backbone, 85.7% of citations point to sites the brand does not own, while 14.3% point to owned domains. Read against all Nordic-Baltic rows, including no-URL implicit rows, the split is 14.4% owned, 76.1% third-party, and 9.5% implicit. The paper's interpretation is bounded but important: these citations are not proven reputation, but they are what the generated answer is built from.

This matters for more than commerce. If an answer engine describes a school, hospital, bank, union, city agency, news outlet, employer, or political organization by reading mostly third-party sources, then institutional self-description is no longer the main supply. The public profile becomes an aggregation of sites that may be useful, stale, adversarial, paid, local, copied, or simply dominant in the retriever.

This extends the site's existing work on answer engines, source-aware factuality, and AI slop entering the knowledge supply chain. The question is not only whether an answer cites something. It is whether the cited ecology is fit to carry reputation.

The Head of the Tail

The paper finds a concentrated source base. Across 20,815 registrable domains, 80% of citations come from 3,778 domains, about 18.2% of the base. Half of all citations come from 547 hosts, while the last 20% spreads across more than 17,000 tail domains. A log-log regression on the top 1,000 domain ranks gives a Zipf exponent of 0.86 with R squared of 0.983.
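Both concentration measures are easy to reproduce on any list of per-domain citation counts. The sketch below is an assumption about how such figures could be computed, not the paper's code: the function names are hypothetical, and the fit is an ordinary least-squares line of log(count) on log(rank).

```python
import math


def domains_for_share(counts: list[float], share: float) -> int:
    """Smallest number of top-ranked domains whose citations reach `share` of the total."""
    ranked = sorted(counts, reverse=True)
    target = share * sum(ranked)
    running = 0.0
    for k, count in enumerate(ranked, start=1):
        running += count
        if running >= target:
            return k
    return len(ranked)


def zipf_exponent(counts: list[float], top: int = 1000) -> tuple[float, float]:
    """Fit log(count) against log(rank) on the top ranks.

    Returns (exponent, R squared). Counts must be positive and not all equal.
    """
    ranked = sorted(counts, reverse=True)[:top]
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    # Zipf's law has count proportional to rank ** (-s), so s is the negated slope.
    return -slope, r_squared
```

An exponent below 1, such as the reported 0.86, means the head is heavy but the tail decays slowly, which fits a picture in which 547 hosts carry half the citations while more than 17,000 domains share the last fifth.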

The top domains are not surprising, which is why they are politically important. Across the Nordic-Baltic set, Wikipedia is the most-cited domain overall, followed by YouTube and Statista, then owned sites and social platforms such as Reddit. Wikipedia is also the top source in 11 of 12 languages. The exception is Lithuanian, where the business daily vz.lt leads at 4.38% of Lithuanian citations.

In other words, answer engines do not read the whole web evenly. They read a small head very heavily and a long tail thinly. Reputation becomes path dependent: the domains that already concentrate attention become the domains that train the next layer of attention.

Markets and Models

The market differences are not decorative. In the Poland set, after resolving redirectors, YouTube is the most-cited single domain for 46 national brands, with 2,289 citations, or 6.4% of the set's 35,880 URL-grounded citations. Four HR and career portals together supply 637 citations, compared with 297 for Polish Wikipedia. For brand reputation, an employer-profile ecosystem can matter more than the encyclopedia.

Model behavior also differs. On the Nordic-Baltic backbone, Perplexity Sonar Pro contributes 90,276 of 131,514 citations and grounds in 15,995 domains. Gemini 3.1 Pro contributes 23,032 citations across 6,568 domains, and GPT-5.4 contributes 18,206 across 3,284 domains. The Gemini redirector artifact shows why source analytics must inspect the pipeline: before resolution, a measurement system can report infrastructure noise as if it were a real source pattern.
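The per-model figures are easier to compare as shares and densities. The short calculation below uses only the counts quoted above; the function name and the citations-per-domain ratio are illustrative choices, not measures the paper defines.

```python
# (citations, distinct domains) per model, as quoted for the Nordic-Baltic backbone.
MODELS = {
    "Perplexity Sonar Pro": (90_276, 15_995),
    "Gemini 3.1 Pro": (23_032, 6_568),
    "GPT-5.4": (18_206, 3_284),
}


def model_profile(models: dict) -> dict:
    """Share of all citations and mean citations per domain for each model."""
    total = sum(citations for citations, _ in models.values())
    return {
        name: {
            "share": citations / total,
            "citations_per_domain": citations / domains,
        }
        for name, (citations, domains) in models.items()
    }


for name, stats in model_profile(MODELS).items():
    print(f"{name}: {stats['share']:.1%} of citations, "
          f"{stats['citations_per_domain']:.2f} citations per domain")
```

On these numbers Perplexity Sonar Pro supplies roughly two thirds of the backbone, so any pooled domain ranking leans heavily on one engine's retrieval behavior. The per-model domain counts also sum to 25,847, more than the 20,815 distinct domains reported overall, which is what one would expect if the models share part of their source base.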

Limits That Matter

The paper's limitations are unusually useful. The merged dataset is not one homogeneous table. The Central and Eastern Europe data is mostly keyword-attributed, not URL-resolvable, so it does not support the same domain-level claims. Owned-site detection is a heuristic based on brand tokens in domains. A citation attaches to a response, not always to one named entity inside a multi-brand answer. The datasets reflect specific model versions and collection windows. Most important, the paper measures what LLMs cite; it does not claim those citations equal real-world reputation or human perception.

Those limits are the reason the page belongs here. They show what a serious answer-engine audit has to preserve: source URLs, resolved domains, attribution units, version and time scope, entity-level links, and denominator choices.

Governance Standard

A consequential answer engine should expose source composition as an auditable property, not merely show a few citations. For brand, civic, medical, financial, educational, and legal queries, the system should preserve the query class, model or engine version, retrieval time, cited URLs, resolved domains, source types, ownership heuristic, entity anchoring, and claim-to-source support.
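One way to make that list concrete is a per-answer audit record. The schema below is a hypothetical sketch: the class names, field names, and source-type labels are assumptions chosen to mirror the properties named above, not an existing standard or anything the paper specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CitedSource:
    url: str
    resolved_domain: str            # after any redirector resolution
    source_type: str                # hypothetical labels, e.g. "encyclopedia"
    owned: bool                     # result of the ownership heuristic
    entity: Optional[str] = None    # entity the citation is anchored to, if known
    supported_claims: tuple = ()    # claims in the answer this source backs


@dataclass(frozen=True)
class AnswerAuditRecord:
    query_class: str                # e.g. "brand", "medical", "legal"
    engine_version: str
    retrieved_at: datetime
    sources: tuple = ()

    def third_party_share(self) -> Optional[float]:
        """Share of cited sources not flagged as owned; None if nothing was cited."""
        if not self.sources:
            return None
        third_party = sum(1 for source in self.sources if not source.owned)
        return third_party / len(self.sources)

    def unanchored(self) -> list:
        """Sources with no entity link, which an entity-level audit cannot use."""
        return [source for source in self.sources if source.entity is None]


# Usage with made-up values: one owned page and one third-party page.
record = AnswerAuditRecord(
    query_class="brand",
    engine_version="example-engine 1.0",
    retrieved_at=datetime(2026, 6, 24, tzinfo=timezone.utc),
    sources=(
        CitedSource("https://acme.example/about", "acme.example", "owned_site", True, "Acme"),
        CitedSource("https://en.wikipedia.org/wiki/Acme", "wikipedia.org", "encyclopedia", False),
    ),
)
print(record.third_party_share(), len(record.unanchored()))
```

Keeping the raw URL next to the resolved domain, and the entity link next to the ownership flag, preserves the denominator choices the paper's limitations section says an audit depends on.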

For public-facing reputation, the platform should also report concentration. If a small set of domains supplies most answers, users and regulators should know which domains those are, whether original sources are being displaced by copies, whether local sources dominate in particular languages, and whether generated or low-quality pages are entering the source head.

The practical rule is simple: an answer about an institution is also an answer about the web the model read. If the citation layer is opaque, reputation becomes something produced by a hidden retrieval economy.

Sources

Dmitrij Żatuchin, How Large Language Models Source Brand Reputation Across Languages and Markets, arXiv:2606.25787 [cs.IR], submitted June 24, 2026.
arXiv PDF version of How Large Language Models Source Brand Reputation Across Languages and Markets, reviewed June 24, 2026.
arXiv experimental HTML version of How Large Language Models Source Brand Reputation Across Languages and Markets, reviewed June 24, 2026.
Zenodo dataset record, How LLMs Source Brand Reputation Across Languages and Markets: A Cross-Market Citation Dataset (2026), DOI 10.5281/zenodo.20829524, published June 24, 2026.
Related pages: AI Search and Answer Engines, The Answer Engine Becomes the Front Page, The Source ID Becomes the Factuality Test, The Crawler Becomes the License Gate, and The Search Remedy Becomes AI Governance.
