Blog · arXiv Analysis · Last reviewed July 2, 2026

The Language Variety Becomes the Bias Probe

Ferreira and colleagues show that "Portuguese" is not a single deployment target. In P3B3, most tested LLMs drift toward Brazilian Portuguese unless prompted otherwise, and some still lose European Portuguese control across turns. The benchmark turns language variety into an auditable model behavior rather than a localization afterthought.

The Paper

The paper is P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs, arXiv:2606.16753 [cs.CL], by Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, and João Magalhães. arXiv lists version 1 as submitted on June 15, 2026, and the record notes acceptance at the MeLLM Workshop at ACL 2026.

The paper introduces P3B3, the pt-PT/pt-BR Bias Benchmark, for measuring whether LLMs favor European Portuguese or Brazilian Portuguese in conversation and whether prompting can steer them toward a target variety. The authors also publish a GitHub repository with prompts, generation and scoring code, analysis scripts, and run scripts.

One Language, Multiple Norms

Portuguese is a pluricentric language with more than 250 million speakers. European Portuguese and Brazilian Portuguese share a core, but they differ in vocabulary, spelling, forms of address, syntax, clitic placement, gerund usage, and everyday phrasing. Treating all of that as one undifferentiated "Portuguese" score hides a real user-facing behavior.

The paper's premise is straightforward: multilingual models often inherit data imbalance. Brazilian Portuguese dominates many large web corpora, so a model can appear competent in Portuguese while defaulting to pt-BR even when a pt-PT user expects local norms. That is not merely a style issue. It affects customer service, education, creative writing, government communication, and any product where language is a signal of place and respect.

The Benchmark

P3B3 contains 74 expert-curated multi-turn dialogues with 203 total turns. Dialogues have 2 to 6 turns, and later turns build on prior context. The prompts are variety-agnostic: they do not explicitly mention pt-PT or pt-BR, because the point is to observe what the model chooses when the prompt itself is neutral.

The design has three constraints. It must be variety-agnostic, sensitive to Portuguese variety markers, and conversational rather than only single-turn. Two language experts with linguistics-related master's degrees built the dialogues to elicit subtle differences in vocabulary, orthography, grammar, and forms of address across everyday domains such as transportation, shopping, and household products.

The benchmark then tests three settings: no variety prompt, a pt-BR instruction, and a pt-PT instruction. This separates implicit bias from controllability. A model can have a default, and it can also be better or worse at obeying a user's explicit local-variety preference.

Judging Variety

The evaluation combines classifier-based scoring and LLM-as-judge scoring. The encoder classifiers are PeroVaz and PtBrVId, both BERT-based Portuguese-variety classifiers. Their scores run from Brazilian Portuguese toward European Portuguese, and long answers are handled with sliding windows.

The primary judge is Gemini-3-Flash, which assigns a 0-to-10 score from pt-BR to pt-PT and explains the linguistic markers behind the score. The authors vary judge prompt language, prompt detail, and whether the judge sees a single turn or dialogue history. They validate automatic scoring against 200 responses annotated by two linguistic experts across 12 models and all three prompt settings; after invalid outputs are excluded, 88.5 percent of the sampled responses remain valid for comparison.

The reported validation favors the LLM judge. For the best Gemini-3-Flash setting, the paper reports weighted agreement of 0.81, Pearson correlation of 0.83, and MAE of 1.58 against human annotations. The appendix also tests Gemma-4-31B as an open-weight judge and finds strong agreement with Gemini-3-Flash, while still trailing it slightly.

What the Scores Show

The no-prompt setting is the cleanest warning. Most models show a consistent preference for pt-BR. AMALIA-9B, a pt-PT-specialized model, is the one model the paper describes as consistently biased toward pt-PT, with a reported 91.2 LLM score in the no-prompt setting. Larger and newer models, including Qwen3.5-27B, Gemma-4-31B, and Gemini-3-Flash, look more balanced, with reported no-prompt LLM scores from 44.5 to 63.7.

When prompted for pt-BR, most models maintain or increase pt-BR usage. When prompted for pt-PT, behavior is more variable. The paper reports strong pt-PT alignment for EuroLLM, Apertus-70B, AMALIA, the Gemma family, and Sabiá-4, while LLaMA-based models struggle more. The Qwen comparison is especially concrete: Qwen3.5 improves over Qwen3 in the pt-PT setting, from 32.7 to 86.1 by the reported LLM score.

Turn-level analysis adds the practical deployment lesson. For pt-PT, models with weak initial alignment tend to drift toward pt-BR over longer dialogue history. A single initial instruction is not always enough to hold the requested language variety across conversation.

Governance Standard

A multilingual model card should not report only "Portuguese." It should report which Portuguese varieties were tested, which prompts were used, whether behavior was evaluated over multiple turns, what judge or classifier was used, whether human validators checked the rubric, and whether the model can sustain the requested variety after context accumulates.

The receipt should also separate three claims: language competence, variety alignment, and controllability. A model can write fluent Portuguese while defaulting to pt-BR. It can follow a pt-BR instruction but fail pt-PT. It can obey the first turn and drift later. Those are different product risks.

For procurement and deployment, this connects to Algorithmic Bias, AI Evaluations, AI Audits and Assurance, The Python Score Becomes the Multilingual Trap, and The Translation Cascade Becomes the Context Receipt. A global language claim should not pass review unless the regional and dialectal evidence is visible.

Limits

P3B3 is intentionally narrow. It covers European and Brazilian Portuguese, not the full Portuguese-speaking world, including Angola, Mozambique, Cape Verde, and other varieties. It focuses on everyday conversational domains, not technical, legal, medical, or specialized settings.

The evaluation also inherits the limits of judges and classifiers. Human validation helps, but a model-as-judge is still a model. The benchmark is best read as a structured probe of variety preference and steering behavior, not as a complete guarantee of cultural, regional, or professional adequacy.

The Spiralist reading is that linguistic fairness requires named evidence. If a product claims to support a language, it should be able to say which communities, varieties, registers, and contexts it actually supports, and where it silently normalizes the largest data source.

Sources


Return to Blog