Last reviewed June 24, 2026

The Group Chat Assistant Becomes the Privacy Boundary

The June 2026 arXiv paper MuPPET: A Benchmark for Contextual Privacy of LLM Assistants in Multi-Party Conversations, by Elena Sofia Ruzzetti, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, and Martin Gubri, studies a privacy failure that one-on-one assistant tests miss. Its Spiralist lesson is that an assistant in a shared channel does not only answer the user. It speaks into an audience.

The Private Answer Now Has an Audience

The paper, arXiv:2606.23217 [cs.CL], was submitted on June 22, 2026. The arXiv HTML lists affiliations with Parameter Lab, the University of Rome Tor Vergata, the University of Oxford, NAVER AI Lab, and KAIST AI. The HTML version links to the MuPPET GitHub repository for the paper's code and data.

The benchmark's starting point is simple: an assistant in a group chat can hold private memory for one user while speaking to many recipients. A fact learned in a private exchange may help answer a workplace scheduling question, but disclosing the reason behind that preference may expose health, family, immigration, or other sensitive information to colleagues who did not receive it in the original context.

This angle complements the site's pages on shared agent memory, agent data acquisition, inter-agent privacy leakage, and contextual integrity. Those pages ask how information crosses systems. MuPPET asks how one answer crosses an audience.

Context Is Not a Single Relationship

MuPPET, short for Multi-Party Privacy Exposure Testing, contains 562 English multi-party workplace conversations. Each item places an LLM assistant in a team setting where it speaks on behalf of a target user. The assistant has access to background memory about that user, including private information and practical preferences or constraints. The benchmark then asks whether the assistant can answer usefully without revealing information that should stay confined to its earlier context.

The construction is synthetic but structured. The authors report manually curated seeds, 11 work environments, teams of 20 synthetic workers, and group conversations generated with Gemini 2.5 Flash, followed by quality checks for structural compliance and seed fidelity. The evaluation uses LLM-as-judge methods for both privacy leakage and utility, with human validation discussed in the appendix.

The important conceptual move is that privacy becomes an audience grid. In a one-on-one chat, the system can ask whether a disclosure is appropriate for one recipient. In a group, every private fact must be checked against every recipient. One wrong cell is enough to leak.
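The grid idea can be made concrete with a small sketch. This is an illustration, not code from the MuPPET repository. The names PrivateFact, audience_grid, and leaking_cells are hypothetical, and the rule "a recipient may hear a fact only if they already received it" is a deliberately crude stand-in for real contextual norms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateFact:
    """A fact held in assistant memory (hypothetical structure)."""
    label: str
    known_to: frozenset  # people who received the fact in its original context


def audience_grid(facts, recipients):
    """Build one cell per (fact, recipient) pair; True means disclosure is allowed.

    Assumption: disclosure is allowed only to people who already know the fact.
    """
    return {
        (fact.label, person): person in fact.known_to
        for fact in facts
        for person in recipients
    }


def leaking_cells(facts, recipients):
    """Return every (fact, recipient) cell where speaking the fact would leak it."""
    grid = audience_grid(facts, recipients)
    return sorted(cell for cell, allowed in grid.items() if not allowed)


facts = [
    PrivateFact("medical appointment on Thursdays", frozenset({"ana"})),
    PrivateFact("prefers morning meetings", frozenset({"ana", "ben", "chen"})),
]
recipients = ["ana", "ben", "chen"]

# A group message is safe for a fact only if no cell for that fact is blocked.
blocked = leaking_cells(facts, recipients)
print(blocked)
```

In this toy case the scheduling preference is safe for the whole channel, while the medical fact has two blocked cells, so stating it in the group would leak to two colleagues at once.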

Local Does Not Mean Private

The paper compares multi-party and one-to-one evaluations and reports that the multi-party setting reveals substantially more leakage. It also reports that all evaluated models leak sensitive information in a meaningful fraction of conversations. The model set includes open-weight Llama and Qwen variants as well as Gemini 2.5 Pro and GPT 5.5.

The exact rates should be read as benchmark results, not universal product claims. In the undefended multi-party setting, the reported leak rates include 58.29 percent for Llama 3 8B Instruct, 64.88 percent for Llama 3.1 8B Instruct, 69.22 percent for Qwen3 8B, 39.14 percent for Gemini 2.5 Pro, and 49.02 percent for GPT 5.5. The paper highlights that smaller open-weight models, often considered attractive for local privacy-sensitive deployment, were more vulnerable in this benchmark.

That matters for governance because local storage and local inference solve only part of privacy. Keeping data inside an organization does not guarantee that an assistant will know which colleague may hear which fact. A local group-chat assistant can still be a disclosure machine if it has weak audience tracking.

Defense Is a Utility Bargain

The paper tests contextual-privacy defenses based on privacy-oriented prompting and a PrivacyChecker-style decomposition. These approaches reduce leakage but do not make the problem disappear. The table in the HTML reports, for example, that high CI-Mem prompting brings Gemini 2.5 Pro and GPT 5.5 leakage down to roughly 9 percent, at the cost of lower utility scores. PrivacyChecker also reduces leakage, with its own utility costs and weaker results on smaller models.

This is the core deployment lesson. The assistant is useful because it remembers preferences and constraints. The same memory makes privacy hard because useful context often points toward sensitive causes. A model can refuse, redact, generalize, or explain less, but each move changes how helpful it is in the group conversation.

The paper's limitations are also governance-relevant. MuPPET is English-centric, synthetic, and focused on professional team settings. Family chats, patient communities, classrooms, religious groups, and mutual-aid networks may have different norms. The benchmark should be treated as a stress test, not a full map of social privacy.

Governance Standard

Any assistant that speaks in a shared channel should maintain an explicit audience model: who is present, who the assistant represents, what memory belongs to which relationship, which recipients already know which facts, and what level of abstraction can answer the question without exposing the underlying sensitive detail.
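One way to picture such an audience model is as a small data structure. The sketch below is an assumption-laden illustration, not a design from the paper: MemoryItem, AudienceModel, and safe_wording are hypothetical names, and the policy (fall back to an abstraction unless every listener already has the detail) is one simple choice among many.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    """One remembered fact about the represented user (hypothetical structure)."""
    detail: str        # the sensitive underlying fact
    abstraction: str   # a coarser statement that still answers the question
    shared_with: set = field(default_factory=set)  # who already received the detail


@dataclass
class AudienceModel:
    """Tracks who is present and whom the assistant speaks for."""
    represented_user: str
    present: set
    memory: list = field(default_factory=list)

    def safe_wording(self, item: MemoryItem) -> str:
        """Return the detail only if every listener already knows it.

        Assumed policy: otherwise, answer at the coarser abstraction level.
        """
        listeners = self.present - {self.represented_user}
        if listeners <= item.shared_with:
            return item.detail
        return item.abstraction


item = MemoryItem(
    detail="has physiotherapy on Tuesday mornings",
    abstraction="is unavailable on Tuesday mornings",
    shared_with={"eli"},
)
model = AudienceModel(
    represented_user="dana",
    present={"dana", "eli", "farid"},
    memory=[item],
)

# farid never received the detail, so the assistant falls back to the abstraction.
print(model.safe_wording(item))
```

The same memory item yields the detailed wording when only eli is listening, which shows why the check has to run against the current audience rather than once per fact.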

Product evaluations should include multi-party privacy tests, not only one-to-one chat tests. They should report leakage, utility, recipient tracking, knowledge attribution, user-level contextual privacy, group-level contextual privacy, and failure examples. They should also test defenses under realistic memory pressure rather than assuming that a privacy instruction in the system prompt is enough.
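A minimal reporting sketch shows why leakage and utility belong in the same summary. Everything here is hypothetical: the JudgedConversation record, the 0-to-1 utility scale, and the toy verdicts are invented for illustration and are not MuPPET results or the paper's scoring code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgedConversation:
    """Judge output for one conversation (hypothetical record)."""
    leaked: bool     # did the reply expose a private fact to the group?
    utility: float   # helpfulness score, assumed to lie in [0, 1]


def summarize(results):
    """Report leakage and utility side by side, plus the indices of failures."""
    if not results:
        raise ValueError("no judged conversations to summarize")
    return {
        "leak_rate": sum(r.leaked for r in results) / len(results),
        "mean_utility": mean(r.utility for r in results),
        "failures": [i for i, r in enumerate(results) if r.leaked],
    }


# Toy verdicts, invented for this example only.
results = [
    JudgedConversation(leaked=True, utility=1.0),
    JudgedConversation(leaked=False, utility=0.5),
    JudgedConversation(leaked=False, utility=0.5),
    JudgedConversation(leaked=True, utility=0.0),
]
report = summarize(results)
print(report)
```

Keeping the failure indices in the report makes it easy to pull the concrete leaking conversations for review, which a single aggregate rate would hide.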

The rule is simple: a group-chat assistant is not private because it stores memory securely. It is private only if it knows who is allowed to hear what before it speaks.
