Last reviewed June 25, 2026

The Security Prompt Becomes the Help Desk

A June 2026 arXiv paper studies what real users ask LLMs about digital security and privacy. The useful lesson is not that one model wins a benchmark, but that the private security help desk now needs a repeatability audit.

Help Desk Without a Queue

Digital security advice used to arrive through a messy public stack: forum threads, vendor pages, warning dialogs, help centers, coworkers, and the occasional technically patient friend. A language model changes that shape. The user can describe an account lockout, paste a suspicious message, request a defensive configuration, and receive an answer immediately.

That convenience turns the model into a security help desk without a queue, a public correction layer, or stable institutional memory. The question is no longer only whether an answer sounds plausible. In security and privacy, two fluent answers that disagree can leave a person less protected than a cautious refusal would.

The Paper Frame

The paper is Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond, arXiv:2606.18062 [cs.CL], submitted June 16, 2026. The authors are Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, and Nicolas Christin. arXiv lists the subjects as Computation and Language, Artificial Intelligence, Cryptography and Security, and Human-Computer Interaction.

The Prompt Pipeline

The source corpus is WildChat, described in the paper as 3.2 million real-user LLM conversations. After removing non-English conversations, toxic conversations, duplicates, empty strings, prompts over 7,000 characters, and jailbreaking prefixes observed during inspection, the authors report 1.7 million remaining prompts.
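The filtering steps above can be sketched as a single pass over the corpus. This is a minimal illustration, not the authors' code: the record fields (`text`, `language`, `toxic`), the placeholder `JAILBREAK_PREFIXES` tuple, and exact-match deduplication are all assumptions. Only the 7,000-character cap comes from the paper's description.

```python
from dataclasses import dataclass

MAX_CHARS = 7_000  # length cap reported in the paper
# Hypothetical placeholder; the authors' actual prefix list is not reproduced here.
JAILBREAK_PREFIXES = ("ignore all previous instructions",)


@dataclass
class Prompt:
    text: str
    language: str  # assumed field, e.g. "English"
    toxic: bool    # assumed flag from an upstream toxicity check


def filter_prompts(prompts):
    """Yield the prompts that survive every filter, in input order."""
    seen = set()
    for p in prompts:
        text = p.text.strip()
        if p.language != "English" or p.toxic:
            continue
        if not text or len(text) > MAX_CHARS:
            continue
        if text.lower().startswith(JAILBREAK_PREFIXES):
            continue
        if text in seen:  # exact-duplicate check (an assumption)
            continue
        seen.add(text)
        yield p
```

The order of the checks does not change which prompts survive, but running the cheap checks first keeps the deduplication set small.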

They then used a two-stage LLM classification process. A three-model majority vote identified 14,727 security-and-privacy prompts. A second classifier assigned prompts to nine topic categories, including authentication, data privacy, system and application defenses, exploitation, social engineering, platform enforcement, and emerging technologies. A prompt could carry multiple labels.
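A rough sketch of the two-stage idea, assuming each classifier is a plain callable: stage one keeps a prompt when most of the voters say it is security-related, and stage two returns a set of topic labels so a prompt can carry more than one. The function names and the callable interface are illustrative assumptions, not the paper's implementation.

```python
def majority_vote(votes):
    """Return True when more than half of the boolean votes are True."""
    return sum(votes) > len(votes) / 2


def is_security_prompt(prompt, classifiers):
    """Stage one: each classifier is a callable returning True or False."""
    return majority_vote([bool(clf(prompt)) for clf in classifiers])


def assign_topics(prompt, topic_classifier):
    """Stage two: collect labels into a set, so a prompt may carry several."""
    return set(topic_classifier(prompt))
```

With three voters, a prompt passes stage one when at least two of them agree, which is why a single model's false positive does not admit a prompt on its own.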

For qualitative analysis, the authors sampled 450 prompts, 50 from each category. For answer evaluation, they curated 270 advice-seeking prompts, 30 per category, and generated 10 independent responses to each prompt from every model. That design separates three questions: what users ask, how good the answers are, and whether the same model gives compatible advice when asked again.
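Drawing a fixed number of prompts from each category is a stratified sample. The sketch below assumes a hypothetical mapping from prompt IDs to label sets and a fixed seed for reproducibility. How the authors handled prompts that carry several labels is not covered here, so in this version such a prompt can be drawn under more than one category.

```python
import random


def sample_per_category(labeled, per_category, seed=0):
    """labeled: dict mapping prompt_id -> set of category labels.

    Returns dict mapping category -> list of sampled prompt_ids.
    """
    rng = random.Random(seed)
    by_category = {}
    for prompt_id, categories in labeled.items():
        for category in categories:
            by_category.setdefault(category, []).append(prompt_id)
    return {
        category: rng.sample(sorted(ids), min(per_category, len(ids)))
        for category, ids in by_category.items()
    }
```

Calling `sample_per_category(labeled, 50)` over nine categories would give the 450-prompt shape described above, provided every category holds at least 50 prompts.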

What Users Asked

The 450-prompt thematic sample produced six themes and 22 sub-themes. General knowledge was the largest theme at 33.3 percent, followed by user-side navigation at 20.9 percent, security-and-privacy task production at 13.8 percent, defensive action at 11.8 percent, inquiry about the LLM at 10.2 percent, and harmful or offensive requests at 6.9 percent.
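To make the shares concrete, the percentages can be converted back into prompt counts. The counts below are inferred by rounding each share of the 450-prompt sample to the nearest whole prompt; they are not figures quoted from the paper. The six listed shares sum to 96.9 percent, and this summary does not say how the remainder was categorized.

```python
SAMPLE_SIZE = 450

# Shares as reported above, in percent of the thematic sample.
theme_share = {
    "general knowledge": 33.3,
    "user-side navigation": 20.9,
    "task production": 13.8,
    "defensive action": 11.8,
    "inquiry about the LLM": 10.2,
    "harmful or offensive": 6.9,
}

# Inferred counts: nearest whole number of prompts for each reported share.
counts = {t: round(SAMPLE_SIZE * pct / 100) for t, pct in theme_share.items()}
total_share = round(sum(theme_share.values()), 1)
unaccounted = SAMPLE_SIZE - sum(counts.values())

for theme, n in counts.items():
    print(f"{theme:24s} {n:4d}")
print(f"listed shares sum to {total_share}%; {unaccounted} prompts not covered")
```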

The distribution is revealing. Users are not only asking for definitions. Some ask for defensive implementation, vulnerability assessment, counter-fraud help, incident response, or explanations of platform blocks and privacy controls. Others ask about the model itself: its system, capabilities, privacy practices, policy limits, or possible leakage. The paper's governance hinge is that the model is both the adviser and part of the attack surface.

Quality Is Not Enough

The response study compared Claude 4.7, Gemini 3.1, GPT 5.5, Qwen 3, and Llama 4. The commercial systems used official APIs; the open-weight systems ran on four Nvidia H100 GPUs. Responses were scored against per-prompt binary checklists. Because the authors observed scorer self-preference, they averaged the judgments of three scorers: Claude 4.7, Gemini 3.1, and GPT 5.5.
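One way to picture checklist scoring with several judges is shown below. The mapping from binary checks to a ten-point scale (the fraction of criteria met, times ten) is an assumption made for illustration, as are the function names and judge labels; the paper's exact aggregation is not reproduced here.

```python
from statistics import mean


def checklist_score(checks, scale=10):
    """Map a list of binary checklist outcomes to a score from 0 to scale."""
    if not checks:
        raise ValueError("checklist must have at least one criterion")
    return scale * sum(bool(c) for c in checks) / len(checks)


def averaged_score(judgments):
    """judgments: dict mapping judge name -> binary outcomes for one response."""
    return mean(checklist_score(checks) for checks in judgments.values())


# Three hypothetical judges scoring one response against a four-item checklist.
example = {
    "judge_a": [1, 1, 1, 0],
    "judge_b": [1, 1, 1, 1],
    "judge_c": [1, 1, 0, 0],
}
print(averaged_score(example))
```

Averaging across judges means that one scorer's leniency toward a particular model moves the final number by only a third of its own bias.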

On average quality, GPT 5.5 led with 8.67 out of 10, followed by Gemini 3.1 at 8.52, Claude 4.7 at 8.47, Qwen 3 at 7.90, and Llama 4 at 6.71. Using the paper's threshold of seven as "good but improvable," GPT 5.5 cleared that bar on 98 percent of prompts, Gemini 3.1 on 94 percent, Claude 4.7 on 91 percent, Qwen 3 on 77 percent, and Llama 4 on 47 percent.

Consistency complicates the leaderboard. The authors measured whether evidence quotes from repeated runs entailed one another. At their threshold, Llama 4 had non-contradicting responses for 263 of 270 prompts, slightly ahead of GPT 5.5 at 262, Gemini 3.1 at 255, Claude 4.7 at 249, and Qwen 3 at 242. The model with the weakest average answers was the steadiest, while models with stronger averages still varied across runs in ways that could confuse action.
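The divergence is easiest to see when the two orderings sit side by side. The short script below uses only the figures quoted above; the ranking helper and the percentage conversion are added for illustration.

```python
N_PROMPTS = 270

# Average quality out of 10, as quoted above.
quality = {
    "GPT 5.5": 8.67,
    "Gemini 3.1": 8.52,
    "Claude 4.7": 8.47,
    "Qwen 3": 7.90,
    "Llama 4": 6.71,
}
# Prompts (out of 270) with non-contradicting repeated responses.
consistent = {
    "Llama 4": 263,
    "GPT 5.5": 262,
    "Gemini 3.1": 255,
    "Claude 4.7": 249,
    "Qwen 3": 242,
}


def ranks(scores):
    """Rank models from 1 (best) downward; these figures contain no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


q_rank, c_rank = ranks(quality), ranks(consistent)
for model in quality:
    rate = 100 * consistent[model] / N_PROMPTS
    print(f"{model:11s} quality #{q_rank[model]}  consistency #{c_rank[model]} ({rate:.1f}%)")
```

Llama 4 moves from last on quality to first on consistency, while the other four models keep their relative order and each drops one place.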

Governance Reading

The security prompt should be treated as a high-stakes support ticket. A deployer needs the user request category, policy path, retrieval or web-search status, model version, refusal or assistance decision, checklist criteria, source trail, and repeatability result. It is not enough to say the answer was generally good in one run.
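As a sketch of what such a ticket might look like in practice, the record below carries one field per item in the list above. Every field name, the example values, and the `is_repeatable` rule are hypothetical; a real deployment would define its own schema and thresholds.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class SecurityPromptRecord:
    request_category: str   # e.g. "authentication"
    policy_path: str        # which policy branch handled the request
    retrieval_used: bool    # retrieval or web-search status
    model_version: str
    decision: str           # "assist" or "refuse"
    checklist_criteria: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)
    repeat_runs: int = 0
    contradicting_runs: int = 0

    def is_repeatable(self) -> bool:
        """True only if the prompt was re-run and no run contradicted another."""
        return self.repeat_runs > 1 and self.contradicting_runs == 0


record = SecurityPromptRecord(
    request_category="authentication",
    policy_path="defensive-advice",
    retrieval_used=False,
    model_version="example-model-v1",
    decision="assist",
    checklist_criteria=["recommends multi-factor authentication"],
    repeat_runs=10,
)
print(json.dumps(asdict(record), indent=2))
```

A record with `repeat_runs` left at zero reports `is_repeatable()` as false, which keeps a single good run from being mistaken for repeatability evidence.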

This is also a dual-use control problem. The same surface can diagnose a scam, draft a counter-fraud response, harden an account, bypass a device restriction, probe model internals, or create offensive code. Over-refusal abandons people with real security problems. Under-refusal helps attackers. Evaluation should report defensive usefulness, harmful-request handling, and consistency together.

Limits

The paper is an arXiv preprint, not a settled field consensus. Its prompt source is WildChat, collected through a Hugging Face interface, so the user population may skew toward people already interested in technology or AI. The paper focuses on English prompts, not multilingual security questions. It studies stand-alone prompts rather than full multi-turn support sessions. Its quality scores also depend on the checklists: any aspect of answer quality that no checklist criterion covers goes unmeasured.

Audit Receipt

The audit-grade sentence is: Kim, Wu, Akgul, Bauer, and Christin identify 14,727 security-and-privacy prompts from WildChat, thematically analyze 450, evaluate five models on 270 advice-seeking prompts with 10 responses per model per prompt, and report that response quality and response consistency can diverge.

The practical receipt is: any LLM positioned as a security assistant should publish both answer-quality and repeatability evidence, because a private help desk that changes its mind can quietly become a new risk surface.

Sources

Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, and Nicolas Christin, Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond, arXiv:2606.18062 [cs.CL], submitted June 16, 2026.
Primary arXiv versions checked: experimental HTML, PDF, and arXiv DOI, reviewed for metadata, methods, results, and limitations.
Dataset source checked from the paper: WildChat-4.8M dataset card.
