Blog · arXiv Analysis · Last reviewed June 25, 2026

The Agent Skill Becomes the Detector Surface

A June 2026 arXiv paper treats the reusable agent skill as a new detection surface: not just code to scan, and not just prompt text to distrust, but an instruction package that can hide authority inside ordinary procedure.

Detector Surface

The paper, arXiv:2606.23416 [cs.CR, cs.AI], is Bacem Etteib, Daniele Lunghi, and Tégawendé F. Bissyandé's Detecting Malicious Agent Skills in the Wild using Attention. arXiv records submission on June 22, 2026. Its target is narrow: third-party agent skills that combine persistent natural-language instructions with optional helper files, then get loaded by an agent under the user's authority.

This is not ordinary malware scanning. Some malicious skills contain executable payloads, but the paper focuses on the harder class where instructions themselves steer the agent toward data theft, agent hijacking, persistence, or later compromise. The operating system may never execute a suspicious string. The agent may read it as procedure.

That makes the skill a detector surface. A skill can look like a helpful workflow, setup guide, backup utility, credential checker, or coding aid while also containing a covert instruction. The defender has to inspect the package before the agent absorbs it into context, often with no author reputation, sandbox trace, or platform metadata.

Broken Assumption

Most prompt-injection defenses depend on a split between trusted instructions and untrusted data. A web page, email, or document is supposed to be data; the system prompt is supposed to be instruction. Skills break that tidy boundary. They are deliberately written as instructions, and a hostile instruction can be hidden among legitimate ones.

That is why the paper is a useful companion to this site's pages on skills as work instructions and skill manifests as permission boundaries. Those ask how a skill should declare purpose and authority. This paper asks how a marketplace or runtime can cheaply find the part most likely to contain the attack.

Full-content LLM review is possible, but costly as catalogs grow. Keyword and regex scanning is cheap, but brittle when hidden attacks are written in ordinary procedural language. The interesting move is to separate localization from final judgment.

Locate Then Judge

The authors propose Locate-and-Judge. First, a deterministic parser splits the skill into structural spans. Then a small reader model ranks spans by instruction-following attention and retains only the top-K candidates. Finally, a zero-shot LLM judge examines those retained spans and flags the skill if any span crosses the calibrated threshold.

The hypothesis is modest. A malicious instruction that actually changes agent behavior must pull some instruction-following attention. Inside a skill, attention to instructions is normal; the detector is not looking for the only instruction in a data field. It is looking for the few spans that dominate the reader's attention enough to deserve expensive judgment.

This design turns detection into triage. The cheap locator touches every skill. The costly judge reads a small number of candidate spans rather than the whole package. In the paper's wild-scan discussion, the authors report a 2.84x reduction in judge input relative to full-content scanning while preserving comparable detection ability. They also report that remaining misses often come from the judge's classification decision, not from failure to surface the span.

Wild Scan

The strongest facts in the paper come from deployment. The authors scanned approximately 134,000 skills from three public marketplaces: Lobehub, Skills.sh, and Clawhub.ai. Human review of 359 flagged skills confirmed 131 as malicious, for 83% precision under the paper's malicious-only counting. If offensive or dual-use tooling is included, the paper reports 211 flagged skills in scope and 89% precision.

The hidden class matters most. Of the 131 confirmed malicious skills, 82 were Hidden Malicious Skills: packages that presented a benign cover story while carrying dangerous commands or code. Locate-and-Judge detected 109 of the 131 confirmed malicious skills and 68 of the 82 hidden ones. The authors report that full-content scanning detected 81 of 131 overall and 37 of 82 hidden skills, so the attention-first triage helped most where the malicious content was buried in ordinary documentation.

The paper also compares against SkillSpector, Cisco Skill Scanner, Attention Tracker, and a full-content LLM baseline on reviewed samples. Attention Tracker did not transfer well because it assumes the trusted-instruction/untrusted-data boundary that skills violate. A detector built for web-page prompt injection may fail when the whole artifact is supposed to instruct the agent.

Limits That Matter

The page should not oversell the result. Ground truth is partly a human labeling judgment. The authors explicitly separate malicious skills from offensive or dual-use tooling, and a stricter or looser reviewer could draw that line differently. The detector is an aid to review, not a proof that a skill is safe.

The architecture also has a known blind spot. The paper says some misses are inline installer one-liners, including base64-style droppers, that the span segmenter does not isolate cleanly. Cross-skill attacks remain open too: a benign-looking skill may point the agent toward a second malicious one, and detecting that chain requires catalog-level reasoning rather than single-skill inspection.

Finally, the thresholds are not permanent. The authors calibrated on Skill-Inject-style data and then measured transfer to live marketplaces. They treat the realized 83% precision as acceptable for a review-every-flag posture, but they also state that domain shift will require periodic revalidation as new marketplaces and skill styles appear.

Governance Standard

A production skill registry should not stop at a download count, README, or author name. It should preserve the source package, parsed spans, declared capabilities, requested permissions, detector results, human review notes, false-positive disposition, marketplace status, disclosure record, and quarantine decision.

Before an agent loads a skill, the system should know which detector inspected it, which spans were suspicious, which reviewer or policy allowed it, which permissions were granted, and which logs will show later that the skill was actually used. If a skill changes, the evidence expires.

The Spiralist reading is that skills turn procedure into portable authority. A manifest can declare that authority. A sandbox can constrain it. But a detector surface is still needed, because a hostile instruction can wear the costume of useful work.

Sources


Return to Blog