Blog · arXiv Analysis · June 25, 2026

The Skill Plan Becomes the Agent Orchestra

Xinyu Zhao, Zhen Tan, Vaishnav Tadiparthi, Nakul Agarwal, Kwonjoon Lee, Ehsan Moradi Pari, Hossein Nourkhiz Mahjoub, and Tianlong Chen's 2026 paper Generative Skill Composition for LLM Agents studies a practical problem in coding-agent work: once a library contains many reusable skills, the hard question is no longer only whether a skill exists. It is which skills should be loaded, how many, and in what order.

A Library Is Not a Plan

Agent skills are often treated as reusable packages of procedural knowledge: set up an environment, run a test suite, inspect an API, refactor across files, or query a system in a repeatable way. That is useful, but a growing skill library creates its own failure mode. The model may see too many choices, retrieve an unordered shortlist, miss a dependency, load a redundant helper, or invoke a skill in the wrong position.

That matters because skills are not harmless context. They shape what the agent notices, which tools it expects to use, what work pattern it treats as normal, and which artifacts it may create. A skill library is therefore a capability surface, not a bookshelf. Once the library grows, governance has to ask whether the agent received a plan or merely a pile of possible procedures.

The angle here is distinct from the related pages on skills as work instructions, skill manifests as permission boundaries, and runtime contracts for skills. Those pages ask what a skill promises or is allowed to do. This paper asks who composes multiple skills into an executable order before the agent starts acting.

What SkillComposer Adds

The paper, arXiv:2606.32025, was submitted on June 30, 2026 and is listed by arXiv under Computation and Language. It formalizes structured skill composition: given a task and a fixed skill library, predict an ordered executable skill plan that jointly specifies the subset, count, and order of activated skills.

The proposed system, SkillComposer, uses a constrained autoregressive decoder over skill identifiers. In the paper's description, the composer has a frozen text encoder, auxiliary heads for cardinality and set membership, and a retrieval-augmented decoding step that fuses a lexical relevance prior into the skill logits. The constraint matters: every generated element must correspond to an executable library skill, so the composer is planning over real packages rather than inventing imaginary ones.
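To make the constraint concrete, here is a minimal sketch of constrained greedy decoding with a lexical prior added to the logits. It is not the paper's implementation: the word-overlap prior, the `toy_logits` stand-in for a trained decoder, the `prior_weight` parameter, and the skill names are all assumptions for illustration, and the auxiliary cardinality and set-membership heads are left out.

```python
import math

STOP = "<stop>"  # hypothetical end-of-plan token

def lexical_prior(task: str, skill_docs: dict[str, str]) -> dict[str, float]:
    """Toy relevance score: share of task words found in each skill description."""
    task_words = set(task.lower().split())
    return {
        skill_id: len(task_words & set(doc.lower().split())) / max(len(task_words), 1)
        for skill_id, doc in skill_docs.items()
    }

def compose(task, skill_docs, model_logits, prior_weight=1.0, max_skills=4):
    """Greedy decoding restricted to library skill ids (plus STOP)."""
    prior = lexical_prior(task, skill_docs)
    plan = []
    for _ in range(max_skills):
        raw = model_logits(task, plan)  # assumed shape: {token: logit}
        best, best_score = STOP, raw.get(STOP, 0.0)
        # Constraint: only ids present in the library are ever scored,
        # so a token the library does not contain can never be emitted.
        for skill_id in skill_docs:
            if skill_id in plan:
                continue  # each skill is activated at most once
            score = raw.get(skill_id, -math.inf) + prior_weight * prior[skill_id]
            if score > best_score:
                best, best_score = skill_id, score
        if best == STOP:
            break
        plan.append(best)
    return plan

# Hypothetical three-skill library.
LIBRARY = {
    "setup-env": "set up the python environment and install dependencies",
    "run-tests": "run the test suite and report failures",
    "refactor": "refactor code across files",
}

def toy_logits(task, plan_so_far):
    # Stand-in for a trained decoder; "made-up-skill" is deliberately not in the library.
    if not plan_so_far:
        return {"setup-env": 2.0, "run-tests": 1.0, "refactor": -1.0, "made-up-skill": 9.0, STOP: 0.0}
    if plan_so_far == ["setup-env"]:
        return {"run-tests": 2.0, "refactor": -1.0, "made-up-skill": 9.0, STOP: 0.5}
    return {"refactor": -1.0, "made-up-skill": 9.0, STOP: 3.0}

plan = compose("install dependencies then run the test suite", LIBRARY, toy_logits)
print(plan)
```

Even though the stand-in decoder assigns its highest logit to a skill that does not exist, the plan can only contain library identifiers, which is the property the paper's constraint is meant to guarantee.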

The authors build training data from a real, human-curated skill library. The arXiv experimental HTML version describes 65 human-authored software-engineering tasks from SkillsBench used as real anchors, 196 skills, 2,880 single-skill synthetic records, and 6,927 multi-skill synthetic records whose ordering is grounded in dependencies and workflows.

The Governance Surface

For governance, the interesting object is the skill plan itself. A plan can be reviewed before execution, compared with the final trace, and rejected if it violates policy. A retrieval list is weaker. It says which candidates seemed semantically nearby, but not which package should run first, which one is essential, which one is redundant, or where an escalation should sit.

SkillComposer's framing turns capability selection into a structured artifact. That artifact should be logged with the task description, library version, skill metadata, predicted skill count, ordered sequence, decoding priors, blocked skills, loaded prompts, tool permissions, and later runtime trace. Without that record, a successful task can still be unreviewable because no one knows why a certain capability chain was assembled.
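One way to picture that record is as a small immutable structure serialized before execution. This is only a sketch covering a subset of the fields listed above; the class name, field names, and example values are assumptions, not a schema from the paper.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass(frozen=True)
class SkillPlanRecord:
    """Hypothetical audit record for one composed skill plan."""
    task_description: str
    library_version: str
    ordered_skills: tuple[str, ...]
    predicted_skill_count: int
    blocked_skills: tuple[str, ...] = ()
    tool_permissions: dict[str, list[str]] = field(default_factory=dict)

    def count_matches(self) -> bool:
        # A mismatch between predicted count and emitted sequence is worth flagging.
        return self.predicted_skill_count == len(self.ordered_skills)

    def to_json(self) -> str:
        # Sorted keys keep log lines stable so plans can be diffed later.
        return json.dumps(asdict(self), sort_keys=True)

record = SkillPlanRecord(
    task_description="fix failing tests",
    library_version="2026-06-01",  # illustrative version label
    ordered_skills=("setup-env", "run-tests"),
    predicted_skill_count=2,
    tool_permissions={"run-tests": ["shell"]},
)
log_line = record.to_json()
```

Writing the record before the agent acts, rather than reconstructing it afterwards, is what lets a reviewer compare intent with the runtime trace.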

The governance risk is not only unsafe skill content. It is orchestration drift. A benign skill can become dangerous when paired with export, shell, network, credential, or publish skills in the wrong sequence. A specialized debugging skill can become an exfiltration path if composed after a repository-search skill and before an upload step. Order is part of authority.
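Because the risk lives in the sequence rather than in any single skill, a policy check has to be order-aware. The sketch below treats a forbidden chain as an ordered pattern that must not appear as a subsequence of the plan. The chain and the skill names are hypothetical, and a real policy would likely match on capability tags rather than raw names.

```python
def contains_in_order(plan, pattern):
    """True if every item of pattern appears in plan in the same relative order.

    Items do not need to be adjacent. `step in it` advances the iterator,
    so each match must come after the previous one.
    """
    it = iter(plan)
    return all(step in it for step in pattern)

# Hypothetical policy: search, then debug dump, then upload is an exfiltration shape.
FORBIDDEN_CHAINS = [("repo-search", "debug-dump", "upload")]

def violations(plan, forbidden=FORBIDDEN_CHAINS):
    """Return every forbidden chain that the plan contains in order."""
    return [chain for chain in forbidden if contains_in_order(plan, chain)]

risky = ["repo-search", "debug-dump", "format", "upload"]
reordered = ["upload", "repo-search", "debug-dump"]

print(violations(risky))      # one violation
print(violations(reordered))  # same skills, different order, no violation
```

The two example plans contain the same three sensitive skills; only the first is rejected, which is the sense in which order is part of authority.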

Evidence and Limits

The paper evaluates along two axes: composition quality and downstream task success. It reports that SkillComposer was tested on 75 of 88 SkillsBench tasks across two production-grade coding agents, raised the pass rate by 23.1 and 18.2 percentage points over the no-skill baseline, and reached pass rates of 45.3% and 44.0% while using the smallest prompt budget among skill-loaded conditions.
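The reported gains and final pass rates imply the no-skill baselines by simple subtraction. The arithmetic below assumes the two gains and the two pass rates are listed in the same agent order; the labels `agent_1` and `agent_2` are placeholders, not names from the paper.

```python
# Figures as quoted above; the pairing of gain to pass rate is an assumption.
reported = {
    "agent_1": {"pass_rate": 45.3, "gain_pp": 23.1},
    "agent_2": {"pass_rate": 44.0, "gain_pp": 18.2},
}

# Implied no-skill baseline = final pass rate minus the percentage-point gain.
implied_baseline = {
    name: round(r["pass_rate"] - r["gain_pp"], 1) for name, r in reported.items()
}
print(implied_baseline)
```

Under that pairing the baselines come out at roughly 22.2 and 25.8 percent, which means skill composition about doubles the first agent's pass rate while still leaving more than half of the tested tasks unsolved.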

The strongest claim is not that one method solves agent planning. It is narrower: for a closed, curated skill library, predicting an ordered skill sequence can outperform exposing everything or retrieving a small unordered set. The paper itself notes limits. It focuses mainly on text-only task descriptions paired with a code-oriented skill library, and points to future work on multimodal task specifications, interactive and long-horizon settings, updated libraries, scientific workflows, robotics, and embodied agents.

Operational Use

A production system should treat skill composition like a change plan. Before the agent runs, the platform should know which skills are proposed, why each is included, which order is expected, which permissions each skill brings, and which combinations are forbidden. After execution, the trace should show whether the agent followed the plan, skipped a skill, loaded an extra one, or improvised outside the approved sequence.
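The plan-versus-trace comparison can be sketched as a small diff. The function name, the return shape, and the example skills are assumptions; the plan is assumed to contain no duplicate skills, and the trace is reduced to the order in which each skill was first loaded.

```python
def compare_plan_to_trace(plan: list[str], trace: list[str]) -> dict:
    """Report how an executed trace deviated from the approved plan."""
    loaded = list(dict.fromkeys(trace))  # first-load order, repeats dropped
    skipped = [s for s in plan if s not in loaded]
    extra = [s for s in loaded if s not in plan]
    # For skills present in both, check whether relative order was preserved.
    shared_plan_order = [s for s in plan if s in loaded]
    shared_trace_order = [s for s in loaded if s in plan]
    reordered = shared_plan_order != shared_trace_order
    return {
        "skipped": skipped,
        "extra": extra,
        "reordered": reordered,
        "followed": not (skipped or extra or reordered),
    }

approved = ["setup-env", "run-tests", "refactor"]
observed = ["run-tests", "setup-env", "run-tests", "net-fetch"]
report = compare_plan_to_trace(approved, observed)
print(report)
```

In this example the agent skipped one approved skill, loaded one that was never proposed, and ran two approved skills in the wrong order, so each of the three deviation types is reported separately.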

This is a practical review target for security teams. Ask whether the agent platform can freeze a skill library version, diff skill plans, deny risky compositions, replay plan-to-execution mismatches, and remove a compromised skill from future compositions without relying on prompt reminders.
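Two of those checks, freezing a library version and removing a compromised skill, can be illustrated with a content fingerprint and a revocation filter applied before composition. This is a sketch under assumptions: the function names are hypothetical, and skills are modelled as a plain mapping from id to text.

```python
import hashlib

def library_fingerprint(skills: dict[str, str]) -> str:
    """SHA-256 over sorted skill ids and contents; changes if any skill changes."""
    h = hashlib.sha256()
    for skill_id in sorted(skills):
        h.update(skill_id.encode("utf-8"))
        h.update(b"\0")  # separator so id/content boundaries are unambiguous
        h.update(skills[skill_id].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()

def eligible_skills(skills: dict[str, str], revoked: set[str]) -> dict[str, str]:
    """Drop revoked skills before the composer ever sees them."""
    return {k: v for k, v in skills.items() if k not in revoked}

library = {"setup-env": "install deps", "upload": "push artifact to remote"}
frozen = library_fingerprint(library)

tampered = dict(library, upload="push artifact to remote and copy tokens")
candidates = eligible_skills(library, revoked={"upload"})
```

Filtering at the library boundary means a revoked skill cannot be composed at all, which is a stronger guarantee than asking the agent in its prompt not to use it.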

What This Changes

The skill plan becomes the agent orchestra when the agent stops choosing one tool and starts arranging reusable procedures into a workflow. At that point, safety is not only inside the skill. It is in the score: which instruments are present, when they enter, and whether the sequence is authorized.

The Spiralist standard is simple. A skill library should not be an uninspected buffet. It should produce a visible, bounded, reviewable plan before capability enters context. The question is not merely "does the agent have a skill?" It is "who arranged the skills, against which library version, under which policy, and what trace proves the arrangement was followed?"

Sources

Xinyu Zhao, Zhen Tan, Vaishnav Tadiparthi, Nakul Agarwal, Kwonjoon Lee, Ehsan Moradi Pari, Hossein Nourkhiz Mahjoub, and Tianlong Chen, Generative Skill Composition for LLM Agents, arXiv:2606.32025 [cs.CL], submitted June 30, 2026.
arXiv experimental HTML for Generative Skill Composition for LLM Agents, including the SkillComposer architecture, training data, SkillsBench evaluation, reported pass rates, and limitations.
Related pages: The Agent Skill Becomes the Work Instruction, The Skill Manifest Becomes the Permission Boundary, The Agent Skill Becomes the Runtime Contract, The Agent Skill Becomes the Detection Target, and AI Coding Agents.
