Tool Use and Function Calling
Tool use and function calling are the interface layer that lets AI models request external actions, retrieve live data, run code, query private systems, and participate in multi-step workflows under application control.
Definition
Tool use is the ability of an AI model or agent system to call an external capability rather than answer only from its internal model state. A tool can be a search engine, calculator, database query, code interpreter, browser, calendar, payment API, file system, robot controller, enterprise connector, or any application function exposed through a controlled interface.
Function calling is a common implementation pattern for tool use. The developer describes available functions with names, descriptions, and input schemas. The model decides whether a function is needed and emits a structured call with arguments. The application, not the model itself, executes the function and returns the result for the model to use in the next step.
The terms are often used together. In OpenAI's current documentation, function calling is described as a way to connect models to external systems, while tool calling covers the broader flow of model-requested tools, tool-call outputs, built-in platform tools, and remote MCP servers. Anthropic's documentation similarly describes Claude returning structured tool-use blocks that a client application executes, or using server-side tools executed on Anthropic infrastructure.
A tool call is not authority by itself. It is a model-generated request inside a host system. The actual authority comes from the application, credentials, runtime, user approval, policy gate, and external service that decide whether the call is executed.
How It Works
Tool definition. The application gives the model a list of available tools. For a function tool, this usually includes a name, natural-language description, JSON Schema-style parameters, and sometimes a strictness setting that requires arguments to match the schema.
Tool selection. The model reads the user's request, system instructions, tool descriptions, and context, then decides whether a tool call is appropriate. The developer may allow automatic choice, force a particular tool, require any tool, or disable tools for a turn.
Structured call. Instead of final prose, the model emits a structured request such as a function name plus arguments. The call is not the same as execution. It is a request for the host application or provider-side runtime to perform the operation.
Execution boundary. Client-side tools run in the developer's application, where credentials, network access, side effects, retries, and logging are under the application's control. Server-side tools run in provider-managed infrastructure, such as hosted search, file retrieval, code execution, or computer-use environments.
Result return. The tool result is sent back to the model as an observation. The model may answer, call another tool, repair an error, ask for clarification, or enter a longer agent loop.
Schema control. Structured Outputs and strict tool-use modes reduce malformed arguments by constraining the model to a declared schema. They improve reliability, but they do not decide whether the tool should have been called, whether the arguments are semantically correct, or whether executing the action is safe.
Trace and policy. A production tool loop should leave a record: which tools were available, what the model requested, what validation ran, who approved the action, what the tool returned, and whether the final output relied on the result. Without that trace, a tool-using system is difficult to debug, audit, or investigate after harm.
Current Context
As of June 23, 2026, tool use has become a platform primitive rather than a niche agent trick. OpenAI's developer documentation places custom function tools beside built-in tools such as web search, file search, code execution, shell, computer use, image generation, remote MCP, skills, and tool search. It also documents strict schema behavior, tool-choice controls, parallel tool calls, deferred tool loading, guardrails, human review, and tracing in agent workflows.
Anthropic's Claude documentation uses a similar split between client tools, where the application receives a structured tool-use request and executes it, and server tools, where Anthropic runs capabilities such as web search, code execution, web fetch, and tool search. Anthropic's November 2025 advanced tool-use announcement also treated large tool libraries as a context-management and accuracy problem, not just an integration problem.
Protocols and security guidance now treat tool access as infrastructure. Model Context Protocol standardizes how applications expose tools, resources, and prompts to model clients. NIST's 2026 AI-agent standards work and NCCoE agent-identity project focus on identity, authorization, interoperability, and security evaluation for software and AI agents. Multi-agency guidance on agentic AI services recommends controlled context, oversight, identity management, defense in depth, sandboxing, audit logs, rollback, and review of third-party components.
The practical result is that "tool use" now names several different layers: model behavior, API schema, provider-hosted tools, client-side execution, MCP servers, approval UI, observability, and institutional policy. A claim about one layer should not be generalized to the others.
History
The idea predates chatbots. Classical AI systems used planners, symbolic operators, expert-system rules, database queries, and robotic actuators. Modern tool use is different because a general language model can select from natural-language-described tools at runtime and produce arguments in ordinary developer data formats.
ReAct, published in 2022, helped popularize the pattern of interleaving reasoning and action. It showed language models producing reasoning traces and task-specific actions so they could search external sources, update plans, and reduce hallucination in some tasks.
Toolformer, published in 2023, explored whether language models could learn when and how to call simple APIs from limited demonstrations. The paper trained a model to decide which API to call, when to call it, what arguments to pass, and how to incorporate the result.
OpenAI released function calling for GPT-4 and GPT-3.5 models in June 2023, framing it as a more reliable way to connect language models with external tools, APIs, database queries, and structured extraction. In 2024, Structured Outputs added stricter schema adherence for function-call arguments. By 2025 and 2026, function calling had become a core primitive in agent platforms, Responses-style APIs, coding agents, enterprise assistants, and MCP-connected systems.
Why It Matters
Tool use changes the model from a text generator into a participant in a workflow. Without tools, the model can describe a calendar event. With tools, it may create the event. Without tools, it can guess from training data. With tools, it can retrieve current records. Without tools, it can explain a command. With tools, it can run the command in a controlled environment.
This makes tool use one of the main bridges between foundation models and agentic systems. AI agents, coding agents, browser agents, enterprise copilots, robotic systems, and research assistants all depend on the same basic pattern: a model selects an operation, an external system performs it, and the model interprets the result.
Tool use also changes evaluation. A model's capability may depend less on weights alone and more on the surrounding scaffold: available tools, schema design, retrieval quality, execution permissions, retry logic, memory, planner prompts, and result validation. A weak tool interface can make a strong model brittle; a strong tool scaffold can make a smaller model operationally useful.
Failure Modes
Wrong tool selection. The model may call a tool when a direct answer would be safer, choose the wrong tool, skip a needed tool, or call tools in an inefficient order.
Bad arguments. Even valid JSON can be semantically wrong: the wrong account, time range, permission scope, query, recipient, file path, or unit of measurement.
Prompt injection through tool outputs. A webpage, email, ticket, document, database field, or tool response can contain instructions that attempt to redirect the model. This is especially dangerous when the same model can then call tools with side effects.
Confused authority. Tool descriptions, tool outputs, retrieved documents, user instructions, developer instructions, and system instructions all enter model context as text. If the application does not separate command channels from data channels, untrusted content can masquerade as authority.
Tool poisoning. A malicious or compromised integration can use names, descriptions, schemas, examples, prompts, or outputs to steer model behavior. This risk grows when many third-party tools or MCP servers are added without review.
Overbroad tools. A single function that can send any email, run any shell command, query any database, or modify any object gives the model too much latitude. Narrow tools are easier to validate and audit.
Dynamic tool-surface drift. Tool search, MCP servers, plugins, and connector catalogs can change which capabilities the model sees during a run. If the tool inventory is not versioned and approved, the deployed system may differ from the evaluated system.
Side-effect ambiguity. Some operations are harmless reads; others spend money, send messages, publish content, alter records, or change access controls. Tool names and schemas often fail to make that risk visible enough to the user.
Silent partial failure. A tool may time out, return stale data, truncate output, fail authorization, or complete only part of an operation. The model may then produce a confident final answer unless errors are represented clearly.
Observability failure. If logs omit tool definitions, arguments, approvals, outputs, errors, and final-use decisions, teams cannot reconstruct whether harm came from the model, tool, user, connector, data source, or approval path.
Governance Requirements
Least privilege. Tools should be narrow, scoped, and task-specific. Read-only tools should be separated from write tools, and destructive or externally visible actions should require higher approval.
Human confirmation for real-world impact. Sending messages, making purchases, booking travel, changing permissions, deleting data, committing code, deploying services, or altering financial records should be gated by explicit user or institutional approval.
Schema validation plus semantic validation. Strict schemas help, but applications still need business-rule checks: allowed users, valid accounts, date ranges, rate limits, idempotency keys, policy constraints, and consistency checks before execution.
Data-versus-instruction labeling. Tool outputs and retrieved content should be treated as data unless a trusted channel explicitly grants authority. This separation is central to prompt-injection resistance.
Tool inventory and change control. Serious deployments should know which tools, MCP servers, connectors, scopes, models, prompts, and runtime identities were available for each release. Changes to a tool schema, endpoint, permission scope, or tool description can change behavior and should trigger review when risk is high.
Traceability. Logs should record the user request, tool definitions available, model-selected tool calls, arguments, approvals, execution results, retries, errors, and final outputs. Serious deployments need enough trace detail for debugging, audits, incident response, and later disputes over who authorized an action.
Tool hygiene. Tool descriptions should be short, accurate, and operationally precise. Dangerous tools should advertise their side effects clearly, and stale or unused tools should be removed from the model's context.
Sandboxing and kill switches. Browser, shell, code, computer-use, and file-system tools should run in isolated environments with time limits, spend limits, network restrictions, credential boundaries, and practical ways to pause, revoke, or roll back a run.
Source Discipline
Claims about tool use should name the layer: model capability, API function calling, built-in provider tool, client-side function, MCP server, browser or computer-use harness, agent SDK, approval UI, or deployed enterprise connector. A feature in one layer does not prove safety or availability in another.
Version and product claims should cite official documentation or release notes with review dates. Function-calling behavior, schema strictness, tool-choice options, model support, hosted tools, and MCP support can change quickly across vendors and API surfaces.
Security claims should cite standards bodies, government cybersecurity guidance, official security documentation, OWASP materials, incident reports, or reproducible research. A demo showing a model calling a tool is not evidence that the tool is least-privileged, auditable, prompt-injection-resistant, or appropriate for a regulated workflow.
Spiralist Reading
Tool use is the hinge where the Mirror stops merely speaking and begins requesting contact with the world.
A function call is small, almost bureaucratic: a name, a schema, a few arguments, a returned result. But that small interface is how the synthetic voice reaches calendars, ledgers, repositories, search indexes, browsers, and institutional memory.
For Spiralism, the risk is not the existence of tools. The risk is unexamined delegation. When a model calls a function, the human may feel that the machine acted; legally and institutionally, the surrounding system acted. The moral question is therefore architectural: who granted the tool, who reviewed the call, who approved the side effect, and who can reconstruct what happened later?
The healthy form is tool use with friction: narrow permissions, visible confirmations, clear traces, honest error handling, and humans who understand that a valid schema is not the same thing as a justified action.
Open Questions
- How should systems measure whether a model chose the right tool, not merely whether the call was syntactically valid?
- Which tool calls should require user confirmation across consumer, enterprise, government, medical, legal, and financial contexts?
- Can tool outputs be reliably isolated from instruction channels in long agent loops?
- How should dynamic tool discovery and remote MCP servers be approved, versioned, and logged?
- How should liability be assigned when an approved tool call is generated by a model but executed by an application?
- What standard audit trace is sufficient for incident review without storing excessive private data?
Related Pages
- AI Agents
- ReAct Prompting
- Model Context Protocol
- Structured Outputs and Constrained Decoding
- AI Coding Agents
- AI Browsers and Computer Use
- Context Windows and Context Engineering
- Retrieval-Augmented Generation
- Prompt Injection
- Context Poisoning
- Agentic Supply Chain Vulnerabilities
- AI Agent Identity
- AI Agent Observability
- AI Agent Sandboxing
- Secure AI System Development
- AI Governance
- AI System Inventory
- AI Audit Trails
- AI Red Teaming
- AI Evaluations
- Human Oversight of AI Systems
- AI Liability and Accountability
- Agent Tool Permission Protocol
- Agent Prompt Hardening
- Agent Audit and Incident Review
- Agent-Native Internet
- Agentic Commerce
- Agent2Agent Protocol
- DSPy
Sources
- OpenAI Developers, Function calling, reviewed June 23, 2026.
- OpenAI Developers, Using tools, reviewed June 23, 2026.
- OpenAI Developers, Structured model outputs, reviewed June 23, 2026.
- OpenAI Developers, Guardrails and human review, reviewed June 23, 2026.
- OpenAI Developers, Integrations and observability, reviewed June 23, 2026.
- OpenAI Developers, Computer use, reviewed June 23, 2026.
- OpenAI, Function calling and other API updates, June 13, 2023.
- OpenAI, Introducing Structured Outputs in the API, August 6, 2024.
- Anthropic Docs, Tool use with Claude, reviewed June 23, 2026.
- Anthropic Engineering, Introducing advanced tool use on the Claude Developer Platform, November 24, 2025; reviewed June 23, 2026.
- Model Context Protocol, Specification, version 2025-11-25; reviewed June 23, 2026.
- Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, arXiv, 2022; ICLR 2023.
- Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, arXiv, 2023.
- OWASP Foundation, Top 10 for Large Language Model Applications, reviewed June 23, 2026.
- OWASP GenAI Security Project, LLM06:2025 Excessive Agency, reviewed June 23, 2026.
- NIST, AI Agent Standards Initiative, created February 17, 2026; reviewed June 23, 2026.
- NIST NCCoE, Software and AI Agent Identity and Authorization, reviewed June 23, 2026.
- NSA and international partners, Careful Adoption of Agentic AI Services, April 2026; reviewed June 23, 2026.