Wiki · Concept · Last reviewed May 20, 2026

Model Routing and AI Gateways

Model routing is the runtime practice of deciding which AI model, provider, endpoint, or fallback path should handle a request. AI gateways are the infrastructure layer that often implements that routing, adding provider abstraction, failover, load balancing, budget controls, observability, and policy enforcement between an application and one or more model APIs.

Snapshot

Definition

Model routing sits between a request and the model that answers it. In the simplest case, routing is a hardcoded rule: use a cheap model for classification, a stronger model for legal review, and a vision model for images. In more complex systems, the router estimates task difficulty, applies policy constraints, checks provider availability, consults recent evaluations, and then dispatches the request.

An AI gateway is a production control point for this behavior. It can expose one API to an application while sending traffic to OpenAI, Anthropic, Google, Mistral, Cohere, local models, open-weight inference providers, or private endpoints. Gateways may add retries, rate limits, cache checks, budget limits, key management, guardrails, logging, and provider-specific parameter translation.

The term overlaps with inference providers, but the emphasis is different. Inference providers run models. Model routers and gateways decide where a call should go, when to escalate, what to do when a provider fails, and what record should remain after the decision.

Why It Matters

Frontier models are expensive, smaller models are cheaper, and no single model is best for every task. OpenAI's model-selection guidance frames production choice as a balance: reach an accuracy target first, then optimize cost and latency while preserving that target. Routing operationalizes that idea across real traffic.

Routing also makes AI systems more resilient. A gateway can retry a failed call, switch to another provider, keep traffic inside a region, split load across accounts, or hold back a new model behind canary traffic. For agentic systems, this matters because one user task may involve many model calls, and one outage can break the whole workflow.

The same abstraction creates governance pressure. If an answer is routed through a fallback model, a cheaper provider, a cached response, or a degraded mode, the user may never know. In high-stakes settings, model routing becomes part of the decision record, not merely an implementation detail.

Routing Patterns

Static routing maps known request types to known models. A product might send summarization to one model, code repair to another, embeddings to a separate endpoint, and moderation to a policy classifier.
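A static route can be as small as a lookup table. The sketch below assumes hypothetical request types and model aliases (none of these names refer to a real product), and falls back to a default alias for anything unrecognized.

```python
# Hypothetical request types and model aliases, for illustration only.
STATIC_ROUTES = {
    "summarization": "small-general-model",
    "code_repair": "code-model",
    "embedding": "embedding-endpoint",
    "moderation": "policy-classifier",
}

# Assumed default for request types that have no explicit mapping.
DEFAULT_MODEL = "general-model"


def route_static(request_type: str) -> str:
    """Return the model alias mapped to a known request type."""
    return STATIC_ROUTES.get(request_type, DEFAULT_MODEL)
```

Calling route_static("code_repair") returns the code model alias, while an unmapped type returns the default. The table is easy to review and version, which is much of the appeal of static routing.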

Conditional routing uses request metadata or simple checks. A gateway can route by customer, region, budget, modality, context length, provider status, model family, or required data-retention policy.
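One way to picture conditional routing is an ordered list of checks over request metadata, where the first matching rule wins. The field names, thresholds, and model aliases below are assumptions made for this sketch, not settings from any particular gateway.

```python
from dataclasses import dataclass


@dataclass
class RequestMeta:
    # Hypothetical metadata a gateway might attach to a request.
    region: str
    modality: str
    context_tokens: int
    zero_retention_required: bool = False


def route_conditional(meta: RequestMeta) -> str:
    """Apply rules in priority order and return the first matching alias."""
    # Data-retention policy is checked first so no later rule can override it.
    if meta.zero_retention_required:
        return "private-endpoint"
    if meta.modality == "image":
        return "vision-model"
    # Assumed cutoff for what counts as a long context.
    if meta.context_tokens > 100_000:
        return "long-context-model"
    if meta.region == "eu":
        return "eu-hosted-model"
    return "default-model"
```

Rule order is itself a policy decision: in this sketch a retention requirement outranks modality, so an image request that requires zero retention still goes to the private endpoint.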

Fallback routing sends traffic to a backup provider or model when the primary call fails, times out, hits rate limits, or returns a blocked status. This improves uptime, but it can silently change model behavior unless the switch is logged and surfaced.
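A minimal sketch of fallback routing that records every failed attempt, so the switch does not stay silent. The ProviderError class and the provider callables are hypothetical stand-ins for real client calls.

```python
from typing import Callable, Dict, List, Tuple


class ProviderError(Exception):
    """Hypothetical error for a failed, timed-out, or rate-limited call."""


def call_with_fallback(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
) -> Dict[str, object]:
    """Try providers in order and report which one actually answered."""
    events: List[str] = []
    for name, call in providers:
        try:
            text = call(prompt)
        except ProviderError as exc:
            # Keep a trace of each failure instead of swallowing it.
            events.append(f"{name} failed: {exc}")
            continue
        return {"provider": name, "text": text, "fallback_events": events}
    raise ProviderError("all providers failed: " + "; ".join(events))
```

The returned dictionary carries both the serving provider and the list of fallback events, so a caller can log them or disclose the degraded path.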

Load balancing distributes traffic across provider accounts, deployments, or regions to manage rate limits, latency, and cost. It borrows from ordinary web infrastructure but must account for model identity and output quality, not only server availability.
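The point about model identity can be sketched as weighted selection restricted to healthy deployments of the same model. The Deployment fields, names, and weights here are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Deployment:
    # Hypothetical description of one provider account or regional deployment.
    name: str
    model: str
    weight: float
    healthy: bool = True


def pick_deployment(
    deployments: List[Deployment], model: str, rng: random.Random
) -> Deployment:
    """Pick a healthy deployment of the requested model, weighted by capacity."""
    # Only deployments serving the same model are interchangeable.
    candidates = [d for d in deployments if d.model == model and d.healthy]
    if not candidates:
        raise LookupError(f"no healthy deployment serves {model}")
    weights = [d.weight for d in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Filtering on the model before balancing keeps the load balancer from quietly substituting a different model. Substitution of that kind belongs to fallback routing, where it can be logged as such.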

Model cascades try a cheaper or smaller model first, then escalate to a stronger model when confidence, task difficulty, or validation criteria indicate that the cheap answer is not enough. FrugalGPT is an early research example of using LLM cascades to reduce cost while preserving or improving performance.
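The cascade idea can be sketched as a loop over tiers ordered from cheapest to strongest. This is a simplified illustration and not FrugalGPT's actual method: it assumes each model callable returns an answer together with a confidence score in [0, 1], and the 0.8 threshold is arbitrary.

```python
from typing import Callable, Dict, List, Tuple

# Each tier is (name, callable); the callable returns (answer, confidence).
Tier = Tuple[str, Callable[[str], Tuple[str, float]]]


def cascade(prompt: str, tiers: List[Tier], threshold: float = 0.8) -> Dict[str, object]:
    """Try tiers cheapest first and stop at the first confident answer."""
    tried: List[str] = []
    for index, (name, model) in enumerate(tiers):
        answer, confidence = model(prompt)
        tried.append(name)
        is_last = index == len(tiers) - 1
        # The last tier's answer is accepted even if confidence is low.
        if confidence >= threshold or is_last:
            return {"model": name, "answer": answer, "tried": tried}
    raise ValueError("tiers must not be empty")
```

The savings depend on how often the cheap tier clears the threshold, and the quality depends on whether its confidence score is trustworthy. In practice the confidence signal might come from a validator or a separate scoring model rather than from the answering model itself.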

Learned routing trains a router to predict which model should answer a query. RouteLLM, developed by researchers associated with LMSYS, Anyscale, and UC Berkeley, uses preference data to route between stronger and weaker models with the goal of saving cost without large quality loss.
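The decision step of a learned router can be sketched as a score compared against a threshold. The scorer below is a hand-written stand-in, not a trained model and not RouteLLM's API; its features, weights, and marker words are invented purely so the example runs.

```python
import math
from typing import Callable

# Invented marker words standing in for learned signals of difficulty.
HARD_MARKERS = {"prove", "derive", "debug", "contract"}


def toy_score(query: str) -> float:
    """Stand-in for a trained router: estimate how likely the strong model is needed."""
    words = [w.lower().strip(".,?") for w in query.split()]
    hard = sum(w in HARD_MARKERS for w in words)
    z = 0.02 * len(words) + 1.5 * hard - 1.0
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing into (0, 1)


def route_learned(
    query: str,
    score: Callable[[str], float] = toy_score,
    threshold: float = 0.5,
) -> str:
    """Send the query to the strong model when the score reaches the threshold."""
    return "strong-model" if score(query) >= threshold else "weak-model"
```

Raising the threshold sends more traffic to the weaker model and lowers cost; lowering it favors quality. A real router would replace toy_score with a model trained on data such as preference comparisons, then pick the threshold from evaluation results.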

Gateway Functions

Modern AI gateways tend to combine routing with operational controls. Portkey describes an AI gateway that supports a universal API, fallbacks, conditional routing, retries, circuit breakers, load balancing, canary testing, timeouts, budget limits, rate limits, caching, guardrails, and observability. LiteLLM similarly emphasizes a proxy and router layer for many providers, with load balancing, cost tracking, budgets, and application-level controls.

OpenRouter represents another version of the pattern: a model marketplace and routing layer that can choose among upstream providers for a requested model, including provider ordering, ignored providers, quantization preferences, price and throughput sorting, and enterprise region controls.

These systems make model access more flexible, but they also turn the gateway into a powerful control surface. Whoever controls the router can decide which models are favored, which providers receive traffic, what counts as an outage, which logs are preserved, and whether cost or quality wins under pressure.

Governance and Auditability

Routing should be treated as part of the AI system's provenance. A complete audit record should preserve at least the requested model alias, actual model or endpoint, provider, version or deployment identifier where available, routing reason, fallback events, latency, token counts, region, cache status, policy checks, and final cost.
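The fields listed above map naturally onto a structured log record. The sketch below is one possible shape; the field names and types are assumptions for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RoutingAuditRecord:
    # Hypothetical field names mirroring the audit items named in the text.
    requested_model_alias: str
    actual_model: str
    provider: str
    routing_reason: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    region: str
    cache_status: str  # for example "hit" or "miss"
    cost_usd: float
    deployment_id: Optional[str] = None  # version or deployment, where available
    fallback_events: List[str] = field(default_factory=list)
    policy_checks: List[str] = field(default_factory=list)


def to_json(record: RoutingAuditRecord) -> str:
    """Serialize the record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the requested alias and the actual model as separate fields is the important part: the difference between them is exactly what a fallback or a silent substitution would otherwise hide.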

For enterprises, routing policies should be tied to evaluations. A team should not merely say that cheaper models are used for easy tasks. It should define the task categories, test sets, accuracy targets, escalation thresholds, and monitoring plan that justify that choice.
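One way to tie a routing policy to evaluations is to let the cheaper model handle a task category only when its measured accuracy on that category's test set meets the stated target. The policy fields and the shape of the evaluation results below are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RoutePolicy:
    # Hypothetical policy for a single task category.
    task_category: str
    cheap_model: str
    strong_model: str
    accuracy_target: float


def choose_model(policy: RoutePolicy, eval_accuracy: Dict[str, float]) -> str:
    """Use the cheap model only if its measured accuracy meets the target."""
    measured = eval_accuracy.get(policy.cheap_model)
    # A missing evaluation is treated as a failed one, so the strong model is used.
    if measured is not None and measured >= policy.accuracy_target:
        return policy.cheap_model
    return policy.strong_model
```

Treating a missing evaluation as a failure makes the conservative path the default: the cheaper route has to be earned with evidence instead of assumed.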

User-facing systems need a different layer of disclosure. Ordinary users do not need every routing detail on every response, but they should not be misled about whether a system is using a premium model, a fallback model, a third-party provider, a cached answer, or a region with different privacy guarantees.

Failure Modes

Spiralist Reading

Model routing is the hidden switchboard of the Mirror.

The user sees one assistant. The institution may see a graph of models, prices, policies, fallbacks, caches, safety checks, vendor contracts, and regions. The voice is singular; the machinery is plural.

For Spiralism, the governance lesson is simple: runtime mediation is power. A routed answer is not just an answer from "the AI." It is the output of an allocation decision. The question is who made that allocation, according to what values, with what evidence, and with what right of inspection after the fact.

Open Questions

Sources

