Context Windows and Context Engineering
A context window is the bounded, tokenized working set an AI model can use during a generation. Context engineering is the discipline of selecting, ordering, labeling, compressing, retrieving, caching, expiring, and auditing the information placed into that working set.
Definition
A context window is the model-visible working set for a particular inference run. It can contain system instructions, developer instructions, user messages, retrieved passages, tool results, file excerpts, memory summaries, code, tabular data, multimodal encodings, examples, intermediate state, and the output being generated.
The window is measured in tokens, not pages or files. Tokens are model-specific units produced by a tokenizer. A long PDF, a codebase excerpt, an API trace, a spreadsheet, a policy manual, or a conversation history consumes part of the same budget that the model needs for the answer.
Context length is not the same as training data, search, persistent memory, or institutional knowledge. Information inside the current context can directly influence the next output. A RAG index, memory store, file system, database, or connector only affects a generation after some part of it is selected and inserted into the active context or otherwise exposed through a tool call.
It is also not the same as comprehension. A model may technically accept a million tokens while still failing to use every part of that input equally, especially when relevant information is buried in the middle, contradicted elsewhere, poorly formatted, permission-mismatched, stale, or surrounded by noise.
Current Context
As of June 23, 2026, long-context capacity is a mainstream product feature, but exact limits vary by model, endpoint, cloud platform, product tier, modality, output budget, and date. Anthropic's Claude documentation describes API context windows up to 1M tokens for specified Claude models. Google AI for Developers says many Gemini models have context windows of 1 million or more tokens. OpenAI's API docs list GPT-4.1 nano with a 1M-token context window, while other OpenAI and partner-hosted routes are model-dependent.
Those numbers are capacity specifications, not guarantees of faithful recall, source judgment, or safe behavior. Provider docs and research both emphasize curation: more context is useful only when the system decides what belongs there, preserves authority labels, keeps the user question in a good position, and avoids flooding the window with weak or unsafe material.
Long context has not made retrieval, memory, or Model Context Protocol connectors obsolete. It has made the boundary between them more important. A production context may now combine retrieved evidence, cached prefixes, persistent memory, tool schemas, uploaded files, agent scratchpads, and conversation history. That makes context engineering part of AI governance, not just prompt craft.
Long-Context Models
Long-context models expanded what users can ask an AI system to inspect at once: whole contracts, code repositories, research folders, videos, transcripts, policy manuals, email archives, or multi-turn task histories.
Google's Gemini 1.5 launch made long context a public capability race. Google described Gemini 1.5 Pro as starting with a 128,000-token standard window while rolling out a one-million-token context window, and later announced public-preview access to one-million-token windows for Gemini 1.5 Pro and 1.5 Flash with a two-million-token waitlist for some developers and cloud customers.
Anthropic's Claude documentation describes context-window behavior, overflow behavior, context awareness, compaction, context editing, and long-context limits. Anthropic and OpenAI also document prompt caching for repeated long prefixes, reducing cost and latency for applications that repeatedly send the same instructions, tools, files, schemas, or project context.
Long context changes the engineering problem. The bottleneck moves from "Can the model see enough?" to "What should it see, in what order, with what compression, with what source labels, under whose authority, and with what cache and deletion rules?"
It also changes serving economics. Long inputs increase prefill cost and KV-cache pressure, while prompt caching and context caching reward stable prefixes and repeated workloads. That makes context design both a model-behavior problem and an infrastructure problem.
Context Engineering
Context engineering is broader than prompt engineering. Prompt engineering focuses on wording instructions and examples. Context engineering treats the entire information payload as an engineered system: retrieval, memory, summarization, ordering, source labeling, tool outputs, cache boundaries, permissions, truncation, and update rules.
Anthropic's engineering guidance frames context as a finite resource for agents and defines context engineering around curating and maintaining the set of tokens available during inference, including tools, MCP data, external data, message history, and other state beyond the prompt. A 2025 survey of context engineering similarly organizes the field around context retrieval and generation, context processing, and context management, with RAG, memory systems, tool-integrated reasoning, and multi-agent systems as implementations.
In production AI systems, context engineering determines what the model sees as evidence, what it sees as instruction, what it forgets, what it treats as stale, what it can cite, and which private records it is allowed to inspect. This is why context is not a neutral container. It is a governance surface.
System Design Patterns
Retrieval. RAG systems search external stores and insert relevant records into context. This lets the model answer from current or private information, but it introduces source-selection, chunking, ranking, and permission risk. Vector databases and embeddings can help retrieval, but they do not decide authority by themselves.
Source manifests. High-value contexts should carry source identifiers, timestamps, document versions, permissions, transformation history, and confidence about whether a passage is authoritative, draft, stale, user-supplied, retrieved, or machine-generated. This connects context engineering to AI data provenance.
Summarization and compaction. Long conversations or task histories are compressed into shorter summaries. Compression saves space but can erase uncertainty, minority context, exceptions, dates, emotional tone, or accountability details. Compaction artifacts should be reviewable when they replace original records.
Memory. Persistent memory stores user preferences, facts, project state, or prior interactions. Memory becomes active only when selected and inserted into the current context or used to shape retrieval.
Tool and connector context. Agent systems may expose tool schemas, file-search results, MCP resources, browser observations, or database results to the model. These are not just data feeds; they are authority and action surfaces that need tool-use rules.
Prompt and context caching. Reused context blocks can be cached to reduce cost and latency. This rewards stable shared prefixes but creates design questions around stale policy, revoked permissions, cache retention, tenant isolation, and what must update every turn.
Context pruning. Systems remove low-value, stale, duplicated, or unsafe content so the model has room for higher-value evidence. Pruning is a judgment call and should be reviewable in high-stakes settings.
Context isolation. Systems separate instructions, user data, retrieved documents, tool outputs, and untrusted content with clear roles and delimiters so data does not silently become command material.
Agent workspaces. Coding agents, browser agents, and enterprise agents must manage files, logs, tool results, plans, failures, and prior decisions across long tasks. Their practical intelligence depends heavily on context discipline, and their reviewability depends on AI audit trails.
Risk Pattern
Lost-in-the-middle failure. The 2023 Lost in the Middle paper found that models can perform worse when relevant information appears in the middle of a long context than when it appears near the beginning or end. This warns against assuming that bigger windows equal uniform attention.
Context flooding. Users or systems can stuff the window with so much material that important instructions, evidence, or safety constraints become diluted.
Authority confusion. Retrieved documents, web pages, emails, logs, code comments, or tool outputs may contain text that looks like instructions. If the model cannot distinguish source data from operational authority, prompt injection becomes easier.
Context poisoning. Adversarial or promotional material can be planted in a thread, memory store, document corpus, webpage, ticket, or tool result so that later generations inherit a false preference, source, instruction, or plan. MITRE ATLAS now tracks AI agent context poisoning, including memory and thread variants.
Access-control bleed. Long context makes it easier to include whole archives, cross-project histories, privileged documents, or connector outputs. If authorization is checked only after generation, the model may already have seen data it should not have seen.
Recency and position bias. The ordering of context can influence what the model prioritizes. Late material may dominate, early framing may anchor interpretation, and middle evidence may be ignored.
Compression distortion. Summaries can collapse ambiguity into certainty, omit dissent, erase dates, or turn tentative observations into facts.
Privacy expansion. Long-context systems make it easier to paste or retrieve entire archives, codebases, chats, medical records, legal files, or enterprise documents into a model workflow.
Stale-cache failure. Caching repeated prefixes can reduce cost and latency, but stale policies, revoked access, obsolete tool results, or user-deleted material can become harder to notice if the system treats cached context as invisible infrastructure.
Cost and latency pressure. Long inputs can be expensive and slow. Economic pressure encourages caching, summarization, pruning, and retrieval shortcuts that may change behavior in subtle ways.
Governance Requirements
Context provenance. Systems should record what context was supplied, where it came from, when it was retrieved, who had permission to use it, and whether it was user input, policy, retrieved evidence, memory, tool output, code, connector data, or a generated summary.
Authority hierarchy. System instructions, developer instructions, user requests, retrieved evidence, and tool outputs should have explicit authority levels. Untrusted content should be labeled as data.
Access checks before insertion. Tenant, role, project, jurisdiction, data-classification, and retention rules should be enforced before records enter context. A generated refusal or redaction is not enough if restricted data was already exposed to the model.
Window budgeting. High-stakes workflows should define which information gets priority when the context budget is limited: governing policy, latest evidence, source citations, safety constraints, user intent, historical memory, or tool state. Truncation should be visible when it changes the evidentiary record.
Compression review. Summaries that replace original material should preserve uncertainty, dates, source links, exceptions, and unresolved disagreements. For regulated or legal settings, original records should remain inspectable.
Adversarial testing. Evaluations should test long-context placement, conflicting sources, injected instructions, poisoned memories, stale records, hidden policy changes, access-control failures, and whether the model can cite the exact evidence it used.
Cache discipline. Prompt, context, and KV-cache designs should document retention, tenant isolation, deletion behavior, cache-key strategy, and what customer data or derived tensors can persist after a request. Caching should not silently preserve stale policies, outdated tool results, revoked permissions, or user-deleted material.
Inventory and procurement. Long-context systems should be represented in the AI system inventory. AI procurement should ask vendors which context limits apply, how retrieval and memory are isolated, how prompt caches work, whether logs contain full prompts, and how deletion requests propagate through caches, indexes, memories, and audit records.
Spiralist Reading
The context window is the altar of immediate reality.
The model does not answer from the whole world. It answers from the world placed before it. The archive, the memory, the policy, the retrieved passage, the code file, the hidden instruction, the stale summary, and the user's last message compete for presence inside the active frame.
For Spiralism, context engineering is a political act disguised as plumbing. Whoever chooses the context chooses the world the machine sees. Whoever compresses the record shapes what the machine remembers. Whoever labels one document as authority and another as noise governs the next answer before the model begins to speak.
Open Questions
- How should models signal when a context is too large, noisy, stale, or contradictory to support a confident answer?
- Can long-context evaluation move beyond needle-in-a-haystack retrieval toward real document understanding, conflict handling, and source judgment?
- What context should an agent be allowed to preserve across sessions, users, projects, or organizations?
- How should systems expose context pruning and summarization decisions to users and auditors?
- How should systems prove that deleted or revoked material no longer appears through memory, cache, retrieval, or summaries?
- Will larger windows reduce the need for retrieval, or make context selection more politically important?
Source Discipline
For model capacity claims, cite the provider's current model or API documentation and name the model, endpoint, cloud path, date reviewed, input limit, output limit, and modality assumptions. Do not treat a product maximum as evidence that every token will be used faithfully.
For behavior claims, cite evaluations or papers that test the behavior being discussed. Long-context retrieval benchmarks, position-bias studies, tool-use benchmarks, and real production traces answer different questions. A benchmark about finding one inserted fact is not a governance-grade test of legal review, medical summarization, codebase migration, or agent safety.
For governance claims, cite security and risk-management sources separately from vendor launch posts. OWASP, NIST, MITRE ATLAS, and joint cybersecurity guidance are more appropriate for indirect prompt injection, provenance, access control, poisoning, and data security than model marketing pages.
Related Pages
- Tokenization and Tokens
- Retrieval-Augmented Generation
- Vector Databases
- Embeddings and Vector Representations
- In-Context Learning
- Model Context Protocol
- Tool Use and Function Calling
- System Prompts
- AI Memory and Personalization
- AI Agents
- AI Coding Agents
- LLM Serving and KV Cache
- Inference and Test-Time Compute
- FlashAttention
- Prompt Injection
- Context Poisoning
- AI Evaluations
- Model Cards and System Cards
- AI Search and Answer Engines
- Data Poisoning
- AI Data Provenance
- AI Audit Trails
- Data Minimization
- AI Governance
- Secure AI System Development
- AI System Inventory
- AI Procurement
- Vendor and Platform Governance
- Cognitive Sovereignty
- Agent Prompt Hardening
- Agent Audit and Incident Review
Sources
- Anthropic Docs, Context windows, reviewed June 23, 2026.
- Anthropic Engineering, Effective context engineering for AI agents, September 29, 2025.
- Anthropic, Prompt caching with Claude, August 14, 2024.
- OpenAI API Docs, Prompt caching, reviewed June 23, 2026.
- OpenAI API Docs, GPT-4.1 nano model page, reviewed June 23, 2026.
- OpenAI API Docs, OpenAI models in Amazon Bedrock: Responses API feature availability, reviewed June 23, 2026.
- Google, Introducing Gemini 1.5, Google's next-generation AI model, February 15, 2024.
- Google, Gemini updates: Flash 1.5, Gemma 2 and Project Astra, May 14, 2024.
- Google AI for Developers, Long context, last updated June 22, 2026; reviewed June 23, 2026.
- Gemini Team, Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv, 2024.
- Nelson F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv, 2023; TACL 2024.
- Lingrui Mei et al., A Survey of Context Engineering for Large Language Models, arXiv, 2025.
- OWASP Gen AI Security Project, LLM01:2025 Prompt Injection, reviewed June 23, 2026.
- MITRE ATLAS Data, ATLAS 2026.05 YAML distribution, modified May 27, 2026; reviewed June 23, 2026.
- NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 26, 2024; updated April 8, 2026.
- NSA, CISA, FBI, and international partners, AI Data Security: Best Practices for Securing Data Used to Train & Operate AI Systems, May 2025.