Context Window

The maximum amount of text — measured in tokens — that an LLM can process in a single inference call, including both input and generated output.

The context window is the hard ceiling on what an LLM can “see” in a single call. In 2026, published limits for widely used models range from roughly 128k to 200k tokens (GPT-4o at 128k, Claude Sonnet 4.x standard at 200k) up to 1M+ tokens (Claude 1M-context tier, GPT-4.1, Gemini 1.5 Pro). One token is roughly three-quarters of an English word, so a 200k context holds about 150,000 words, or a few hundred pages.
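
The arithmetic behind that estimate can be sketched as follows. The 0.75 words-per-token figure is the rule of thumb stated above; the 500 words-per-page figure and the function name are assumptions for illustration only, and real token counts vary by tokenizer and language.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text
WORDS_PER_PAGE = 500    # assumption: a fairly dense page


def tokens_to_pages(tokens: int) -> float:
    """Convert a token count to an approximate page count."""
    words = tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE


print(tokens_to_pages(200_000))    # 300.0
print(tokens_to_pages(1_000_000))  # 1500.0
```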

For agents, the context window is the binding constraint on memory. Everything the agent “knows” in a given step must fit: system prompt, conversation history, retrieved documents, tool definitions, and tool results, plus room for the output it generates. Long-running agents either truncate history aggressively (losing context) or store state in external agent memory and use retrieval (RAG) to pull only the relevant pieces into the window at each step.
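
A minimal sketch of the truncation strategy, assuming a crude word-based token estimate rather than a real tokenizer. The names `estimate_tokens` and `fit_history` are hypothetical and not part of any library; a production agent would count tokens with the tokenizer for its model and would also budget for tool definitions and retrieved documents.

```python
import math


def estimate_tokens(text: str) -> int:
    """Crude estimate from ~0.75 words per token; a real tokenizer will differ."""
    return math.ceil(len(text.split()) / 0.75)


def fit_history(system_prompt: str, history: list[str],
                window: int, reserve_for_output: int) -> list[str]:
    """Keep the newest messages that fit in the remaining token budget."""
    budget = window - reserve_for_output - estimate_tokens(system_prompt)
    kept = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(history):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c", "d e f", "g h i"]  # each is about 4 estimated tokens
print(fit_history("You are helpful.", history, window=20, reserve_for_output=7))
```

With these toy numbers the oldest message is dropped, which is exactly the context loss that external memory with retrieval is meant to avoid.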

Pricing scales with context length: most APIs bill per token of input and output, so even when the model supports 1M tokens, filling the window on every call is usually cost-prohibitive. Production agents minimize context use through prompt engineering, RAG, and selective memory rather than relying on raw window size.
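
To make the cost argument concrete, here is a rough sketch assuming flat per-token input pricing. The price of 3.00 per million input tokens is a hypothetical placeholder, not any provider's actual rate, and the function name is invented for illustration; real pricing may also differ for output tokens, cached input, or very long prompts.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical placeholder price


def input_cost(tokens_per_call: int, calls: int = 1) -> float:
    """Input-side cost of `calls` calls, each sending `tokens_per_call` tokens."""
    return tokens_per_call / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS * calls


# An agent that runs 1,000 steps: full 1M-token window vs. a trimmed 20k-token context.
full_window = input_cost(1_000_000, calls=1_000)
trimmed = input_cost(20_000, calls=1_000)
print(f"full window: {full_window:.2f}, trimmed: {trimmed:.2f}")
```

Under these assumptions the trimmed context costs one fiftieth of the full window for the same number of steps, which is why retrieval and selective memory usually pay for themselves.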

Common confusion: context window ≠ training data size. The context window is a per-call limit on what the model can read and write at inference time. Training data is what the model learned from during pre-training; that knowledge is fixed at the model’s release.