AEO Canon · the reference for answer-engine optimization

What Is a Context Window?


Burke Atkerson · 2 min read

A context window is the maximum amount of text — measured in tokens — that an AI model can consider at once, including your prompt, any retrieved sources, the conversation so far, and the answer it's writing. It is the model's short-term working memory, and anything that doesn't fit is, for that request, invisible.

The context window · a fixed token budget

Everything the model considers at once — the question, the retrieved passages, and room for the answer — shares one fixed window. Add passages or make them longer, and some stop fitting.

[Diagram: one fixed window divided into a system segment (sys), six retrieved passages (p1 to p6), and answer headroom.]

In this example, 6 of 6 passages fit with room to spare (~2,560 of 8,000 tokens used). Tight, self-contained passages win the limited space, which is the budget reason to write answer-first.

What does the context window contain?

The context window contains everything the model needs to read for a single response — and it all shares one token budget. That includes the hidden system instructions, the conversation history, your current prompt, any documents retrieved to ground the answer, and the answer being generated token by token. If the total exceeds the window, something has to give.
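The shared budget can be made concrete with a little arithmetic. The sketch below is illustrative only: the class name, field names, and every token count are assumptions chosen for the example, not figures from any particular model.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token counts for each part of one request (all values are illustrative)."""
    window: int          # total tokens the model can consider at once
    system: int          # hidden system instructions
    history: int         # conversation so far
    prompt: int          # the current question
    retrieved: int       # documents pulled in to ground the answer
    answer_reserve: int  # room kept free for the generated answer

    def used(self) -> int:
        # Everything shares the same budget, so the parts simply add up.
        return (self.system + self.history + self.prompt
                + self.retrieved + self.answer_reserve)

    def remaining(self) -> int:
        # A negative value means the request does not fit.
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


budget = ContextBudget(window=8_000, system=300, history=1_200,
                       prompt=60, retrieved=5_000, answer_reserve=1_000)
print(budget.used(), budget.remaining(), budget.fits())  # 7560 440 True
```

Raising `retrieved` to 6,000 in this example pushes the total to 8,560 tokens, so the request no longer fits and something has to be trimmed.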

Because the window is measured in tokens, tokenization directly affects how much fits: verbose or unusual text consumes the budget faster. And because the model attends across everything in the window at once (see how LLMs work), what you put in it — and what you leave out — shapes the answer.

How big are context windows in 2026?

Context windows in 2026 span a wide range, from a few hundred thousand tokens to around ten million. Many frontier models offer roughly 200,000 to 1,000,000 tokens; some long-context models advertise far more. Meta's Llama 4 Scout, for example, markets a context window of around 10 million tokens. (Capacities change often; always check the model provider's current documentation.)

Bigger isn't automatically better

A large advertised window doesn't mean the model uses every position equally well. Models often attend most reliably to the beginning and end of a long context and can overlook details buried in the middle. More room helps — but placing the right passage, not just more text, is what drives a good answer.
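One common way to act on this is to reorder retrieved passages so the strongest ones sit at the edges of the context. The function below is a minimal sketch of that heuristic, assuming the passages arrive already ranked from most to least relevant. The name `place_at_edges` is hypothetical, and whether the reordering helps depends on the model.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Reorder passages (best first) so the strongest sit at the start and end.

    Passages are dealt alternately to the front and the back, which leaves
    the weakest ones in the middle of the returned list.
    """
    front: list[str] = []
    back: list[str] = []
    for i, passage in enumerate(ranked):
        if i % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    # Reverse the back half so the second-best passage ends up last.
    return front + back[::-1]


ranked = ["p1", "p2", "p3", "p4", "p5"]  # p1 is the most relevant
print(place_at_edges(ranked))  # ['p1', 'p3', 'p5', 'p4', 'p2']
```

Here the best passage opens the context, the second best closes it, and the least relevant one (p5) lands in the middle, where an overlooked detail costs the least.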

What happens when text exceeds the window?

When text exceeds the window, the model simply cannot see the overflow — so systems decide what to keep. They truncate, drop older turns, or summarize history to make room. The consequence is concrete: a fact outside the window has zero chance of influencing the answer, no matter how relevant it is. This is why long documents are split into passages and only the most relevant chunks are pulled in, a problem studied directly in work like "Passage Segmentation of Documents for Extractive Question Answering" (arXiv 2501.09940), which examines how to break documents into passages that answer questions well.
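Dropping older turns is the simplest of these strategies to sketch. The example below assumes a crude token estimate of about four characters per token, which is a stand-in for a real tokenizer. Both function names are hypothetical, and production systems may combine this with truncation or summarization.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: about one token per four characters.
    return max(1, len(text) // 4)


def drop_oldest_turns(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose estimated tokens fit within budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 400, "b" * 200, "c" * 120, "d" * 80]  # oldest first
print([t[0] for t in drop_oldest_turns(history, budget=100)])  # ['b', 'c', 'd']
```

The oldest turn is the one that falls outside the budget, so whatever it said can no longer influence the answer.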

Does a big window remove the need for retrieval?

A big context window does not remove the need for retrieval. Even when you could paste an entire corpus into a huge window, it's usually a bad idea: it's slow, expensive (you pay per token), and it dilutes the model's attention across mostly irrelevant text. Selecting the few most relevant passages with retrieval-augmented generation typically produces more accurate answers at a fraction of the cost. Window size and retrieval are complementary, not substitutes.
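A rough cost comparison shows why. Every number below is hypothetical (the price, the corpus size, the passage size, and the number of passages retrieved), and the sketch counts input tokens only. The point is the ratio, not the absolute figures.

```python
def request_cost(input_tokens: int, price_per_million: float) -> float:
    """Input cost of one request when billing is per token."""
    return input_tokens / 1_000_000 * price_per_million


PRICE_PER_MILLION = 2.0  # hypothetical price in dollars per million input tokens
CORPUS_TOKENS = 800_000  # hypothetical: the whole corpus pasted into the window
PASSAGE_TOKENS = 400     # hypothetical size of one retrieved passage
TOP_K = 5                # hypothetical number of passages retrieval selects

full = request_cost(CORPUS_TOKENS, PRICE_PER_MILLION)
retrieved = request_cost(TOP_K * PASSAGE_TOKENS, PRICE_PER_MILLION)
print(f"full corpus: ${full:.4f}  retrieval: ${retrieved:.4f}  "
      f"ratio: {full / retrieved:.0f}x")
```

Under these assumptions the full-corpus request sends 800,000 input tokens and the retrieval request sends 2,000, a 400-fold difference on every question asked.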

Why does the context window matter for AEO?

The context window matters for AEO because your content competes for a scarce, finite space. When an engine retrieves sources to answer a question, only a handful of passages make it into the window — and a self-contained, answer-first passage earns its place more easily than a long, rambling one that would crowd out everything else. Writing tight, liftable passages isn't only about readability; it's about fitting into the budget where the answer is actually formed. That's the context-window view of extractability and a core idea in what is AEO.
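The crowding-out effect can be illustrated with a toy packing model. How real answer engines select passages is not public, so the greedy rule, the `pack_passages` name, and all token counts here are assumptions made purely for illustration.

```python
def pack_passages(passages: list[tuple[str, int]], budget: int) -> list[str]:
    """Greedily admit passages in ranked order while they fit the token budget.

    Each passage is (name, token_count); a passage that does not fit is skipped.
    """
    admitted: list[str] = []
    used = 0
    for name, tokens in passages:
        if used + tokens <= budget:
            admitted.append(name)
            used += tokens
    return admitted


BUDGET = 2_000  # hypothetical space left for retrieved passages

tight = [("a", 350), ("b", 400), ("c", 300), ("d", 450), ("e", 380)]
rambling = [("a", 1_500), ("b", 400), ("c", 300), ("d", 450), ("e", 380)]

print(pack_passages(tight, BUDGET))     # ['a', 'b', 'c', 'd', 'e']
print(pack_passages(rambling, BUDGET))  # ['a', 'b']
```

When passage "a" grows from 350 to 1,500 tokens, three other passages no longer fit. In this toy model, the long passage takes its own place and crowds out most of the others.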

Next: see "what is RAG" for how passages are chosen to fill the window, or "what is grounding" for what they're used to do.

Frequently asked questions

What is a context window in an LLM?
The context window is the maximum number of tokens a model can process in a single request — its short-term working memory. It must hold everything at once — the system instructions, your prompt, any retrieved documents, the conversation history, and the answer being generated. Anything beyond the window is invisible to the model.
How big are context windows in 2026?
They range widely. Many frontier models offer windows of roughly 200,000 to 1 million tokens, and some long-context models (such as Llama 4 Scout) advertise up to about 10 million tokens. Bigger windows let a model consider more text at once, but don't guarantee it uses every part equally well.
What happens when you exceed the context window?
The model can't see the overflow. Systems handle it by truncating, dropping older messages, or summarizing — which means details outside the window are simply ignored. This is why long documents are split into passages and only the most relevant chunks are retrieved into the window.
Does a bigger context window replace retrieval?
No. Even with a large window, feeding an entire corpus every time is slow, costly, and dilutes attention. Retrieval (RAG) selects the few most relevant passages to place in the window, which is usually more accurate and far cheaper than relying on size alone.

