Context window

The maximum number of tokens a model can process in a single request.

The context window is the shared budget for the system prompt, retrieved context, conversation history, user query, and generated output combined. Frontier models offer roughly 200k to 2M tokens, but practical use is bounded by latency, cost, and attention degradation over very long inputs. Architecture decisions (RAG vs. full-context, compaction, summarization) hinge on context window size.
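To make the budget idea concrete, here is a minimal sketch in Python. The `PromptBudget` class, its field names, and all token counts are hypothetical values chosen for illustration. Real counts would come from the model provider's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class PromptBudget:
    """Hypothetical helper: tracks how a context window is spent."""
    context_window: int   # total tokens the model accepts (input + output)
    reserved_output: int  # tokens set aside for the model's reply

    def remaining(self, system: int, retrieved: int, history: int, query: int) -> int:
        """Tokens left after all prompt parts and the reserved output are counted."""
        used = system + retrieved + history + query + self.reserved_output
        return self.context_window - used


# Illustrative numbers only: a 200k window with 4k reserved for output.
budget = PromptBudget(context_window=200_000, reserved_output=4_000)
left = budget.remaining(system=1_500, retrieved=8_000, history=12_000, query=500)
print(left)  # 174000
```

A negative result means the request would not fit, which is the point where compaction, summarization, or retrieval becomes necessary.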

When it matters

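When deciding what to put in the prompt. Bigger context windows are not always better: irrelevant context can degrade quality, and models tend to use information placed in the middle of a long input less reliably than information near the start or end (Liu et al., 2023, 'Lost in the Middle').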

Real example

A long-context summarization task: feeding the full 100k-token document to Claude vs. feeding the top-20 retrieved chunks (8k tokens). The retrieved version scored 15% better on factual accuracy because the model focused on relevant passages instead of being distracted by noise.
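One way to build the retrieved variant is to pack ranked chunks into a fixed token budget. The sketch below is illustrative, not the pipeline used in the example above: `pack_chunks` is a hypothetical helper, the chunks are assumed to arrive already ranked by relevance, and the whitespace word count is a rough stand-in for a real tokenizer.

```python
from typing import Callable, List, Tuple


def pack_chunks(
    ranked_chunks: List[str],
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> Tuple[List[str], int]:
    """Greedily keep the highest-ranked chunks that fit within token_budget."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue  # skip this chunk; a smaller, lower-ranked one may still fit
        selected.append(chunk)
        used += cost
    return selected, used


def rough_count(text: str) -> int:
    """Assumption: whitespace-separated words approximate tokens."""
    return len(text.split())


chunks = ["a b c", "d e f g h", "i j"]  # toy data, ranked best first
picked, used_tokens = pack_chunks(chunks, token_budget=5, count_tokens=rough_count)
print(picked, used_tokens)  # ['a b c', 'i j'] 5
```

Skipping an oversized chunk instead of stopping trades strict rank order for fuller use of the budget. Stopping at the first chunk that does not fit is an equally reasonable choice.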

KPIs to watch

Context utilization rate (% of context that materially affects output), retrieval recall compared against a full-context baseline (retrieval often wins above 50k tokens), and cost per call (roughly linear in context size under per-token pricing).
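Two of these KPIs are simple to compute. The sketch below assumes per-token pricing. The function names, the prices, and the chunk IDs are hypothetical placeholders, not any provider's actual rates or data.

```python
from typing import Set


def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Cost in USD under per-token pricing; linear in both token counts."""
    total = input_tokens * usd_per_million_input + output_tokens * usd_per_million_output
    return total / 1_000_000


def retrieval_recall(retrieved_ids: Set[str], relevant_ids: Set[str]) -> float:
    """Fraction of the relevant chunks that the retriever actually returned."""
    if not relevant_ids:
        return 0.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
full_context = cost_per_call(100_000, 1_000, 3.0, 15.0)
retrieved = cost_per_call(8_000, 1_000, 3.0, 15.0)
print(f"full-context: ${full_context:.3f}, retrieved: ${retrieved:.3f}")

# Toy IDs: 2 of the 3 relevant chunks were retrieved.
recall = retrieval_recall({"c1", "c2", "c7"}, {"c1", "c2", "c3"})
print(f"recall: {recall:.2f}")
```

Under these assumed prices the full-context call costs several times more than the retrieved one, so the cost comparison is only meaningful alongside a quality metric such as recall or factual accuracy.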

Related terms

LLM (Large Language Model)

A transformer-based model trained on language data to predict and generate text.

RAG (Retrieval-Augmented Generation)

Generation grounded in retrieved source documents rather than the model's parametric memory alone.

Frontier model

The leading-edge foundation models with the highest reasoning, coding, and multimodal capabilities.

Foundation model

A large model pre-trained on broad data, then adapted to many downstream tasks.
