Context is everything the model can "see" when generating the next token: your system prompt, the current user message, and (in chat) the recent conversation history and any injected tool results. The context window is the hard limit on how many tokens that context can contain. If your prompt plus history exceeds it, something must give: older messages are dropped or summarized, or the request fails. So in long threads, or with very long documents, the model can "forget" the start unless you summarize or chunk.
Context window (fixed size)
The model only "sees" a limited number of tokens. Like a bucket that can hold only so much.
Real models: 4K–128K+ tokens. Prompt + conversation history must fit inside the window.
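A minimal sketch of fitting history into a fixed window, assuming a crude word-count token estimate (real tokenizers give exact counts) and a hypothetical message format with a `content` field. Messages are kept newest-first until the budget runs out, which is the "older messages are dropped" strategy described above.

```python
# Sketch: keep the newest messages that fit a fixed token budget.
# Token counts are approximated as whitespace-separated words here;
# a real tokenizer would give exact counts.

def estimate_tokens(text):
    return len(text.split())

def trim_history(messages, window=8000, reserved=500):
    """Drop the oldest messages until the rest fit.
    `reserved` leaves room for the model's reply."""
    budget = window - reserved
    kept = []
    used = 0
    for msg in reversed(messages):  # walk newest -> oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Summarizing the dropped messages instead of discarding them is the other common strategy; the trade-off is an extra model call per trim.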
Example: Context in practice
You paste a 50-page doc and ask "What's the main conclusion?" If the doc is 30K tokens and the model's window is 8K, the model never sees the full doc: only the first 8K (or whatever fits after your question). So you either use a model with a larger window, or you chunk the doc and summarize each chunk first, then ask the model to synthesize the summaries.
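The chunk-then-synthesize workflow above can be sketched as a map-reduce over the document. This is a sketch under assumptions: `summarize` is a hypothetical stand-in that just truncates so the example runs end to end; a real version would call the model with a summarization prompt.

```python
# Sketch: split a doc that exceeds the window into chunks, summarize
# each chunk, then synthesize the partial summaries into one answer.

def chunk(text, max_tokens=2000):
    """Split into word-count-bounded chunks (crude token estimate)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def summarize(text, limit=50):
    # Placeholder for a model call; here it just keeps the first
    # `limit` words so the sketch is runnable.
    return " ".join(text.split()[:limit])

def map_reduce_summary(doc):
    partials = [summarize(c) for c in chunk(doc)]      # map step
    return summarize(" ".join(partials))               # reduce step
```

The design point: each model call only ever sees one chunk (or the concatenated partial summaries), so no single request exceeds the window even though the full document does.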
Why it matters