Context is everything the model can "see" when generating the next token: your system prompt, the current user message, and (in chat) the recent conversation history and any injected tool results. The context window is the hard limit on how many tokens that context can contain. If your prompt plus history exceeds it, something must give: older messages are dropped or summarized, or the request fails. So in long threads, or with very long documents, the model can "forget" the start unless you summarize or chunk.
Context window (fixed size)
The model only "sees" a limited number of tokens. Like a bucket that can hold only so much.
Real models: 4K–128K+ tokens. Prompt + conversation history must fit inside the window.
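A minimal sketch of fitting history into a fixed window, assuming a crude word-count token estimate (real tokenizers give exact counts) and a hypothetical message format with a `content` field. Messages are kept newest-first until the budget runs out, which is the "older messages are dropped" strategy described above.

```python
# Sketch: keep the newest messages that fit a fixed token budget.
# Token counts are approximated as whitespace-separated words here;
# a real tokenizer would give exact counts.

def estimate_tokens(text):
    return len(text.split())

def trim_history(messages, window=8000, reserved=500):
    """Drop the oldest messages until the rest fit.
    `reserved` leaves room for the model's reply."""
    budget = window - reserved
    kept = []
    used = 0
    for msg in reversed(messages):  # walk newest -> oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Summarizing the dropped messages instead of discarding them is the other common strategy; the trade-off is an extra model call per trim.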
Example: Context in practice
You paste a 50-page doc and ask "What's the main conclusion?" If the doc is 30K tokens and the model's window is 8K, the model never sees the full doc: only the first 8K (or whatever fits after your question). So you either use a model with a larger window, or you chunk the doc and summarize each chunk first, then ask the model to synthesize the summaries.
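The chunk-then-synthesize workflow above can be sketched as a map-reduce over the document. This is a sketch under assumptions: `summarize` is a hypothetical stand-in that just truncates so the example runs end to end; a real version would call the model with a summarization prompt.

```python
# Sketch: split a doc that exceeds the window into chunks, summarize
# each chunk, then synthesize the partial summaries into one answer.

def chunk(text, max_tokens=2000):
    """Split into word-count-bounded chunks (crude token estimate)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def summarize(text, limit=50):
    # Placeholder for a model call; here it just keeps the first
    # `limit` words so the sketch is runnable.
    return " ".join(text.split()[:limit])

def map_reduce_summary(doc):
    partials = [summarize(c) for c in chunk(doc)]      # map step
    return summarize(" ".join(partials))               # reduce step
```

The design point: each model call only ever sees one chunk (or the concatenated partial summaries), so no single request exceeds the window even though the full document does.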
Why it matters