Agents need memory to be useful across turns. Short-term memory is the current context window: the last N messages, the latest tool calls and results. Everything in the prompt that the model can "see" is short-term; when the window is full, older content is dropped or summarized. Long-term memory is persistent storage: a vector DB of past facts, user preferences, or conversation summaries. The agent doesn't read the whole store every time — it retrieves relevant pieces (e.g. via similarity search) and injects them into the context. So the model gets "you prefer Celsius" or "last time we discussed project X" only when relevant.
How memory feeds into the conversation
Short-term = current chat. Long-term = retrieved (e.g. from vector DB) and injected only when relevant.
Short-term (context)
What's in the current conversation: last N messages, latest tool results. Lost when the session ends or the window is trimmed.
Example: "You asked for London weather; the result was 12°C. You then asked if you need a jacket — the model uses that to say yes."
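The "last N messages" behavior above can be sketched as a sliding window over the conversation. This is a minimal illustration (class and method names are invented, not from any particular framework): older messages fall out automatically once the window is full.

```python
from collections import deque

class ShortTermMemory:
    """Sliding window over the conversation: keeps only the last N messages."""
    def __init__(self, max_messages: int = 4):
        # deque with maxlen drops the oldest entry when a new one is appended
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def as_prompt(self) -> list:
        # what the model "sees" on the next turn
        return list(self.messages)

stm = ShortTermMemory(max_messages=4)
stm.add("user", "What's the weather in London?")
stm.add("assistant", "It's 12°C and cloudy.")
stm.add("user", "Do I need a jacket?")
# The 12°C result is still inside the window, so the model can answer "yes".
```

Once a fifth message arrives, the first one is silently trimmed — exactly the "lost when the window is trimmed" behavior described above.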
Long-term
Stored across sessions: vector DB of past facts, user preferences, or conversation summaries. The agent retrieves relevant bits (e.g. via RAG) and adds them to context.
Example: "User prefers Celsius and lives in Berlin" stored; when they ask "weather today" the agent can default to Berlin and Celsius.
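The store-then-retrieve flow can be sketched as follows. This is a toy stand-in: a real agent would embed facts and queries with an embedding model and rank by vector similarity in a vector DB; here, shared topic tags play the role of "relevant", and all names are illustrative.

```python
class LongTermMemory:
    """Toy persistent store: tag overlap stands in for embedding similarity."""
    def __init__(self):
        self.facts = []  # list of (fact, tags) pairs

    def store(self, fact: str, tags: set) -> None:
        self.facts.append((fact, tags))

    def retrieve(self, topics: set) -> list:
        # Real systems rank by similarity score; here any shared tag counts as relevant.
        return [fact for fact, tags in self.facts if tags & topics]

ltm = LongTermMemory()
ltm.store("User prefers Celsius", {"weather", "units"})
ltm.store("User lives in Berlin", {"weather", "location"})
ltm.store("Last session covered the project X roadmap", {"work"})

# Incoming question "weather today" is classified as a weather topic,
# so only the weather-related facts are injected into the context.
relevant = ltm.retrieve({"weather"})
system_prompt = "Known about the user: " + "; ".join(relevant)
```

Only the retrieved facts reach the prompt — the project X note stays in storage, which is the point: the model gets "prefers Celsius, lives in Berlin" exactly when the question makes them relevant.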
In practice