
Chapter 7 of 24

💾 Chapter 7: Prompt Caching

Reuse cached prefix tokens to cut cost and latency on long prompts

Prompt caching (supported by some providers) lets the API reuse a processed prefix of the prompt across requests instead of reprocessing it every time. You send a long prefix once (e.g. system message + large doc); the API caches it. Later requests that reuse the same prefix only process the new part (e.g. the latest user message) at full price; cached tokens are typically billed at a steep discount or not at all. That cuts cost and latency when you have many turns over the same context. Not all APIs support it; when they do, you often mark which part of the prompt is cacheable.

Prompt caching: reuse the prefix

Without caching

Request 1: [system + long doc] → full cost
Request 2: [system + long doc] → full cost again

Same prefix re-sent and re-processed every time.

With caching

Request 1: [system + long doc] → process + cache prefix
Request 2: [cached prefix] + new user message → only new tokens paid

Prefix (e.g. system + doc) stored; you pay less and get lower latency for follow-up turns.
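The flow above can be simulated with a toy model: a provider-side store keyed by a hash of the prefix, where a cache hit means only the new tokens are processed. This is an illustration of the idea only, not a real provider API; the `PrefixCache` class and word-count "tokenizer" are invented for the sketch.

```python
import hashlib

class PrefixCache:
    """Toy model of provider-side prompt caching (illustration only)."""

    def __init__(self):
        self._store = {}  # prefix hash -> token count already processed

    def process(self, prefix: str, new_part: str) -> int:
        """Return how many tokens are processed at the full input rate.

        Tokens are approximated as whitespace-split words for simplicity.
        """
        key = hashlib.sha256(prefix.encode()).hexdigest()
        prefix_tokens = len(prefix.split())
        new_tokens = len(new_part.split())
        if key in self._store:
            return new_tokens  # cache hit: only the new part is processed
        self._store[key] = prefix_tokens
        return prefix_tokens + new_tokens  # cache miss: full prompt processed

cache = PrefixCache()
doc = "system prompt " + "word " * 1000  # long shared prefix

first = cache.process(doc, "What is the refund policy?")   # miss: 1007 tokens
second = cache.process(doc, "And the shipping time?")      # hit: 4 tokens
print(first, second)
```

Note that the prefix must match exactly: change one character in the shared document and the hash (and the real provider's cache lookup) misses, so the whole prefix is reprocessed.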

Supported by several providers (e.g. Anthropic, OpenAI, Google's Gemini/Vertex AI). Some cache matching prefixes automatically; others require you to mark the cacheable prefix explicitly. Best when the same long context is used across many requests.
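With Anthropic's Messages API, for example, you mark the end of the cacheable prefix with a `cache_control` block. A sketch of the request body is below; the model name and document text are placeholders, and minimum cacheable sizes and cache lifetimes vary, so check the provider's docs.

```python
# Sketch of an Anthropic-style Messages API request body that marks the
# long prefix as cacheable. Model name and doc text are placeholders.
long_doc = "(full product manual text goes here)"

request_body = {
    "model": "claude-sonnet-example",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a support agent for Acme."},
        {
            "type": "text",
            "text": long_doc,
            # Everything up to and including this block is cached;
            # later requests with an identical prefix get a cache hit.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        {"role": "user", "content": "What is the warranty period?"},
    ],
}
```

On follow-up turns you resend the same `system` blocks unchanged and append the new messages; the provider detects the matching prefix and serves it from cache.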

Example: Typical use

Long system prompt + 100-page doc as context for a support agent. First turn: full cost. Next 10 turns: only the new user/assistant messages are processed at full price; the prefix is served from cache, usually at a steep discount. Ideal for chat-over-docs or agents with a fixed, large context.
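A back-of-envelope calculation makes the savings concrete. The numbers below are illustrative assumptions, not provider pricing: a 50,000-token prefix, 200 new tokens per turn, 11 turns, and cached tokens billed at 10% of the base input rate (real multipliers and cache lifetimes vary by provider).

```python
# Illustrative cost comparison for the support-agent example above.
PREFIX = 50_000        # tokens in system prompt + 100-page doc (assumed)
NEW_PER_TURN = 200     # new tokens per turn (assumed)
TURNS = 11             # first turn + 10 follow-ups

# Without caching: the full prefix is reprocessed every turn.
without_cache = TURNS * (PREFIX + NEW_PER_TURN)

# With caching: full price once, then cached prefix at 10% of base rate.
cached_prefix_cost = PREFIX * 10 // 100
with_cache = (PREFIX + NEW_PER_TURN) + (TURNS - 1) * (cached_prefix_cost + NEW_PER_TURN)

print(without_cache)  # 552200 token-equivalents billed
print(with_cache)     # 102200 token-equivalents billed, ~81% less
```

Even with cached reads billed at a tenth of the base rate rather than free, the long-prefix cost nearly vanishes after the first turn.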