
🌱 Beginner β€” AI Fundamentals

Chapter 13 of 24

πŸ” Chapter 13: Vector Search (How RAG Works)

Retrieval Augmented Generation pipeline

RAG (Retrieval Augmented Generation) works like this: when a user asks a question, you first retrieve relevant pieces from your own data (docs, knowledge base), then pass those pieces along with the question to the LLM so it can generate an answer grounded in that context. Without RAG, the model only knows its training data; with RAG, it can use up-to-date or private documents. The usual implementation: chunk your documents, embed each chunk, and store the embeddings in a vector DB. At query time, embed the question, run a similarity search (top-k), and send the retrieved chunks plus the question to the LLM.

RAG pipeline

  1. Convert documents β†’ embeddings
  2. Store in vector DB
  3. Convert user query β†’ embedding
  4. Find closest vectors
  5. Send top matches to LLM
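
The five steps above can be sketched end-to-end in a few lines of Python. This is a minimal illustration, not a real system: the bag-of-words `embed()` function, the tiny vocabulary, and the sample documents below are all made-up stand-ins for a real embedding model and document set, and the "vector DB" is just an in-memory list.

```python
import math
import re

# Toy embedding: bag-of-words counts over a tiny fixed vocabulary.
# The vocabulary, documents, and embed() itself are made up for
# illustration — a real pipeline calls an embedding model instead.
VOCAB = ["password", "reset", "email", "billing", "invoice", "login"]

def embed(text: str) -> list[float]:
    words = re.findall(r"[a-z]+", text.lower())
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]               # unit-length vector

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))      # dot product; fine since vectors are normalized

# Steps 1-2: embed each document chunk and store it
# (the "vector DB" here is just an in-memory list).
docs = [
    "how to reset your password via email",
    "understanding your billing invoice",
    "troubleshooting login problems",
]
index = [(doc, embed(doc)) for doc in docs]

# Steps 3-4: embed the user query and find the closest vectors (top-2).
query = "I forgot my password, how do I reset it?"
q_vec = embed(query)
top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]

# Step 5: send the top matches plus the question to the LLM.
prompt = "Context: " + " ".join(chunk for chunk, _ in top) + f" Question: {query}"
print(prompt)
```

The password-reset article scores highest because it shares the words "password" and "reset" with the query; a real embedding model would also match paraphrases that share no words at all.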

Example: RAG in practice

You have 10,000 support articles. User asks: "How do I reset my password?" You embed the question, search the vector DB for the 5 closest article chunks (e.g. by cosine similarity), and prompt the LLM: "Context: [chunk1] [chunk2] … Question: How do I reset my password?" The model answers using only that context, so it stays on your content and is far less likely to hallucinate unrelated steps.
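
The retrieval step in this example can be sketched with NumPy. The shapes below (10,000 chunks, 384 dimensions) are illustrative, and random vectors stand in for real embeddings; the point is the mechanics: normalize, score every chunk by cosine similarity with one matrix product, take the top 5.

```python
import numpy as np

# Stand-in data: 10,000 chunk embeddings and one query embedding.
# Random vectors replace real model outputs for this sketch.
rng = np.random.default_rng(0)
chunk_embeddings = rng.normal(size=(10_000, 384))
query_embedding = rng.normal(size=384)

# Normalize so the dot product equals cosine similarity.
chunks_n = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
query_n = query_embedding / np.linalg.norm(query_embedding)

scores = chunks_n @ query_n            # one cosine score per chunk
top5 = np.argsort(scores)[::-1][:5]    # indices of the 5 closest chunks
print(top5, scores[top5])
```

In a real system you would use `top5` to look up the matching article chunks and paste their text into the prompt, as in the example above.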

The problem: speed

If you have 5 million chunks, comparing the query embedding to every chunk (brute force) is too slow. You need approximate nearest neighbor (ANN) search, e.g. HNSW, and a vector DB so each query takes roughly O(log n) work instead of O(n).
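
A back-of-envelope comparison shows why this matters. The HNSW figure below is a rough heuristic, not an exact operation count: it assumes the search examines on the order of `ef` candidates across roughly log₂(n) graph layers, where `ef` is a typical tunable search-breadth setting.

```python
import math

# Rough work-per-query estimate: brute force vs. graph-based ANN.
n = 5_000_000   # number of stored chunk embeddings
ef = 100        # HNSW search breadth (illustrative value)

brute_force_ops = n                    # compare the query to every vector
hnsw_ops = int(ef * math.log2(n))      # ~ef candidates over ~log2(n) layers (heuristic)

print(f"brute force: {brute_force_ops:,} distance computations")
print(f"HNSW (rough): {hnsw_ops:,} distance computations")
```

Even with this crude estimate, the ANN query touches thousands of vectors instead of millions, which is the difference between milliseconds and seconds per query.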