The problem RAG solves
Language models are trained on a snapshot of the world. Once training ends, the model's knowledge is frozen. Ask it about something that happened after its cutoff, and it will either say it doesn't know or, worse, confidently invent a plausible-sounding answer.
Even for things within its training data, a model may get details wrong - not because it's broken, but because the information was represented inconsistently across millions of training documents and the model learned a blurry average.
RAG solves this by separating what the model knows from what the model can look up. Instead of baking all possible knowledge into model weights, you keep a searchable knowledge base external to the model and retrieve from it dynamically at inference time.
The architecture in plain terms
A RAG pipeline has three stages:
Indexing (happens once, offline): Take all your source documents - PDFs, wiki pages, Slack messages, whatever - split them into chunks, embed each chunk into a vector using an embedding model, and store the vectors in a vector database alongside the original text.
Retrieval (happens per query): When a user asks a question, embed the question using the same embedding model, search the vector database for the most semantically similar chunks, and return the top-k results.
Generation (happens per query): Build a prompt that includes the retrieved chunks as context ("Based on the following documents: [...]") and the original user question. Send that prompt to the LLM. The model generates an answer grounded in the retrieved evidence.
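The three stages above can be sketched end to end. This is a toy illustration, not any particular library's API: the "embedding" is a bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and the names (`embed`, `retrieve`, `build_prompt`) and sample documents are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts. A real system would use a
    # trained embedding model here instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# --- Indexing (once, offline): chunk, embed, store ---
documents = [
    "Refunds are processed within 5 business days.",
    "Our support team is available 9am-5pm on weekdays.",
    "Premium plans include priority email support.",
]
index = [(doc, embed(doc)) for doc in documents]

# --- Retrieval (per query): embed the question with the SAME model,
# return the top-k most similar chunks ---
def retrieve(question, k=2):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# --- Generation (per query): build a prompt grounding the LLM in the
# retrieved chunks; the actual LLM call is omitted ---
def build_prompt(question):
    context = "\n".join(f"- {c}" for c in retrieve(question))
    return f"Based on the following documents:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?"))
```

The key detail is that the question and the documents pass through the same `embed` function; mixing embedding models between indexing and retrieval silently breaks the similarity search.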
What RAG is good at
RAG excels at closed-domain question answering - answering questions about a specific, bounded corpus. Customer support bots, internal knowledge bases, documentation search, legal research assistants. Anywhere you have a reasonably organized body of text and need to answer natural-language questions about it.
It's also the standard solution for keeping AI systems up to date. Instead of retraining a model every time your product documentation changes, you just update the index.
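Keeping the system current then reduces to an upsert against the index, with no model in the loop. A minimal sketch, assuming a dict keyed by document ID and a placeholder `embed` function (both invented for illustration):

```python
# When a source document changes, re-embed just that document and
# overwrite its entry -- no retraining, no touching the model weights.
index = {}  # doc_id -> (text, vector)

def embed(text):
    # Placeholder embedding for illustration only.
    return [len(text), text.count(" ")]

def upsert(doc_id, text):
    index[doc_id] = (text, embed(text))

upsert("refund-policy", "Refunds take 10 business days.")
upsert("refund-policy", "Refunds take 5 business days.")  # doc changed: re-embed, overwrite
```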
What RAG is not good at
RAG struggles when the answer to a question can't be found in any single chunk of text. Complex reasoning tasks that require synthesizing information across many documents, performing multi-step calculations, or aggregating over the whole corpus ("how many support tickets mentioned refunds last month?") push the limits of naive RAG.
There's also the retrieval quality problem. If the retrieval step returns irrelevant chunks, the LLM will generate a confident-sounding answer based on wrong context. Garbage in, garbage out, except the output looks polished.
Advanced RAG techniques - re-ranking, HyDE (Hypothetical Document Embeddings), multi-hop retrieval - address these failure modes, at the cost of added latency and complexity.
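Re-ranking, for instance, runs a second, more expensive scorer over the retriever's candidates before they reach the prompt. The shape of that step is simple; in practice the scorer is usually a cross-encoder model, and the word-overlap stand-in below is purely illustrative:

```python
def rerank(question, candidates, scorer):
    # Re-order the retriever's candidate chunks by a second-stage
    # scorer. `scorer` is a pluggable callable; real systems typically
    # use a cross-encoder model here.
    return sorted(candidates, key=lambda c: scorer(question, c), reverse=True)

# Stand-in scorer for illustration: exact word overlap (NOT a real cross-encoder).
def overlap_score(question, chunk):
    q, c = set(question.lower().split()), set(chunk.lower().split())
    return len(q & c)

candidates = [
    "Premium plans include priority email support.",
    "Refunds are processed within 5 business days.",
]
print(rerank("how are refunds processed", candidates, overlap_score))
```

This is where the latency cost comes from: the first-stage retriever scores every chunk cheaply, while the re-ranker spends real compute on just the short list it hands back.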
Structural verdict
RAG earns a structural rating because it's a real, working architectural pattern that solves a genuine problem. It doesn't bear load on its own - it requires a vector database, an embedding model, and an LLM to do anything at all. But when assembled correctly, the structure holds weight. It's the scaffolding that makes AI useful in domains where accuracy matters.