RAG
18 Jun 2026
How RAG actually works (with code)
Retrieval-Augmented Generation grounds an LLM in your own data. Here is the core loop in four steps, plus a tiny Python retriever.
RAG (Retrieval-Augmented Generation) grounds an LLM in your own data instead of relying only on what it memorized during training.
The core idea
- Embed your documents into vectors.
- Retrieve the most similar chunks for a query.
- Augment the prompt with those chunks.
- Generate an answer grounded in them.
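The four steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, `rag_answer` and `DIM` are names invented here, and `generate` is whatever function you use to call your LLM.

```python
import hashlib
import numpy as np

DIM = 256  # hypothetical vector size for the toy embedder


def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, unit length."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        # Stable hash so the same word always lands in the same bucket.
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


def rag_answer(query, texts, generate, k=2):
    # 1. Embed: one vector per document chunk.
    doc_vecs = np.stack([toy_embed(t) for t in texts])
    # 2. Retrieve: vectors are unit length, so a dot product is the cosine similarity.
    scores = doc_vecs @ toy_embed(query)
    top = np.argsort(scores)[::-1][:k]
    # 3. Augment: put the retrieved chunks ahead of the question.
    context = "\n".join(f"- {texts[i]}" for i in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # 4. Generate: hand the augmented prompt to the model.
    return generate(prompt)


if __name__ == "__main__":
    texts = [
        "Cats sleep most of the day.",
        "The invoice is due in 30 days.",
        "Panthers hunt at night.",
    ]
    # Echo the prompt instead of calling a model, to show what the LLM would see.
    print(rag_answer("when is the invoice due", texts, generate=lambda p: p, k=1))
```

Passing `generate=lambda p: p` returns the prompt itself, which is a handy way to inspect what gets retrieved before wiring in a real model.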
Similarity is usually cosine similarity between embeddings:
$$ \text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} $$
Here's a minimal retrieval loop in Python:
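import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_vec, docs, k=3):
    # Score every document against the query, then keep the k highest scores.
    scored = [(cosine(query_vec, d["vec"]), d) for d in docs]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [d for _, d in scored[:k]]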
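When the document vectors are stacked into one matrix, the same ranking can be computed with a single matrix product instead of a Python loop. A minimal sketch, assuming every row is non-zero; `retrieve_batch` and `doc_matrix` are names made up for this example:

```python
import numpy as np


def retrieve_batch(query_vec, doc_matrix, k=3):
    """Return the indices and scores of the k rows most similar to query_vec."""
    # Normalize rows and the query so a plain dot product equals cosine similarity.
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit
    # argsort is ascending, so reverse it and keep the first k indices.
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]


if __name__ == "__main__":
    docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
    idx, sims = retrieve_batch(np.array([1.0, 0.0]), docs, k=2)
    print(idx, sims)
```

Normalizing once up front also means the document norms are not recomputed for every query, which matters as the collection grows.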
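The retrieved chunks are placed in the prompt, inside the context window, before generation. The mechanism is simple, but it lets the model answer from your documents rather than only from what it memorized during training.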
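Placing chunks in the context window usually means fitting them into a size budget. The sketch below is one way to do that, with several assumptions: each retrieved document carries a `"text"` field (the retriever only requires `"vec"`), the chunks arrive best-first, and the budget is counted in characters as a rough stand-in for tokens. `build_prompt` and `max_chars` are hypothetical names.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Join retrieved chunks into a prompt, keeping the context under max_chars."""
    picked, used = [], 0
    for chunk in chunks:  # assumed to be ordered best-first
        text = chunk["text"]  # assumed field holding the chunk's raw text
        if used + len(text) > max_chars:
            break  # stop at the first chunk that would exceed the budget
        picked.append(text)
        used += len(text)
    context = "\n\n".join(picked)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    chunks = [{"text": "Refunds take 5 days."}, {"text": "Shipping is free over $50."}]
    print(build_prompt("How long do refunds take?", chunks))
```

A real pipeline would count tokens with the model's own tokenizer, but the shape of the loop stays the same.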
#rag
#embeddings
#vector-db
Pantheraa Space
Digital Panther