RAG 18 Jun 2026 1 min read

How RAG actually works (with code)

Retrieval-Augmented Generation grounds an LLM in your own data. The core loop in four steps, plus a tiny Python retriever.

RAG (Retrieval-Augmented Generation) grounds an LLM in your own data instead of relying only on what it memorized during training.

The core idea

  1. Embed your documents into vectors.
  2. Retrieve the most similar chunks for a query.
  3. Augment the prompt with those chunks.
  4. Generate an answer grounded in them.
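
Step 1 can be sketched without a real embedding model. The snippet below is only a toy illustration: `chunk`, `toy_embed`, `index`, and `DIM` are hypothetical names, and the hashing trick stands in for a trained model, so the vectors capture word overlap rather than meaning. The `"text"` key is also an assumption; only `"vec"` appears in the retriever in this post.

```python
import hashlib
import numpy as np

DIM = 64  # hypothetical embedding size; real models use far more dimensions

def chunk(text, size=40):
    # Split text into fixed-size word windows (one simple, assumed strategy).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def toy_embed(text):
    # Toy stand-in for an embedding model: hash each word into one of DIM buckets.
    vec = np.zeros(DIM)
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    return vec

def index(texts):
    # Step 1: turn every chunk of every document into a {"text", "vec"} record.
    return [{"text": c, "vec": toy_embed(c)} for t in texts for c in chunk(t)]
```

In a real pipeline you would likely swap `toy_embed` for a call to an embedding model and keep the rest of the shape the same.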

Similarity is usually cosine similarity between embeddings:

$$ \text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} $$
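
As a quick worked example, take $a = (3, 4)$ and $b = (4, 3)$ (values chosen purely for illustration):

$$ \text{sim}(a, b) = \frac{3 \cdot 4 + 4 \cdot 3}{\sqrt{3^2 + 4^2} \, \sqrt{4^2 + 3^2}} = \frac{24}{5 \cdot 5} = 0.96 $$

A score of 1 means the vectors point in the same direction, and 0 means they are orthogonal. Vector length does not affect the score.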

Here's a minimal retrieval loop in Python:

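import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A zero vector has no direction; score it 0.0 instead of dividing by zero.
    return float(a @ b / denom) if denom else 0.0

def retrieve(query_vec, docs, k=3):
    # Each doc is a dict holding its embedding under "vec".
    scored = [(cosine(query_vec, d["vec"]), d) for d in docs]
    # Highest similarity first, then keep the top k.
    scored.sort(key=lambda x: x[0], reverse=True)
    return [d for _, d in scored[:k]]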
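
The loop above scores documents one at a time, which is fine for small collections. Here is a vectorized sketch of the same idea, assuming the embeddings are stacked into one 2-D NumPy array with one row per document. `retrieve_batch` is a hypothetical name, and it returns row indices and scores rather than the document dicts.

```python
import numpy as np

def retrieve_batch(query_vec, doc_matrix, k=3):
    # doc_matrix: shape (n_docs, dim), one embedding per row (assumed layout).
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    denom = doc_norms * query_norm
    # Guard against zero vectors: give them a similarity of 0.
    safe = np.where(denom == 0, 1.0, denom)
    sims = np.where(denom == 0, 0.0, (doc_matrix @ query_vec) / safe)
    # Indices of the k highest scores, best first.
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```

One matrix-vector product replaces the Python loop. Dedicated vector databases go further with approximate nearest-neighbor indexes, which this sketch does not attempt.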

The retrieved chunks get stuffed into the context window before generation. Simple, but it means the model answers from your data rather than from memory alone.
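
To make the augmentation step concrete, here is a minimal sketch of prompt assembly. It assumes each retrieved chunk is a dict with a `"text"` key (not shown in the retriever above), and `build_prompt` and `max_chars` are hypothetical names. The character budget is a crude stand-in for a real token limit.

```python
def build_prompt(question, chunks, max_chars=2000):
    # Step 3: concatenate retrieved chunk texts, stopping at a rough size budget.
    # max_chars approximates a context limit; real systems count tokens instead.
    parts, used = [], 0
    for i, c in enumerate(chunks, start=1):
        entry = f"[{i}] {c['text']}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is what you would send to whichever LLM client you use for step 4. The instruction wording here is only one plausible choice.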

#rag #embeddings #vector-db