LLMs
17 Jun 2026
1 min read
The math behind transformer attention
Scaled dot-product attention in one formula, why the √dₖ scaling matters, and a tiny PyTorch sketch.
The heart of a transformer is scaled dot-product attention. Given queries $Q$, keys $K$ and values $V$:
$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V $$The $\sqrt{d_k}$ scaling stops the dot products from growing too large, which would push softmax into regions with vanishing gradients.
Softmax for a vector $z$ is:
$$ \sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$A tiny PyTorch sketch:
import torch
import torch.nn.functional as F
def attention(q, k, v):
d_k = q.size(-1)
scores = q @ k.transpose(-2, -1) / d_k ** 0.5
weights = F.softmax(scores, dim=-1)
return weights @ v
Once you see attention as a soft, differentiable lookup table, the rest of the architecture clicks into place.
#transformers
#attention
#math
Pantheraa Space
Digital Panther