Skip to content
LLMs 17 Jun 2026 1 min read

The math behind transformer attention

Scaled dot-product attention in one formula, why the √dₖ scaling matters, and a tiny PyTorch sketch.

The heart of a transformer is scaled dot-product attention. Given queries $Q$, keys $K$ and values $V$:

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V $$

The $\sqrt{d_k}$ scaling stops the dot products from growing too large, which would push softmax into regions with vanishing gradients.

Softmax for a vector $z$ is:

$$ \sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$

A tiny PyTorch sketch:

import torch
import torch.nn.functional as F

def attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v

Once you see attention as a soft, differentiable lookup table, the rest of the architecture clicks into place.

#transformers #attention #math
Pantheraa Space
Digital Panther
Work with us