LLMs By Pantheraa Space 17 Jun 2026 Updated 21 Jul 2026 1 min read

The math behind transformer attention

Scaled dot-product attention in one formula, why the √dₖ scaling matters, and a tiny PyTorch sketch.

The heart of a transformer is scaled dot-product attention. Given queries $Q$, keys $K$ and values $V$:

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V $$

The $\sqrt{d_k}$ scaling stops the dot products from growing too large, which would push softmax into regions with vanishing gradients.

Softmax for a vector $z$ is:

$$ \sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$

A tiny PyTorch sketch:

import torch
import torch.nn.functional as F

def attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v

Once you see attention as a soft, differentiable lookup table, the rest of the architecture clicks into place.

#transformers #attention #math

Written by

Pantheraa Space

Pantheraa Space is an India-based digital marketing & development agency. We build websites and custom software, run Google & Meta Ads, and grow your visibility with SEO, Google Business Profile and AI search optimization (AIO).

Want this done for your business?

Get a free audit

Keep reading

Google Business Profile

Local SEO Checklist for Indian Businesses (2026)

SEO

How Long Does SEO Take to Show Results? An Honest Timeline

Meta Ads

Keep reading

Local SEO Checklist for Indian Businesses (2026)

How Long Does SEO Take to Show Results? An Honest Timeline

Facebook Ads Not Working? 7 Reasons Meta Ads Fail (and How to Fix Them)