Every explainer on Grouped Query Attention says the same thing.
“GQA reduces KV cache memory by sharing key-value heads across query groups.”
That sentence is technically correct and practically useless.
Nobody shows you the actual numbers. Nobody traces what happens to a specific key vector when eight query heads all try to use it simultaneously. Nobody shows why — despite the sharing — the attention patterns remain fully distinct.
Let me fix that.
Why the KV Cache Is the Deployment Bottleneck
During inference, a language model generates one token at a time. For each new token, the attention mechanism must attend over every previous token.
In standard transformers, this means computing Q, K, V for the current token and attending over the K and V vectors of all previous tokens.
Rather than recompute those previous K and V vectors at every step, the model stores them in a cache. This is the KV cache.
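To make the caching idea concrete, here is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not how any production model implements it: the dimensions are toy values, the weights are random, and the names `decode_with_cache` and `decode_without_cache` are hypothetical helpers invented for this example. It decodes the same sequence twice, once appending each token's K and V to a cache and once recomputing them from scratch at every step, so the two results can be compared.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # W is a list of rows; returns W @ x as a plain list.
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all stored keys/values.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def decode_with_cache(tokens, Wq, Wk, Wv):
    # Each step projects only the current token, then appends its K and V.
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        q = matvec(Wq, x)
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        outputs.append(attend(q, k_cache, v_cache))
    return outputs, k_cache, v_cache


def decode_without_cache(tokens, Wq, Wk, Wv):
    # Each step re-projects K and V for every token seen so far.
    outputs = []
    for t in range(len(tokens)):
        keys = [matvec(Wk, x) for x in tokens[: t + 1]]
        values = [matvec(Wv, x) for x in tokens[: t + 1]]
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs


# Toy sizes chosen for readability (assumptions, not real model dimensions).
random.seed(0)
d_model, d_k, n_tokens = 8, 4, 5


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


Wq, Wk, Wv = (rand_matrix(d_k, d_model) for _ in range(3))
tokens = rand_matrix(n_tokens, d_model)  # stand-in for token embeddings

cached, k_cache, v_cache = decode_with_cache(tokens, Wq, Wk, Wv)
uncached = decode_without_cache(tokens, Wq, Wk, Wv)

print("cached K vectors:", len(k_cache))
```

Both paths produce the same outputs. The cached version does one K and one V projection per step, while the uncached version repeats every earlier projection each time. The price is that the cache grows by one K vector and one V vector per token, per head, per layer, which is exactly the memory the rest of this section adds up.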
KV cache size for standard Multi-Head Attention (MHA):
H = number of heads = 64 (LLaMA-2 70B)
dₖ = key/value dimension per head = 128
L = context length
M = number of layers = 80
Memory per token = 2 × H × dₖ × 4 bytes (factor 2 for K and V)
= 2 × 64 ×…