LLaMA-2 70B Has 64 Query Heads and 8 KV Heads. Here Is the Memory Arithmetic Nobody Shows You.

Every explainer on Grouped Query Attention says the same thing.

“GQA reduces KV cache memory by sharing key-value heads across query groups.”

That sentence is technically correct and practically useless.

Nobody shows you the actual numbers. Nobody traces what happens to a specific key vector when eight query heads all read it at the same time. Nobody shows why, despite the sharing, the attention patterns remain fully distinct.

Let me fix that.

Why the KV Cache Is the Deployment Bottleneck

During inference, a language model generates one token at a time. For each new token, the attention mechanism must attend over every previous token.

In a standard transformer, this means computing Q, K, and V for the current token, then attending over the K and V vectors of the current token and all previous tokens.

Rather than recompute those previous K and V vectors at every step, the model stores them in a cache. This is the KV cache.
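Here is a minimal sketch of that idea, assuming a single attention head, a toy dimension of 4, and random weights. The names `KVCache`, `decode_step`, and `decode_without_cache` are made up for illustration and are not from any library. It checks that attending over cached K and V gives the same output as recomputing them from scratch at every step.

```python
import math
import random

D = 4  # toy head dimension for illustration (LLaMA-2 70B uses 128)


def project(x, w):
    """Compute x @ w for a length-D vector x and a D x D matrix w (list of rows)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(q))]


class KVCache:
    """Hypothetical single-head cache: one key and one value per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x, wq, wk, wv, cache):
    """Project only the new token, store its K and V, attend over the cache."""
    q = project(x, wq)
    cache.append(project(x, wk), project(x, wv))
    return attend(q, cache.keys, cache.values)


def decode_without_cache(xs, wq, wk, wv):
    """Reference path: recompute K and V for every token seen so far."""
    keys = [project(x, wk) for x in xs]
    values = [project(x, wv) for x in xs]
    return attend(project(xs[-1], wq), keys, values)


def rand_matrix(rng):
    return [[rng.gauss(0, 1) for _ in range(D)] for _ in range(D)]


rng = random.Random(0)
wq, wk, wv = rand_matrix(rng), rand_matrix(rng), rand_matrix(rng)
tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(5)]

cache = KVCache()
for t, x in enumerate(tokens):
    cached_out = decode_step(x, wq, wk, wv, cache)
    full_out = decode_without_cache(tokens[: t + 1], wq, wk, wv)
    # Both paths must agree: the cache only avoids redundant projections.
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached_out, full_out))

print(len(cache.keys))  # one cached key per token processed
```

The cached path does one K projection and one V projection per step. The reference path redoes all of them every step. The price of that saving is that the cache grows by one key and one value per token, per head, per layer.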

KV cache size for standard Multi-Head Attention (MHA):

H  = number of heads = 64 (LLaMA-2 70B)
dₖ = key/value dimension per head = 128
L  = context length
M  = number of layers = 80

Memory per token = 2 × H × dₖ × 4 bytes   (per layer; factor 2 for K and V, 4 bytes per fp32 value)
                 = 2 × 64 ×…