Training a large language model from scratch costs tens of millions of dollars.
Serving it at scale costs more.
The transformer’s quadratic attention is the reason. Every new token must attend to every previous token.
At 128K context length, that is roughly 16 billion query-key pairs (128,000² ≈ 1.6 × 10¹⁰) to score in every layer on every forward pass.
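Mamba does not have this problem. Its state-space mechanism processes sequences in linear time, carrying a fixed-size state from one token to the next. The cost of each new token does not depend on the context length. The memory does not grow as the sequence gets longer.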
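To make the contrast concrete, here is a minimal Python sketch. It is not Mamba itself, and the function names `attention_score_count` and `linear_recurrence` are hypothetical, introduced only for illustration. The first function counts the query-key pairs one attention pass has to score, taking 128K to mean 128,000 tokens. The second is a toy scalar state-space scan with fixed, made-up parameters `a`, `b`, and `c`; real Mamba uses input-dependent parameters and a much larger state, but the cost structure is the same: one fixed-size state update per token.

```python
def attention_score_count(seq_len: int, causal: bool = False) -> int:
    """Count the query-key pairs scored in one attention pass over a sequence."""
    if causal:
        # Token i attends to tokens 0..i, so the total is 1 + 2 + ... + n.
        return seq_len * (seq_len + 1) // 2
    # Without a causal mask, every token is scored against every token.
    return seq_len * seq_len


def linear_recurrence(inputs, a=0.9, b=0.1, c=1.0):
    """Toy scalar state-space scan (illustrative, not the Mamba kernel).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The only thing carried between tokens is the single number h,
    so memory stays constant no matter how long the input is.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x  # one fixed-cost state update per token
        outputs.append(c * h)
    return outputs


if __name__ == "__main__":
    n = 128_000
    print(attention_score_count(n))               # 16384000000
    print(attention_score_count(n, causal=True))  # 8192064000
    print(linear_recurrence([1.0, 1.0], a=0.5, b=1.0, c=1.0))  # [1.0, 1.5]
```

Doubling the sequence length quadruples the first count, but only doubles the number of iterations in the scan, and the scan's state stays the same size throughout.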
The catch: you cannot just swap attention for Mamba in a trained transformer. Apple tried. The perplexity went above 100. The model broke.
Their April 2026 paper explains why, and how to fix it.
The fix is two steps. Not one. And the reason two steps work while one step fails is one of the most illuminating things I have read about why attention and Mamba are similar but not the same.
Let us build each step.
Why Direct Distillation Fails
Knowledge distillation normally looks like this: you have a large teacher model and a smaller student model. You train the student to match the teacher’s…
