Your LLM Is Guessing Ahead. Then It Checks Itself aka Speculative Decoding

Every token your LLM generates costs one full forward pass. One pass, one token. No shortcuts.

That is the bottleneck. Not raw compute. Not memory bandwidth, exactly. The sequential dependency. Token N cannot be generated until token N−1 exists, so the passes cannot overlap. While the model decodes one token at a time, most of the GPU's arithmetic capacity goes unused.
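To make the dependency concrete, here is a minimal sketch of plain autoregressive decoding. The function `toy_forward` is a hypothetical stand-in for a real model's forward pass, not any library's API. The only point is the shape of the loop: one pass per token, and each pass needs the token the previous pass produced.

```python
def toy_forward(tokens):
    """Hypothetical stand-in for one full forward pass.

    Takes the whole prefix and returns the next token id.
    """
    return (sum(tokens) * 31 + len(tokens)) % 50


def generate(prompt, n_new):
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new):
        # This pass reads the token appended by the previous iteration,
        # so the iterations cannot run in parallel.
        next_token = toy_forward(tokens)
        passes += 1
        tokens.append(next_token)
    return tokens, passes


tokens, passes = generate([1, 2, 3], n_new=8)
```

Eight new tokens, eight passes. Nothing in this loop can be batched across steps.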

Speculative decoding breaks this. It lets a small model guess several tokens ahead, then lets the big model verify all of them in a single pass.

That sentence sounds like it should change the output. It does not. The math guarantees it. That guarantee is what nobody shows you.

Let us examine it.

The setup

You have two models.

The target model p. This is the large model. Llama-3.1-70B, say. Slow. Expensive. Correct.

The draft model q. This is small. Maybe a 1B-parameter head attached to the target model’s own internals. Fast. Cheap. Slightly wrong.

You want outputs that look exactly as if p had generated them. You also want to use q to go faster. These two goals seem to be in conflict.

They are not.
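A minimal sketch of why, under stated assumptions. It uses the accept/reject rule commonly described for speculative sampling, which this article has not stated yet: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized positive part of p - q. The toy distributions and the function name `speculative_output_dist` are made up for illustration. The code computes, with exact fractions, the distribution of the token that finally gets emitted.

```python
from fractions import Fraction as F


def speculative_output_dist(p, q):
    """Exact distribution of one token emitted by draft-then-verify.

    p, q: dicts mapping token -> probability (target and draft).
    Assumes the accept/reject rule described in the lead-in.
    """
    tokens = list(p)
    # Probability that the draft proposes x AND the target accepts it.
    accept = {
        x: q[x] * min(F(1), p[x] / q[x]) if q[x] > 0 else F(0)
        for x in tokens
    }
    # Target mass the draft under-covers; sampled from after a rejection.
    residual = {x: max(F(0), p[x] - q[x]) for x in tokens}
    reject_prob = 1 - sum(accept.values())
    norm = sum(residual.values())
    out = {}
    for x in tokens:
        resample = residual[x] / norm if norm > 0 else F(0)
        out[x] = accept[x] + reject_prob * resample
    return out


p = {"a": F(6, 10), "b": F(3, 10), "c": F(1, 10)}  # toy target distribution
q = {"a": F(3, 10), "b": F(5, 10), "c": F(2, 10)}  # toy draft distribution

result = speculative_output_dist(p, q)
acceptance_rate = sum(min(p[x], q[x]) for x in p)
```

In this toy case `result` equals `p` exactly, even though q disagrees with p on every token, and the drafted token is accepted 7 times out of 10. The draft only changes how often the shortcut pays off, not what comes out.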

What happens at each step