Every Token-Based Language Model Is Throwing Away Information at the Last Step. A 2025 Paper Proved You Can Generate Text Without Ever Committing to Tokens Until the End.
Discrete tokens were a computational convenience, not a theoretical necessity. ELF generates text entirely in continuous embedding space. The token is a readout, not the computation.
What Every Language Model Does That It Should Not Have To
Every autoregressive language model generates text the same way.
GPT-4. Claude. Llama. Gemma. Every one of them.
They project continuous hidden states into a discrete vocabulary at every single generation step. They pick a token. They embed that token. They feed it back as the next input. They repeat this loop once for every token they produce, thousands of times for a long document.
The formula, unchanged since the first GPT:
h_t = Transformer(x_1, ..., x_{t-1})
p_t = softmax(W_E · h_t)
x_t = argmax(p_t) or sample from p_t
Vaswani et al. (2017) did not choose this design because it was optimal. They chose it because discrete tokens are what humans use to label text data. The vocabulary is how the training signal is defined. The token is a training artifact, not an inductive truth.
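To make the loop concrete, here is a minimal NumPy sketch of the three formula lines above. It is a toy under stated assumptions, not any real model's implementation: `toy_transformer` is a hypothetical stand-in for the Transformer, the sizes `V` and `D` are arbitrary, and `W_E` is reused for both embedding and output projection only because the formula writes it that way.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 50, 16                    # assumed toy vocabulary size and hidden width
W_E = rng.normal(size=(V, D))    # embedding matrix, reused as the output projection


def toy_transformer(embedded_prefix):
    """Hypothetical stand-in for Transformer(x_1, ..., x_{t-1}).

    Averages the prefix embeddings and applies tanh. A real model would
    run attention and MLP layers here.
    """
    return np.tanh(embedded_prefix.mean(axis=0))


def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def generate(prompt_ids, steps, greedy=True):
    ids = list(prompt_ids)
    for _ in range(steps):
        h_t = toy_transformer(W_E[ids])   # continuous hidden state, shape (D,)
        p_t = softmax(W_E @ h_t)          # distribution over the V tokens
        if greedy:
            x_t = int(p_t.argmax())       # x_t = argmax(p_t)
        else:
            x_t = int(rng.choice(V, p=p_t))  # or sample from p_t
        # Only the integer id is carried forward. h_t and the rest of p_t
        # are dropped, and the next step re-embeds x_t from W_E.
        ids.append(x_t)
    return ids


out = generate([3, 7], steps=5)
print(out)
```

The point to notice is the last line of the loop body: each step computes a full continuous state and a full distribution, then keeps a single integer.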