Abstract: Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting laws for how the inverse temperature should grow with the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. By counting how many competitors lie within each gap from the maximum score, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework recovers the prior scaling laws as the cases corresponding to different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
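As an informal illustration of the quantities the abstract names, the sketch below computes an empirical gap count and the softmax entropy for a single row of attention scores. It is not the paper's code, and the exact definition of $N_n$ used there may differ: here the count is assumed to be the number of entries whose gap from the row maximum is at most $g$, with the maximizer itself included. The function names `gap_counts` and `softmax_entropy` and the example scores are hypothetical.

```python
import bisect
import math


def gap_counts(scores, gaps):
    """Count, for each g in gaps, the entries with max(scores) - score <= g.

    Assumption: the maximizer itself is counted, so the count at g = 0 is >= 1.
    """
    top = max(scores)
    deltas = sorted(top - s for s in scores)  # non-negative gaps, ascending
    return [bisect.bisect_right(deltas, g) for g in gaps]


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores)."""
    top = max(scores)
    # Subtracting the maximum keeps the exponentials in [0, 1].
    weights = [math.exp(-beta * (top - s)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical row: one maximizer, two close competitors, two distant entries.
example_scores = [3.0, 2.5, 2.5, 1.0, 0.0]
example_counts = gap_counts(example_scores, [0.0, 0.5, 2.0, 3.0])
example_entropies = [softmax_entropy(example_scores, b) for b in (0.0, 1.0, 4.0, 50.0)]
```

With the max-subtracted weights used here, the normalizer equals $\sum_j e^{-\beta \Delta_j}$ over the gaps $\Delta_j$, so it depends on the scores only through how many entries sit within each gap of the maximum. At `beta = 0` the entropy is $\log 5$ for this row, and it shrinks toward zero as `beta` grows large enough to separate the maximizer from its nearest competitors.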
| Subjects: | Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR) |
| MSC classes: | 68T07, 41A60, 82B26, 60G70 |
| ACM classes: | G.3; I.2.7 |
| Cite as: | arXiv:2605.12697 [stat.ML] (or arXiv:2605.12697v1 [stat.ML] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12697 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Tomohiro Hayase
[v1] Tue, 12 May 2026 19:48:36 UTC (1,423 KB)
