Abstract: Transformers are extremely successful machine learning models whose mathematical properties remain poorly understood. Here, we rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity. By viewing such transformers as discrete-time dynamical systems describing the evolution of points in a Euclidean space, and thanks to a geometric interpretation of the self-attention mechanism based on hyperplane separation, we show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders. We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model, which effectively captures 'context' by clustering meaningless words around leader words carrying the most meaning. Finally, we outline remaining challenges to bridge the gap between the mathematical analysis of transformers and their real-life implementation.
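The clustering behavior described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state the update rule, so the form used here is an assumption: each point is averaged with the mean of the points that maximize the score ⟨A z_i, z_j⟩ (hardmax attention), and the division by 1 + alpha stands in for the normalization sublayer. The function names, the matrix `A`, the step size `alpha`, and the toy data are all hypothetical and chosen only for illustration; a point that attends only to itself is treated as a leader.

```python
import numpy as np


def hardmax_targets(Z, A, tol=1e-12):
    """Boolean mask: mask[i, j] is True when point j maximizes <A z_i, z_j>."""
    scores = Z @ A.T @ Z.T  # scores[i, j] = <A z_i, z_j>
    return scores >= scores.max(axis=1, keepdims=True) - tol


def find_leaders(Z, A, tol=1e-12):
    """Indices of points whose only hardmax target is themselves (assumed notion of leader)."""
    mask = hardmax_targets(Z, A, tol)
    is_leader = mask.diagonal() & (mask.sum(axis=1) == 1)
    return np.flatnonzero(is_leader).tolist()


def hardmax_attention_step(Z, A, alpha=1.0, tol=1e-12):
    """One layer: average each point with the mean of its hardmax targets (assumed update rule)."""
    mask = hardmax_targets(Z, A, tol).astype(float)
    targets = (mask @ Z) / mask.sum(axis=1, keepdims=True)
    return (Z + alpha * targets) / (1.0 + alpha)


def run_layers(Z, A, alpha=1.0, n_layers=60):
    """Apply the layer map repeatedly, treating depth as discrete time."""
    Z = np.asarray(Z, dtype=float)
    for _ in range(n_layers):
        Z = hardmax_attention_step(Z, A, alpha)
    return Z


# Toy data (hypothetical): five points in the plane, identity attention matrix.
Z0 = np.array([[2.0, 0.0], [1.0, 0.0], [0.5, 0.1], [-3.0, 0.0], [-1.0, 0.0]])
A = np.eye(2)
leaders = find_leaders(Z0, A)
Z_final = run_layers(Z0, A)
```

In this toy configuration the two points of largest norm on each side attend only to themselves, so they stay fixed under the assumed update, while the remaining points drift toward whichever of them they attend to. This is only a small illustration of the qualitative picture in the abstract, not a reproduction of the paper's model or experiments.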
| Comments: | 23 pages, 11 figures, 1 table. Funded by the European Union (Horizon Europe MSCA project ModConFlex, grant number 101073558). Accompanying code available at: this https URL |
| Subjects: | Computation and Language (cs.CL); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML) |
| MSC classes: | 68T07, 68T50 |
| Cite as: | arXiv:2407.01602 [cs.CL] |
| | (or arXiv:2407.01602v2 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2407.01602 (arXiv-issued DOI via DataCite) |
| Journal reference: | SIAM Journal on Mathematics of Data Science 7(3):1367-1393, 2025 |
| Related DOI: | https://doi.org/10.1137/24M167086X |
Submission history
From: Albert Alcalde
[v1] Wed, 26 Jun 2024 16:13:35 UTC (394 KB)
[v2] Wed, 13 May 2026 09:54:58 UTC (749 KB)
