Abstract: Sparse autoencoders (SAEs) decompose transformer residual streams into interpretable feature dictionaries, yet the relationship between SAE width and causal influence on model output has not been systematically characterised. We introduce causal dimensionality kappa(L, M, T), defined as the effective rank of the expected Jacobian outer product at layer L, and show that it can be estimated via an SAE width sweep paired with attribution patching. Across seven SAE widths from 16,384 to 1,048,576 features on Gemma-2-2B layer 12, representational capacity grows 15.6x while causal capacity grows only 4.35x: a robust separation we term the representational-causal wedge. A saturating fit yields kappa-hat approximately 1,990 with kappa-hat / d_model = 0.86 and a participation-ratio lower bound kappa_PR of approximately 280. Crucially, kappa is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions). Across eight network depths, kappa is constant while the absolute attribution threshold drops 20x from layer 1 to layer 23. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, and a four-cell encoder/decoder ablation) pin down what kappa measures and what it does not. Our findings establish kappa as a measurable, model-intrinsic property of transformer layers: sub-linearly recoverable by SAE width, invariant to model scaling, and structured across network depth.
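For readers unfamiliar with the participation-ratio bound mentioned in the abstract, the following is a minimal sketch, not the authors' code. It assumes the standard definition PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues), and it assumes the "Jacobian outer product" is the batch mean of J^T J; the function names, array shapes, and toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def participation_ratio(eigenvalues):
    """Participation ratio (sum lam)^2 / sum(lam^2) of a non-negative spectrum.

    For a positive semi-definite matrix this never exceeds the matrix rank,
    which is why it can serve as a lower bound on an effective rank.
    """
    # Clip tiny negative values that eigvalsh can return from round-off.
    lam = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
    total = lam.sum()
    if total == 0.0:
        return 0.0
    return float(total ** 2 / np.square(lam).sum())


def mean_jacobian_outer_product_spectrum(jacobians):
    """Eigenvalues of the batch mean of J^T J.

    Assumption: `jacobians` has shape (n_samples, n_outputs, d_model), so the
    averaged matrix is d_model x d_model. The paper may use another convention.
    """
    J = np.asarray(jacobians, dtype=float)
    M = np.einsum("noi,noj->ij", J, J) / J.shape[0]
    return np.linalg.eigvalsh(M)


# Toy data (hypothetical): Jacobians confined to a 3-dimensional subspace of
# an 8-dimensional activation space.
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 8))
toy_jacobians = rng.normal(size=(32, 4, 3)) @ basis  # shape (32, 4, 8)

toy_pr = participation_ratio(mean_jacobian_outer_product_spectrum(toy_jacobians))
flat_pr = participation_ratio([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
```

In this sketch a flat spectrum with four equal non-zero eigenvalues gives a participation ratio of exactly 4, while the toy Jacobians give a value no larger than 3, the dimension of the subspace they were drawn from.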
| Comments: | 9 pages, 17 figures, 14 tables (excluding references and appendices). Companion short paper under review at the ICML 2026 Mechanistic Interpretability Workshop. Code: this https URL |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI) |
| MSC classes: | 68T07, 68T50 |
| ACM classes: | I.2.6; I.2.7; I.2.0 |
| Cite as: | arXiv:2605.08740 [cs.LG] |
| (or arXiv:2605.08740v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2605.08740 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Nilesh Sarkar
[v1]
Sat, 9 May 2026 07:05:26 UTC (4,489 KB)
