Abstract: Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity -- one feature with many meanings -- or on whether explanations predict activations. We identify a complementary, structurally distinct problem that we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that each distinct annotation string is reused across 3.07 features on average; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string ("plural nouns") labels 101 distinct features spanning 18 layers and four model components. In information-theoretic terms, the average annotation resolves only 70% of the information needed to identify a feature. We formalize a property called discrimination, prove that current detection-style auto-interpretability scoring is invariant to collision, and propose two complementary corrective metrics -- collision-adjusted detection and discrimination scoring -- that explicitly penalize explanations that fail to distinguish a feature from its neighbors. The collision problem is independent of, and additive with, previously identified failure modes of auto-interpretability; ignoring it inflates reported feature interpretability by a quantity equal to roughly one-third of the bits required to identify a feature.
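The collision statistics quoted in the abstract can be illustrated with a minimal sketch. It assumes the dataset reduces to one annotation string per feature, and it assumes that "resolved feature identity" means the fraction of log2(N) bits recovered under a uniform prior over the N features; the abstract does not state the paper's exact definition, so this reading is a guess. The function name `collision_stats` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def collision_stats(annotations):
    """Summarize how often annotation strings are shared across features.

    `annotations` holds one label per feature (hypothetical input format).
    """
    n = len(annotations)
    if n == 0:
        raise ValueError("need at least one annotated feature")

    counts = Counter(annotations)  # annotation string -> number of features

    # Mean number of features per distinct annotation string.
    mean_reuse = n / len(counts)

    # Fraction of features whose annotation also labels another feature.
    shared_fraction = sum(c for c in counts.values() if c > 1) / n

    # ASSUMPTION: uniform prior over features. Identifying a feature takes
    # log2(n) bits; after seeing its annotation, log2(c) bits remain, where
    # c is the number of features carrying that same annotation.
    total_bits = math.log2(n)
    residual_bits = sum((c / n) * math.log2(c) for c in counts.values())
    resolved = 1.0 - residual_bits / total_bits if total_bits > 0 else 1.0

    return {
        "mean_reuse": mean_reuse,
        "shared_fraction": shared_fraction,
        "resolved_fraction": resolved,
    }


# Toy example: four features, two of which collide on the same label.
stats = collision_stats(["plural nouns", "plural nouns", "dates", "code comments"])
```

On this toy input, half of the features share their label, and the labels resolve 0.75 of the 2 bits needed to pick out one of four features. Applying the same function to the real annotation strings would show whether this reading of the 70% figure matches the paper's.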
| Comments: | 11 pages, 2 figures, 3 tables |
| Subjects: | Machine Learning (cs.LG) |
| ACM classes: | I.2.6; I.2.7 |
| Cite as: | arXiv:2605.12874 [cs.LG] |
| (or arXiv:2605.12874v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2605.12874 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Jordan McCann
[v1] Wed, 13 May 2026 01:41:38 UTC (13 KB)
