Abstract:KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $\alpha$, a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $\lambda$. A pre-registered protocol tunes $\lambda$ on a frozen development split and confirms on a disjoint held-out split; with $\lambda = 0.5$, $\alpha$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
| Comments: | 12 pages, 2 figures, 3 tables. Code and data: this https URL |
| Subjects: | Machine Learning (cs.LG); Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.14292 [cs.LG] |
| (or arXiv:2605.14292v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2605.14292 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Libo Sun [view email]
[v1]
Thu, 14 May 2026 02:50:20 UTC (81 KB)
