Abstract: Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
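As a rough illustration of why a Kullback-Leibler divergence between Gaussian kernels can split into interpretable pieces, the sketch below evaluates the standard closed-form KL divergence between two diagonal Gaussians and reports its mean-mismatch and variance-mismatch parts separately. This is not the paper's decomposition: the rank-bottleneck cost is not modelled, the function names `gaussian_kl_terms` and `gaussian_kl` are hypothetical, and the direction of the divergence (which kernel plays the role of `p` and which of `q`) is an assumption, since the abstract does not specify it.

```python
import math


def gaussian_kl_terms(mu_p, var_p, mu_q, var_q):
    """Return (mean_term, var_term) with KL(P || Q) = mean_term + var_term.

    P and Q are diagonal Gaussians given by per-coordinate means and
    variances. Which of teacher/student is P or Q is an assumption here.
    """
    mean_term = 0.0
    var_term = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Mean mismatch, scaled by the variance of Q (a bias-like cost).
        mean_term += (mp - mq) ** 2 / (2.0 * vq)
        # Variance mismatch; zero exactly when vp == vq.
        ratio = vp / vq
        var_term += 0.5 * (ratio - 1.0 - math.log(ratio))
    return mean_term, var_term


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """Total KL(P || Q) for diagonal Gaussians."""
    return sum(gaussian_kl_terms(mu_p, var_p, mu_q, var_q))


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    mean_part, var_part = gaussian_kl_terms([1.0, 0.0], [1.0, 2.0],
                                            [0.0, 0.0], [1.0, 1.0])
    print("mean-mismatch term:", mean_part)
    print("variance-mismatch term:", var_part)
```

Both terms are non-negative, and the total is zero only when the two Gaussians coincide. The divergence is also asymmetric, so swapping the roles of `p` and `q` generally changes its value.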
| Comments: | 6 pages, accepted at ISIT 2026 |
| Subjects: | Information Theory (cs.IT); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.13143 [cs.IT] |
| | (or arXiv:2605.13143v1 [cs.IT] for this version) |
| | https://doi.org/10.48550/arXiv.2605.13143 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Bingying Li
[v1] Wed, 13 May 2026 08:10:05 UTC (93 KB)
