On the Generalization of Knowledge Distillation: An Information-Theoretic View

Abstract: Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
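
As a rough illustration of why a Kullback-Leibler divergence between Gaussian laws splits into interpretable pieces, the sketch below evaluates the standard closed form for two scalar Gaussians and reports a mean-mismatch term and a spread-mismatch term separately. This is not the paper's decomposition: the scalar setting, the direction of the divergence, and the names `gaussian_kl_terms` and `gaussian_kl` are assumptions made for illustration only, and no rank-bottleneck term arises in one dimension.

```python
import math


def gaussian_kl_terms(mean_p, std_p, mean_q, std_q):
    """Split KL(N(mean_p, std_p^2) || N(mean_q, std_q^2)) into two parts (in nats).

    Hypothetical helper for illustration; returns (mean_term, spread_term),
    whose sum is the full divergence.
    """
    if std_p <= 0 or std_q <= 0:
        raise ValueError("standard deviations must be positive")
    # Cost of the mean mismatch, measured in units of q's variance.
    mean_term = (mean_p - mean_q) ** 2 / (2.0 * std_q ** 2)
    # Cost of the variance mismatch; zero exactly when std_p == std_q.
    ratio = (std_p / std_q) ** 2
    spread_term = 0.5 * (ratio - 1.0 - math.log(ratio))
    return mean_term, spread_term


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """Full KL divergence between two scalar Gaussians, in nats."""
    return sum(gaussian_kl_terms(mean_p, std_p, mean_q, std_q))
```

Both terms are nonnegative, so under these assumptions either a shifted mean or a mismatched variance alone is enough to make the divergence positive.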

Comments:	6 pages, accepted at ISIT 2026
Subjects:	Information Theory (cs.IT); Machine Learning (cs.LG)
Cite as:	arXiv:2605.13143 [cs.IT]
	(or arXiv:2605.13143v1 [cs.IT] for this version)
	https://doi.org/10.48550/arXiv.2605.13143 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Bingying Li
[v1] Wed, 13 May 2026 08:10:05 UTC (93 KB)