The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

Abstract: Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on the outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $\kappa$. We prove that the strong model efficiently learns task $\kappa$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.
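
As a rough illustration of the training setup the abstract describes, the following toy sketch fine-tunes a two-layer ReLU student with multi-step minibatch SGD on labels produced by a weak teacher. Everything specific here is an assumption made for illustration and is not taken from the paper: the Gaussian inputs, the linear-then-tanh weak teacher, the squared loss, the random (rather than pre-trained, subspace-structured) student initialization, the hyperparameters, and the hypothetical function name `w2s_finetune`.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def w2s_finetune(seed=0, d=8, m=16, n=256, steps=200, lr=0.05, batch=32):
    """Toy weak-to-strong loop (illustrative assumptions throughout)."""
    rng = np.random.default_rng(seed)

    # Synthetic inputs (assumed Gaussian).
    X = rng.normal(size=(n, d))

    # Assumed weak teacher: a noisy copy of a "true" task direction.
    v_true = np.zeros(d)
    v_true[0] = 1.0
    v_weak = v_true + 0.3 * rng.normal(size=d)
    v_weak /= np.linalg.norm(v_weak)
    y_weak = np.tanh(X @ v_weak)  # weak labels used as supervision

    # Two-layer student f(x) = a^T relu(W x); random init stands in
    # for the pre-trained representations (a simplification).
    W = rng.normal(size=(m, d)) / np.sqrt(d)
    a = rng.normal(size=m) / np.sqrt(m)

    def full_loss():
        # Squared loss against the weak labels on the whole dataset.
        return 0.5 * np.mean((relu(X @ W.T) @ a - y_weak) ** 2)

    losses = [full_loss()]
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        Xb, yb = X[idx], y_weak[idx]
        Z = Xb @ W.T            # pre-activations, shape (batch, m)
        H = relu(Z)             # hidden features
        err = H @ a - yb        # residual against weak labels
        grad_a = H.T @ err / batch
        grad_W = ((err[:, None] * a[None, :]) * (Z > 0)).T @ Xb / batch
        a -= lr * grad_a        # both layers are trained, so the
        W -= lr * grad_W        # first-layer features can move
        losses.append(full_loss())
    return losses


if __name__ == "__main__":
    history = w2s_finetune()
    print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.4f}")
```

Updating `W` as well as `a` is the point of the sketch: it places the toy in a feature-learning regime instead of fixing the student's representations. It does not attempt to reproduce the paper's analysis of forgetting or of the subspaces $V_k$.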

Comments:	48 pages, 1 figure
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:2605.12908 [stat.ML]
	(or arXiv:2605.12908v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2605.12908 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ryoya Awano
[v1] Wed, 13 May 2026 02:35:58 UTC (1,165 KB)