Abstract: Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $\kappa$. We prove that the strong model efficiently learns task $\kappa$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.
| Comments: | 48 pages, 1 figure |
| Subjects: | Machine Learning (stat.ML); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.12908 [stat.ML] (or arXiv:2605.12908v1 [stat.ML] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12908 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Ryoya Awano
[v1] Wed, 13 May 2026 02:35:58 UTC (1,165 KB)
