Abstract: Missing data imputation, where a model is trained on observed data to estimate unobserved values, is a fundamental problem in machine learning. In this paper, we rigorously formulate imputation model learning as a mean-squared error risk minimisation problem. We show that when the probability of missingness depends on the data, many state-of-the-art methods fail to account for the resulting distribution shift between the observed data used for training and the full data distribution used for evaluation. Consequently, these approaches do not minimise mean-squared error on the full data distribution. To address this, we propose a novel imputation algorithm designed to learn an imputation model from the observed data while explicitly accounting for this distribution shift. Simulation studies show consistent improvements over otherwise identical uncorrected baselines, with average reductions of 3% in RMSE and 7% in Wasserstein distance.
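The distribution shift described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it uses standard inverse-probability weighting as a stand-in correction, and every name in it (`simulate`, `fit_linear`, `full_mse`) and every setting (a quadratic target, a logistic observation probability, a deliberately misspecified linear model) is an assumption chosen for illustration. It also assumes the observation probabilities are known, which holds only because the data are simulated; in practice they would have to be estimated.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Simulate (x, y) where the chance of observing y depends on x."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # Nonlinear target, so a linear imputation model is misspecified.
    y = x**2 + 0.1 * rng.normal(size=n)
    # Assumed missingness mechanism: y is more likely observed for large x.
    p_obs = 1.0 / (1.0 + np.exp(-x))
    observed = rng.random(n) < p_obs
    return x, y, p_obs, observed


def fit_linear(x, y, w):
    """Weighted least squares for y ~ a + b*x with sample weights w."""
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef


def full_mse(coef, x, y):
    """Mean-squared error on the full data (known only in simulation)."""
    pred = coef[0] + coef[1] * x
    return float(np.mean((y - pred) ** 2))


x, y, p_obs, observed = simulate()
xo, yo, po = x[observed], y[observed], p_obs[observed]

# Uncorrected fit: minimises MSE under the observed-data distribution.
coef_naive = fit_linear(xo, yo, np.ones_like(xo))
# Weighted fit: weights 1 / P(observed | x) re-target the full distribution.
coef_ipw = fit_linear(xo, yo, 1.0 / po)

mse_naive = full_mse(coef_naive, x, y)
mse_ipw = full_mse(coef_ipw, x, y)
print(f"full-data MSE, uncorrected: {mse_naive:.3f}")
print(f"full-data MSE, weighted:    {mse_ipw:.3f}")
```

In this toy setting the uncorrected fit is pulled towards the region where values are more often observed, so its error on the full data is larger than that of the weighted fit. The size of the gap depends entirely on the assumed mechanism and model, and says nothing about the magnitudes reported in the paper.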
| Comments: | 9 pages, 12 figures |
| Subjects: | Machine Learning (stat.ML); Machine Learning (cs.LG) |
| Cite as: | arXiv:2602.06713 [stat.ML] |
| | (or arXiv:2602.06713v2 [stat.ML] for this version) |
| | https://doi.org/10.48550/arXiv.2602.06713 (arXiv-issued DOI via DataCite) |
Submission history
From: Luke Shannon
[v1] Fri, 6 Feb 2026 14:02:12 UTC (700 KB)
[v2] Wed, 13 May 2026 09:23:06 UTC (734 KB)
