AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

Abstract: Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
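
The two signal-processing ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' AaPE implementation: the function names, the 100 Hz spectrogram frame rate, the patch stride of 16 frames, and the fixed kernel parameters are all assumptions chosen for illustration (in the paper the frequency and decay are estimated from the input). The first function shows where a temporal modulation lands after strided downsampling; the second builds a complex sinusoid weighted by a two-sided exponential window, which responds most strongly near its center frequency.

```python
import cmath
import math


def folded_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a real sinusoid at freq_hz after sampling at sample_rate_hz."""
    nyquist = sample_rate_hz / 2.0
    f = freq_hz % sample_rate_hz
    # Components above the Nyquist frequency fold back into the baseband.
    return sample_rate_hz - f if f > nyquist else f


def exp_windowed_kernel(center_freq_hz, decay, sample_rate_hz, half_width):
    """Complex sinusoid multiplied by the two-sided window exp(-decay * |n|)."""
    kernel = []
    for n in range(-half_width, half_width + 1):
        window = math.exp(-decay * abs(n))
        phase = 2.0 * math.pi * center_freq_hz * n / sample_rate_hz
        kernel.append(window * cmath.exp(1j * phase))
    return kernel


def response_magnitude(kernel, freq_hz, sample_rate_hz, half_width):
    """Magnitude of the kernel's frequency response evaluated at freq_hz."""
    total = 0j
    for idx, n in enumerate(range(-half_width, half_width + 1)):
        total += kernel[idx] * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate_hz)
    return abs(total)


# Hypothetical settings, not taken from the paper.
FRAME_RATE_HZ = 100.0   # assumed spectrogram frame rate
PATCH_STRIDE = 16       # assumed temporal stride of the patch embedding
token_rate_hz = FRAME_RATE_HZ / PATCH_STRIDE  # 6.25 tokens per second

# A 4 Hz modulation lies above the token-rate Nyquist (3.125 Hz) and folds.
aliased_hz = folded_frequency(4.0, token_rate_hz)

# A kernel centered on that alias-prone band, applied at the frame rate
# (before downsampling), so the band can still be analyzed without folding.
HALF_WIDTH = 64
kernel = exp_windowed_kernel(4.0, decay=0.1, sample_rate_hz=FRAME_RATE_HZ,
                             half_width=HALF_WIDTH)
in_band = response_magnitude(kernel, 4.0, FRAME_RATE_HZ, HALF_WIDTH)
out_of_band = response_magnitude(kernel, 20.0, FRAME_RATE_HZ, HALF_WIDTH)
print(aliased_hz, in_band > out_of_band)
```

Under these assumed settings, a larger `decay` shortens the window in time and widens the analyzed band, while a smaller `decay` narrows the band at the cost of a longer kernel.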

Comments:	Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Copyright IEEE
Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2512.03637 [cs.SD]
	(or arXiv:2512.03637v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2512.03637 arXiv-issued DOI via DataCite
Related DOI:	https://doi.org/10.1109/TASLPRO.2026.3690632 DOI(s) linking to related resources

Submission history

From: Kohei Yamamoto
[v1] Wed, 3 Dec 2025 10:17:35 UTC (789 KB)
[v2] Thu, 14 May 2026 08:51:32 UTC (820 KB)