Abstract: Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
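The two signal-processing ideas in the abstract can be illustrated with a minimal sketch: how temporal downsampling folds a modulation frequency above the reduced Nyquist limit, and what a complex sinusoidal kernel with a two-sided exponential window looks like. This is not the AaPE implementation. The function names, the stride of 16, and the fixed frequency and decay values are assumptions chosen for illustration, whereas the abstract states that AaPE estimates the kernel's frequency and decay from the input.

```python
import cmath
import math


def aliased_frequency(freq, stride):
    """Apparent frequency (cycles per original frame) after keeping every stride-th frame."""
    # Frequencies that differ by a multiple of 1/stride become indistinguishable,
    # so fold `freq` onto the nearest multiple of 1/stride.
    return abs(freq - round(freq * stride) / stride)


def exp_windowed_sinusoid(freq, decay, half_width):
    """Complex sinusoid at `freq` under the two-sided window exp(-decay * |n|)."""
    return [
        math.exp(-decay * abs(n)) * cmath.exp(2j * math.pi * freq * n)
        for n in range(-half_width, half_width + 1)
    ]


def response(kernel, signal_freq):
    """Normalized magnitude of the kernel's inner product with a unit complex tone."""
    half_width = len(kernel) // 2
    acc = sum(
        k.conjugate() * cmath.exp(2j * math.pi * signal_freq * n)
        for k, n in zip(kernel, range(-half_width, half_width + 1))
    )
    return abs(acc) / sum(abs(k) for k in kernel)


stride = 16                 # assumed temporal patch stride (hypothetical)
nyquist = 1 / (2 * stride)  # Nyquist limit after downsampling, in cycles per frame
f_mod = 0.05                # a modulation frequency above that limit

print("Nyquist after downsampling:", nyquist)
print("apparent frequency after downsampling:", aliased_frequency(f_mod, stride))

# Fixed, hand-picked parameters; in AaPE these would be estimated from the input.
kernel = exp_windowed_sinusoid(freq=f_mod, decay=0.1, half_width=32)
print("response at kernel frequency:", response(kernel, f_mod))
print("response far from kernel frequency:", response(kernel, 0.2))
```

Under these assumptions, the kernel responds strongly to a tone at its own frequency and weakly to a distant one. That is the sense in which such a kernel can analyze a modulation band before downsampling folds it onto a lower frequency.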
| Comments: | Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Copyright IEEE |
| Subjects: | Sound (cs.SD); Machine Learning (cs.LG); Machine Learning (stat.ML) |
| Cite as: | arXiv:2512.03637 [cs.SD] |
| | (or arXiv:2512.03637v2 [cs.SD] for this version) |
| | https://doi.org/10.48550/arXiv.2512.03637 (arXiv-issued DOI via DataCite) |
| Related DOI: | https://doi.org/10.1109/TASLPRO.2026.3690632 |
Submission history
From: Kohei Yamamoto
[v1] Wed, 3 Dec 2025 10:17:35 UTC (789 KB)
[v2] Thu, 14 May 2026 08:51:32 UTC (820 KB)
