Abstract: Decoding infant cry causes remains challenging for healthcare monitoring due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets. We propose a compact acoustic framework that fuses mel-frequency cepstral coefficients (MFCCs), short-time Fourier transform (STFT) features, and fundamental-frequency (F0) contours within a multi-branch convolutional neural network (CNN) encoder, and models temporal dynamics using an enhanced Legendre Memory Unit (LMU). Compared to LSTMs, the LMU backbone provides stable sequence modeling with substantially fewer recurrent parameters, supporting efficient deployment. To improve cross-dataset generalization, we introduce calibrated posterior ensemble fusion with entropy-gated weighting to preserve domain-specific expertise while mitigating dataset bias. Experiments on Baby2020 and Baby Crying demonstrate improved macro-F1 under cross-domain evaluation, along with leakage-aware splits and real-time feasibility for on-device monitoring.
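The abstract names calibrated posterior fusion with entropy-gated weighting but does not give its formula. As a rough illustration only, the sketch below assumes one common reading: each dataset-specific model's logits are calibrated by temperature scaling, and each calibrated posterior is weighted by a gate that decreases with its normalized entropy. The function names, the exponential gate, and the `gate_sharpness` parameter are all hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (assumed calibration step)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gated_fusion(logits_per_model, temperatures, gate_sharpness=1.0):
    """Fuse per-model posteriors, down-weighting uncertain (high-entropy) models.

    Hypothetical gate: weight_k is proportional to
    exp(-gate_sharpness * H_k / log(num_classes)).
    """
    posteriors = [softmax(l, t) for l, t in zip(logits_per_model, temperatures)]
    num_classes = len(posteriors[0])
    if num_classes < 2:
        raise ValueError("need at least two classes to normalize entropy")
    max_entropy = math.log(num_classes)
    gates = [math.exp(-gate_sharpness * entropy(p) / max_entropy)
             for p in posteriors]
    gate_total = sum(gates)
    weights = [g / gate_total for g in gates]
    fused = [sum(w * p[c] for w, p in zip(weights, posteriors))
             for c in range(num_classes)]
    return fused, weights

# Example: model 0 is confident about class 0, model 1 is nearly uniform.
fused, weights = entropy_gated_fusion(
    [[4.0, 0.5, 0.2], [1.0, 0.9, 1.1]], temperatures=[1.5, 1.0]
)
```

Under these assumptions the confident model receives the larger weight, while the near-uniform model still contributes to the fused posterior, which is one way a fusion rule could retain domain-specific expertise without letting a single dataset dominate.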
| Comments: | 7 pages, to appear in Proc. Int. Conf. IEEE Engineering in Medicine and Biology Society (EMBC 2026), Toronto, Canada, July 26-30 2026 |
| Subjects: | Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD) |
| Cite as: | arXiv:2603.02245 [eess.AS] |
| | (or arXiv:2603.02245v3 [eess.AS] for this version) |
| | https://doi.org/10.48550/arXiv.2603.02245 (arXiv-issued DOI via DataCite) |
Submission history
From: Martin Bouchard
[v1] Tue, 24 Feb 2026 23:44:41 UTC (40,567 KB)
[v2] Thu, 5 Mar 2026 20:32:59 UTC (11,192 KB)
[v3] Wed, 13 May 2026 17:05:54 UTC (11,070 KB)
