Review of Deep Learning-Based Sound Event Detection
刘鹏,张振亚,王萍,程红梅
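Abstract:
Sound event detection offers strong privacy protection, high environmental adaptability, and low cost, and can be applied effectively to indoor activity recognition and security monitoring. This article reviews deep learning-based sound event detection methods and, in its discussion of sound datasets, focuses on weakly labeled training and few-shot learning. For sound signal feature extraction, it introduces the commonly used methods and their improvements and presents recent results on combined features and multi-scale feature fusion. It also introduces the evaluation criteria for sound event recognition models, summarizes and analyzes deep learning-based models, and discusses current challenges and directions for future research.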
Keywords: sound event detection; deep learning; weakly labeled training; few-shot learning; feature extraction
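Foundation: Anhui Province Special Support Program for Innovative Leading Talents (皖组办【2022】21号); Key Natural Science Research Project of Anhui Higher Education Institutions (KJ2020A470); Anhui Provincial Support Program for Outstanding Young Talents in Universities (gxyq2022030)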
DOI: 10.13757/j.cnki.cn34-1328/n.2025.03.006