Models and Methods for Speech Separation in Digital Systems

2024, pp. 121-127

1 Ivan Franko National University of Lviv, Infineon Technologies
2 Ivan Franko National University of Lviv

The main purpose of the article is to describe state-of-the-art approaches to speech separation and to demonstrate the structures and challenges of building and training such systems. Designing an efficient, optimized neural network model for speech separation requires an encoder-decoder structure with a mask-estimation flow. The fully convolutional SuDoRM-RF model demonstrates high efficiency with a relatively small number of parameters and can be accelerated by hardware that supports convolutional operations. The highest separation performance has been shown by the SepTDA model, at 24 dB SI-SNR with 21.2 million trainable parameters, while SuDoRM-RF, with only 2.66 million parameters, has demonstrated 12.2 dB. Other transformer-based approaches have demonstrated almost the same performance as SepTDA but require more trainable parameters.
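
To make the encoder-decoder structure with mask-estimation flow concrete, the sketch below shows a minimal TasNet-style time-domain pipeline in PyTorch: a learned 1-D convolutional encoder, a small mask-estimation network, and a transposed-convolution decoder. This is an illustrative sketch only; the class name, layer choices, and hyperparameters are assumptions and do not reproduce the actual SuDoRM-RF or SepTDA implementations.

# Minimal sketch of the time-domain encoder/masker/decoder separation flow.
# Illustrative assumptions throughout; not the SuDoRM-RF or SepTDA code.
import torch
import torch.nn as nn

class TinySeparator(nn.Module):  # hypothetical name
    def __init__(self, n_filters=256, kernel=16, n_speakers=2):
        super().__init__()
        stride = kernel // 2
        # Encoder: a learned 1-D convolutional front end replacing the STFT.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
        # Masker: stand-in for the deep separation network; SuDoRM-RF uses
        # stacked U-ConvBlocks here, SepTDA a transformer-based separator.
        self.masker = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters * n_speakers, 1),
        )
        # Decoder: a transposed convolution maps masked features to waveforms.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)
        self.n_speakers, self.n_filters = n_speakers, n_filters

    def forward(self, mixture):                    # mixture: (batch, 1, time)
        feats = torch.relu(self.encoder(mixture))  # (batch, filters, frames)
        masks = torch.sigmoid(self.masker(feats))  # (batch, filters*spk, frames)
        masks = masks.view(-1, self.n_speakers, self.n_filters, masks.shape[-1])
        # Apply one estimated mask per speaker, then decode each source.
        est = [self.decoder(feats * masks[:, s]) for s in range(self.n_speakers)]
        return torch.stack(est, dim=1).squeeze(2)  # (batch, speakers, time)

model = TinySeparator()
sources = model(torch.randn(4, 1, 16000))  # one second of 16 kHz audio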

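The separation quality above is reported in SI-SNR (scale-invariant signal-to-noise ratio), the standard evaluation metric in this literature; the SDR paper by Le Roux et al. cited below discusses its motivation. It projects the estimate onto the target, s_target = (<est, s> / ||s||^2) s, and scores the residual: SI-SNR = 10 log10(||s_target||^2 / ||e_noise||^2). A minimal computation sketch follows; the function name and interface are illustrative assumptions.

# Scale-invariant SNR (SI-SNR) in dB; higher is better. Sketch only;
# the function name and signature are illustrative, not a library API.
import torch

def si_snr(estimate, target, eps=1e-8):
    # Remove the mean so the metric ignores DC offset.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_target = (<est, s> / ||s||^2) s.
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# A perfect (identical) estimate yields a very large SI-SNR value.
x = torch.randn(2, 16000)
print(si_snr(x, x))
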
  1. M. Lichouri, K. Lounnas, R. Djeradi & A. Djeradi. (2022). Performance of End-to-End vs Pipeline Spoken Language Understanding Models on Multilingual Synthetic Voice. In 2022 International Conference on Advanced Aspects of Software Engineering (ICAASE) (pp. 1-6). DOI: https://doi.org/10.1109/icaase56196.2022.9931594
  2. Z.-Q. Wang, J. L. Roux & J. R. Hershey. (2018). Alternative Objective Functions for Deep Clustering. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 686-690). DOI: https://doi.org/10.1109/icassp.2018.8462507
  3. E. Tzinis, S. Venkataramani, Z. Wang, C. Subakan & P. Smaragdis. (2020). Two-Step Sound Source Separation: Training On Learned Latent Targets. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 31-35). DOI: https://doi.org/10.1109/icassp40776.2020.9054172
  4. Y. Luo, Z. Chen & T. Yoshioka. (2020). Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 46-50). DOI: https://doi.org/10.1109/icassp40776.2020.9054266
  5. J. Q. Yip et al. (2024). SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 326-330). DOI: https://doi.org/10.1109/icassp48485.2024.10447030
  6. H. Pu, C. Cai, M. Hu, T. Deng, R. Zheng & J. Luo. (2021). Towards Robust Multiple Blind Source Localization Using Source Separation and Beamforming. In Sensors. DOI: https://doi.org/10.3390/s21020532
  7. D. Wang & J. Chen. (2018, October). Supervised Speech Separation Based on Deep Learning: An Overview. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 26, no. 10, pp. 1702-1726). DOI: https://doi.org/10.1109/taslp.2018.2842159
  8. Y. Luo & N. Mesgarani. (2018). TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 696-700). DOI: https://doi.org/10.1109/ICASSP.2018.8462116
  9. E. Tzinis, Z. Wang, X. Jiang et al. (2021). Compute and Memory Efficient Universal Sound Source Separation. In Journal of Signal Processing Systems, 94 (pp. 245-259). DOI: https://doi.org/10.1007/s11265-021-01683-x
  10. J. R. Hershey, Z. Chen, J. Le Roux & S. Watanabe. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 31-35). DOI: https://doi.org/10.1109/icassp.2016.7471631
  11. J. L. Roux, S. Wisdom, H. Erdogan & J. R. Hershey. (2019). SDR – Half-baked or Well Done? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 626-630). DOI: https://doi.org/10.1109/icassp.2019.8683855
  12. M. Kolbæk, D. Yu, Z.-H. Tan & J. Jensen. (2017). Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 25, no. 10, pp. 1901-1913). DOI: https://doi.org/10.1109/taslp.2017.2726762
  13. J. Cosentino et al. (2020). LibriMix: An Open-Source Dataset for Generalizable Speech Separation. In arXiv: Audio and Speech Processing. DOI: https://doi.org/10.48550/arxiv.2005.11262
  14. Y. Dauphin et al. (2016). Language Modeling with Gated Convolutional Networks. In International Conference on Machine Learning. DOI: https://doi.org/10.48550/arxiv.1612.08083
  15. Y. Luo & N. Mesgarani. (2019, August). Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 8, pp. 1256-1266). DOI: https://doi.org/10.1109/taslp.2019.2915167
  16. E. Tzinis, Z. Wang & P. Smaragdis. (2020). Sudo RM -RF: Efficient Networks for Universal Audio Source Separation. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). DOI: https://doi.org/10.1109/mlsp49062.2020.9231900
  17. Y. Liu & D. Wang. (2019, December). Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 12, pp. 2092-2102). DOI: https://doi.org/10.1109/taslp.2019.2941148
  18. N. Zeghidour & D. Grangier. (2021). Wavesplit: End-to-End Speech Separation by Speaker Clustering. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 29, pp. 2840-2849). DOI: https://doi.org/10.1109/taslp.2021.3099291
  19. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi & J. Zhong. (2021). Attention Is All You Need In Speech Separation. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 21-25). DOI: https://doi.org/10.1109/icassp39728.2021.9413901
  20. S. Zhao & B. Ma. (2023). MossFormer: Pushing the Performance Limit of Monaural Speech Separation Using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1-5). DOI: https://doi.org/10.1109/icassp49357.2023.10096646
  21. S. Lutati et al. (2023). Separate and Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation. In arXiv abs/2301.10752. DOI: https://doi.org/10.48550/arxiv.2301.10752
  22. S. Zhao et al. (2024). MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 10356-10360). DOI: https://doi.org/10.1109/ICASSP48485.2024.10445985
  23. Y. Lee, S. Choi, B.-Y. Kim, Z.-Q. Wang & S. Watanabe. (2024). Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 446-450). DOI: https://doi.org/10.1109/icassp48485.2024.10446032