Models and Methods for Speech Separation in Digital Systems

2024; pp. 121-127

1 Ivan Franko National University of Lviv, Infineon Technologies
2 Ivan Franko National University of Lviv

The main purpose of the article is to describe state-of-the-art approaches to speech separation and to demonstrate the structures and challenges of building and training such systems. Designing an efficient, optimized neural network model for speech separation requires an encoder-decoder structure with a mask-estimation flow. The fully convolutional SuDoRM-RF model demonstrates high efficiency with a relatively small number of parameters and can be accelerated on hardware that supports convolutional operations. The highest separation performance is shown by the SepTDA model, which reaches 24 dB SI-SNR with 21.2 million trainable parameters, while SuDoRM-RF, with only 2.66 million, reaches 12.2 dB. Other transformer-based approaches demonstrate almost the same performance as SepTDA but require more trainable parameters.
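To make the two quantities in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the encoder / mask-estimation / decoder flow used by TasNet-style separators such as Conv-TasNet and SuDoRM-RF (references 15 and 16 below), together with the SI-SNR metric (reference 11) in which the 24 dB and 12.2 dB figures are reported. The class name, layer sizes, and the sigmoid mask head are illustrative assumptions, not the actual configurations of those models.

```python
# Illustrative sketch, not the article's code: encoder / mask estimator /
# decoder separation flow, plus the SI-SNR metric. All sizes are assumptions.
import torch
import torch.nn as nn


def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant signal-to-noise ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the residual counts as noise.
    s_target = (torch.dot(estimate, target) / (torch.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * torch.log10((s_target.pow(2).sum() + eps) / (e_noise.pow(2).sum() + eps))


class MaskingSeparator(nn.Module):
    """Hypothetical two-speaker separator: encode, estimate masks, decode."""

    def __init__(self, n_src: int = 2, n_filters: int = 512, kernel: int = 16, stride: int = 8):
        super().__init__()
        self.n_src = n_src
        # Learned analysis filterbank replacing an STFT front end.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Stand-in for the separator core (U-ConvBlocks, dual-path RNN, ...).
        self.masker = nn.Sequential(nn.Conv1d(n_filters, n_filters * n_src, 1), nn.Sigmoid())
        # Learned synthesis filterbank mapping masked features back to waveforms.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mix: torch.Tensor) -> torch.Tensor:  # mix: (batch, 1, time)
        latent = self.encoder(mix)                          # (batch, F, frames)
        masks = self.masker(latent).view(mix.size(0), self.n_src, -1, latent.size(-1))
        sources = masks * latent.unsqueeze(1)               # one mask per speaker
        return torch.stack([self.decoder(s) for s in sources.unbind(dim=1)], dim=1)


# Usage: separate a 1-second, 8 kHz mixture into two estimated sources,
# then score one estimate against the mixture as a stand-in reference.
mix = torch.randn(1, 1, 8000)
est = MaskingSeparator()(mix)  # (1, 2, 1, 8000)
print(si_snr(est[0, 0, 0], mix[0, 0, : est.size(-1)]))
```

In real training, each estimated source would be scored against its clean reference, with permutation-invariant training (reference 12) resolving the source-to-speaker assignment.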

  1. M. Lichouri, K. Lounnas, R. Djeradi & A. Djeradi. (2022). Performance of End-to-End vs Pipeline Spoken Language Understanding Models on Multilingual Synthetic Voice. In 2022 International Conference on Advanced Aspects of Software Engineering (pp. 1-6). ICAASE. DOI: https://doi.org/10.1109/ICAASE56196.2022.9931594
  2. Z.-Q. Wang, J. L. Roux & J. R. Hershey. (2018). Alternative Objective Functions for Deep Clustering. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 686-690). ICASSP. DOI: https://doi.org/10.1109/ICASSP.2018.8462507
  3. E. Tzinis, S. Venkataramani, Z. Wang, C. Subakan & P. Smaragdis. (2020). Two-Step Sound Source Separation: Training On Learned Latent Targets. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 31-35). ICASSP. DOI: https://doi.org/10.1109/ICASSP40776.2020.9054172
  4. Y. Luo, Z. Chen & T. Yoshioka. (2020). Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 46-50). ICASSP. DOI: https://doi.org/10.1109/ICASSP40776.2020.9054266
  5. J. Q. Yip et al. (2024). SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 326-330). ICASSP. DOI: https://doi.org/10.1109/ICASSP48485.2024.10447030
  6. H. Pu, C. Cai, M. Hu, T. Deng, R. Zheng & J. Luo. (2021). Towards Robust Multiple Blind Source Localization Using Source Separation and Beamforming. In Sensors. DOI: https://doi.org/10.3390/s21020532
  7. D. Wang & J. Chen. (2018, October). Supervised Speech Separation Based on Deep Learning: An Overview. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 26, no. 10, pp. 1702-1726). IEEE. DOI: https://doi.org/10.1109/TASLP.2018.2842159
  8. Y. Luo & N. Mesgarani. (2018). TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 696-700). ICASSP. DOI: https://doi.org/10.1109/ICASSP.2018.8462116
  9. E. Tzinis, Z. Wang, X. Jiang et al. (2021). Compute and Memory Efficient Universal Sound Source Separation. In Journal of Signal Processing Systems (vol. 94, pp. 245-259). DOI: https://doi.org/10.1007/s11265-021-01683-x
  10. J. R. Hershey, Z. Chen, J. Le Roux & S. Watanabe. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 31-35). ICASSP. DOI: https://doi.org/10.1109/ICASSP.2016.7471631
  11. J. L. Roux, S. Wisdom, H. Erdogan & J. R. Hershey. (2019). SDR - Half-baked or Well Done? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 626-630). ICASSP. DOI: https://doi.org/10.1109/ICASSP.2019.8683855
  12. M. Kolbæk, D. Yu, Z.-H. Tan & J. Jensen. (2017). Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 25, no. 10, pp. 1901-1913, Oct. 2017). DOI: https://doi.org/10.1109/TASLP.2017.2726762
  13. J. Cosentino et al. (2020). LibriMix: An Open-Source Dataset for Generalizable Speech Separation. In arXiv: Audio and Speech Processing. DOI: https://doi.org/10.48550/arxiv.2005.11262
  14. Y. Dauphin et al. (2016). Language Modeling with Gated Convolutional Networks. In International Conference on Machine Learning. DOI: https://doi.org/10.48550/arxiv.1612.08083
  15. Y. Luo & N. Mesgarani. (2019, August). Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 8, pp. 1256-1266). DOI: https://doi.org/10.1109/TASLP.2019.2915167
  16. E. Tzinis, Z. Wang & P. Smaragdis. (2020). Sudo RM -RF: Efficient Networks for Universal Audio Source Separation. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (pp. 1-6). MLSP. DOI: https://doi.org/10.1109/MLSP49062.2020.9231900
  17. Y. Liu & D. Wang. (2019, December). Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 12, pp. 2092-2102). DOI: https://doi.org/10.1109/TASLP.2019.2941148
  18. N. Zeghidour & D. Grangier. (2021). Wavesplit: End-to- End Speech Separation by Speaker Clustering. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 29, pp. 2840-2849). DOI: https://doi.org/10.1109/TASLP.2021.3099291
  19. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi & J. Zhong. (2021). Attention Is All You Need In Speech Separation. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 21-25). ICASSP. DOI: https://doi.org/10.1109/ICASSP39728.2021.9413901
  20. S. Zhao & B. Ma. (2023). MossFormer: Pushing the Performance Limit of Monaural Speech Separation Using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1-5). ICASSP. DOI: https://doi.org/10.1109/ICASSP49357.2023.10096646
  21. S. Lutati et al. (2023). Separate and Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation. In arXiv abs/2301.10752. DOI: https://doi.org/10.48550/arxiv.2301.10752
  22. S. Zhao et al. (2024). MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 10356-10360). ICASSP. DOI: https://doi.org/10.1109/ICASSP48485.2024.10445985
  23. Y. Lee, S. Choi, B.-Y. Kim, Z.-Q. Wang & S. Watanabe. (2024). Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 446-450). ICASSP. DOI: https://doi.org/10.1109/ICASSP48485.2024.10446032