Models and Methods for Speech Separation in Digital Systems

2024, pp. 121-127

1 Ivan Franko National University of Lviv, Infineon Technologies
2 Ivan Franko National University of Lviv

The main purpose of the article is to describe state-of-the-art approaches to speech separation and to demonstrate the structures and challenges of building and training such systems. Designing an efficient, optimized neural network model for speech separation requires an encoder-decoder structure with a mask-estimation flow. The fully convolutional SuDoRM-RF model demonstrates high efficiency with a relatively small number of parameters and can be accelerated by hardware that supports convolutional operations. The highest separation performance has been shown by the SepTDA model, at 24 dB SI-SNR with 21.2 million trainable parameters, while SuDoRM-RF, with only 2.66 million parameters, has demonstrated 12.2 dB. Other transformer-based approaches have demonstrated almost the same performance as SepTDA but require more trainable parameters.
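
To make the encoder-decoder structure with mask-estimation flow concrete, the sketch below shows a minimal TasNet-style time-domain pipeline in PyTorch: a learned 1-D convolutional encoder, a small mask-estimation network, and a transposed-convolution decoder. This is an illustrative sketch only; the class name, layer choices, and hyperparameters are assumptions and do not reproduce the actual SuDoRM-RF or SepTDA implementations.

# Minimal sketch of the time-domain encoder/masker/decoder separation flow.
# Illustrative assumptions throughout; not the SuDoRM-RF or SepTDA code.
import torch
import torch.nn as nn

class TinySeparator(nn.Module):  # hypothetical name
    def __init__(self, n_filters=256, kernel=16, n_speakers=2):
        super().__init__()
        stride = kernel // 2
        # Encoder: a learned 1-D convolutional front end replacing the STFT.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
        # Masker: stand-in for the deep separation network; SuDoRM-RF uses
        # stacked U-ConvBlocks here, SepTDA a transformer-based separator.
        self.masker = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters * n_speakers, 1),
        )
        # Decoder: a transposed convolution maps masked features to waveforms.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)
        self.n_speakers, self.n_filters = n_speakers, n_filters

    def forward(self, mixture):                    # mixture: (batch, 1, time)
        feats = torch.relu(self.encoder(mixture))  # (batch, filters, frames)
        masks = torch.sigmoid(self.masker(feats))  # (batch, filters*spk, frames)
        masks = masks.view(-1, self.n_speakers, self.n_filters, masks.shape[-1])
        # Apply one estimated mask per speaker, then decode each source.
        est = [self.decoder(feats * masks[:, s]) for s in range(self.n_speakers)]
        return torch.stack(est, dim=1).squeeze(2)  # (batch, speakers, time)

model = TinySeparator()
sources = model(torch.randn(4, 1, 16000))  # one second of 16 kHz audio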

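The separation quality above is reported in SI-SNR (scale-invariant signal-to-noise ratio), the standard evaluation metric in this literature; the SDR paper by Le Roux et al. cited below discusses its motivation. It projects the estimate onto the target, s_target = (<est, s> / ||s||^2) s, and scores the residual: SI-SNR = 10 log10(||s_target||^2 / ||e_noise||^2). A minimal computation sketch follows; the function name and interface are illustrative assumptions.

# Scale-invariant SNR (SI-SNR) in dB; higher is better. Sketch only;
# the function name and signature are illustrative, not a library API.
import torch

def si_snr(estimate, target, eps=1e-8):
    # Remove the mean so the metric ignores DC offset.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_target = (<est, s> / ||s||^2) s.
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# A perfect (identical) estimate yields a very large SI-SNR value.
x = torch.randn(2, 16000)
print(si_snr(x, x))
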
  1. M. Lichouri, K. Lounnas, R. Djeradi & A. Djeradi. (2022). Performance of End-to-End vs Pipeline Spoken Language Understanding Models on Multilingual Synthetic Voice. In 2022 International Conference on Advanced Aspects of Software Engineering (ICAASE) (pp. 1-6). DOI: https://doi.org/10.1109/icaase56196.2022.9931594
  2. Z.-Q. Wang, J. L. Roux & J. R. Hershey. (2018). Alternative Objective Functions for Deep Clustering. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 686-690). DOI: https://doi.org/10.1109/icassp.2018.8462507
  3. E. Tzinis, S. Venkataramani, Z. Wang, C. Subakan & P. Smaragdis. (2020). Two-Step Sound Source Separation: Training On Learned Latent Targets. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 31-35). DOI: https://doi.org/10.1109/icassp40776.2020.9054172
  4. Y. Luo, Z. Chen & T. Yoshioka. (2020). Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 46-50). DOI: https://doi.org/10.1109/icassp40776.2020.9054266
  5. J. Q. Yip et al. (2024). SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 326-330). DOI: https://doi.org/10.1109/icassp48485.2024.10447030
  6. H. Pu, C. Cai, M. Hu, T. Deng, R. Zheng & J. Luo. (2021). Towards Robust Multiple Blind Source Localization Using Source Separation and Beamforming. In Sensors. DOI: https://doi.org/10.3390/s21020532
  7. D. Wang & J. Chen. (2018, October). Supervised Speech Separation Based on Deep Learning: An Overview. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 26, no. 10, pp. 1702-1726). DOI: https://doi.org/10.1109/taslp.2018.2842159
  8. Y. Luo & N. Mesgarani. (2018). TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 696-700). DOI: https://doi.org/10.1109/ICASSP.2018.8462116
  9. E. Tzinis, Z. Wang, X. Jiang et al. (2021). Compute and Memory Efficient Universal Sound Source Separation. In Journal of Signal Processing Systems, 94 (pp. 245-259). DOI: https://doi.org/10.1007/s11265-021-01683-x
  10. J. R. Hershey, Z. Chen, J. Le Roux & S. Watanabe. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 31-35). DOI: https://doi.org/10.1109/icassp.2016.7471631
  11. J. L. Roux, S. Wisdom, H. Erdogan & J. R. Hershey. (2019). SDR – Half-baked or Well Done? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 626-630). DOI: https://doi.org/10.1109/icassp.2019.8683855
  12. M. Kolbæk, D. Yu, Z.-H. Tan & J. Jensen. (2017). Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 25, no. 10, pp. 1901-1913). DOI: https://doi.org/10.1109/taslp.2017.2726762
  13. J. Cosentino et al. (2020). LibriMix: An Open-Source Dataset for Generalizable Speech Separation. In arXiv: Audio and Speech Processing. DOI: https://doi.org/10.48550/arxiv.2005.11262
  14. Y. Dauphin et al. (2016). Language Modeling with Gated Convolutional Networks. In International Conference on Machine Learning. DOI: https://doi.org/10.48550/arxiv.1612.08083
  15. Y. Luo & N. Mesgarani. (2019, August). Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 8, pp. 1256-1266). DOI: https://doi.org/10.1109/taslp.2019.2915167
  16. E. Tzinis, Z. Wang & P. Smaragdis. (2020). Sudo RM -RF: Efficient Networks for Universal Audio Source Separation. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). DOI: https://doi.org/10.1109/mlsp49062.2020.9231900
  17. Y. Liu & D. Wang. (2019, December). Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 12, pp. 2092-2102). DOI: https://doi.org/10.1109/taslp.2019.2941148
  18. N. Zeghidour & D. Grangier. (2021). Wavesplit: End-to-End Speech Separation by Speaker Clustering. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 29, pp. 2840-2849). DOI: https://doi.org/10.1109/taslp.2021.3099291
  19. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi & J. Zhong. (2021). Attention Is All You Need In Speech Separation. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 21-25). DOI: https://doi.org/10.1109/icassp39728.2021.9413901
  20. S. Zhao & B. Ma. (2023). MossFormer: Pushing the Performance Limit of Monaural Speech Separation Using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1-5). DOI: https://doi.org/10.1109/icassp49357.2023.10096646
  21. S. Lutati et al. (2023). Separate and Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation. In arXiv abs/2301.10752. DOI: https://doi.org/10.48550/arxiv.2301.10752
  22. S. Zhao et al. (2024). MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 10356-10360). DOI: https://doi.org/10.1109/ICASSP48485.2024.10445985
  23. Y. Lee, S. Choi, B.-Y. Kim, Z.-Q. Wang & S. Watanabe. (2024). Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 446-450). DOI: https://doi.org/10.1109/icassp48485.2024.10446032