Models and Methods for Speech Separation in Digital Systems

pp. 121-127
Ivan Franko National University of Lviv, Infineon Technologies
Ivan Franko National University of Lviv

The main purpose of the article is to describe state-of-the-art approaches to speech separation and to demonstrate the structure of such systems and the challenges of building and training them. Designing an efficient, optimized neural network model for speech recognition requires an encoder-decoder structure with a mask estimation flow. The fully convolutional SuDoRM-RF model demonstrates high efficiency with a relatively small number of parameters and can be accelerated with hardware that supports convolutional operations. The highest separation performance was shown by the SepTDA model, which reaches 24 dB in SI-SNR with 21.2 million trainable parameters, while SuDoRM-RF, with only 2.66 million, reaches 12.2 dB. Other transformer-based neural network approaches have demonstrated almost the same performance as the SepTDA model but require more trainable parameters.
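The separation results above are reported in SI-SNR (scale-invariant signal-to-noise ratio). As an illustration only, the following is a minimal sketch of how this metric is commonly defined for a single estimated source and its reference. The function name `si_snr` and the stabilizing constant `eps` are assumptions made for this example; this is not the evaluation code used in the article.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; `eps` only guards against
    division by zero and log of zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean so that a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target:
    # s_target = (<e, t> / ||t||^2) * t
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Whatever the target does not explain is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


if __name__ == "__main__":
    # Synthetic signals, assumed purely for demonstration.
    target = [math.sin(0.1 * i) for i in range(200)]
    estimate = [s + 0.1 * math.cos(0.37 * i) for i, s in enumerate(target)]
    print(f"SI-SNR: {si_snr(estimate, target):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant. Published results are often given as the improvement over the SI-SNR of the unprocessed mixture.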
