This article presents a hybrid speech separation model designed for efficient deployment on edge devices, optimizing both separation performance and computational cost. The proposed architecture combines the strengths of the Conv-TasNet and SuDoRM-RF models, leveraging their fully convolutional structures to achieve efficient separation with minimal resource usage. On the clean Libri2Mix dataset, the model achieves a separation performance of 10.59 dB SI-SDRi with only 1.17 M parameters and 0.92 GMAC/s.
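For clarity, SI-SDRi denotes the improvement in scale-invariant signal-to-distortion ratio [7] of the separated estimate over the unprocessed mixture. Below is a minimal NumPy sketch of the metric; the function names and the toy signals are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB, following Le Roux et al. [7].

    Both inputs are 1-D arrays of equal length; means are removed first,
    as is conventional for this metric.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimal scale.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled target component
    noise = estimate - target           # everything not explained by the target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def si_sdr_improvement(estimate, mixture, reference) -> float:
    """SI-SDRi: gain of the separated estimate over the input mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy example (synthetic signals, illustrative only):
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)              # 1 s of "speech" at 16 kHz
mixture = reference + rng.standard_normal(16000)    # interference added
estimate = reference + 0.1 * rng.standard_normal(16000)  # a good separation
print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, reference):.2f} dB")
```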
[1] Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). LibriMix: An open-source dataset for generalizable speech separation. ArXiv preprint arXiv:2005.11262. DOI: https://doi.org/10.48550/arXiv.2005.11262.
[2] Hershey, J. R., Chen, Z., Le Roux, J., & Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2016. DOI: https://doi.org/10.1109/ICASSP.2016.7471631.
[3] Maciejewski, M., Wichern, G., McQuinn, E., & Le Roux, J. (2020). WHAMR!: Noisy and reverberant single-channel speech separation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2020. DOI: https://doi.org/10.48550/arXiv.1910.10279.
[4] Lichouri, M., Lounnas, K., Djeradi, R., & Djeradi, A. (2022). Performance of end-to-end vs pipeline spoken language understanding models on multilingual synthetic voice. In Proc. 5th Int. Conf. Advanced Aspects of Software Engineering (ICAASE’22) (Constantine, Algeria), Sep. 2022. DOI: https://doi.org/10.1109/ICAASE56196.2022.9931594.
[5] Zhao, S., Ma, Y., Ni, C., Zhang, C., Wang, H., Nguyen, T. H., Zhou, K., Yip, J., Ng, D., & Ma, B. (2024). MossFormer2: Combining transformer and RNN-free recurrent network for enhanced time-domain monaural speech separation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2024. DOI: https://doi.org/10.48550/arXiv.2312.11825.
[6] Rixen, J., & Renz, M. (2022). SFSRNet: Super-resolution for single-channel audio source separation. Proc. AAAI-22, 36th AAAI Conf. Artificial Intelligence, 36(10), 11220–11228. DOI: https://doi.org/10.1609/aaai.v36i10.21372.
[7] Le Roux, J., Wisdom, S., Erdoğan, H., & Hershey, J. R. (2018). SDR – half-baked or well done? ArXiv preprint arXiv:1811.02508. DOI: https://doi.org/10.48550/arXiv.1811.02508.
[8] Wang, Z.-Q., Cornell, S., Choi, S., Lee, Y., Kim, B.-Y., & Watanabe, S. (2023). TF-GridNet: Making time-frequency domain models great again for monaural speaker separation. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2023. DOI: https://doi.org/10.48550/arXiv.2209.03952.
[9] Li, K., Chen, G., Yang, R., & Hu, X. (2024). SPMamba: State-space model is all you need in speech separation. ArXiv preprint arXiv:2404.02063. DOI: https://doi.org/10.48550/arXiv.2404.02063.
[10] Luo, Y., & Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio, Speech, Language Process., 27(8), 1256–1266. DOI: https://doi.org/10.1109/TASLP.2019.2915167.
[11] Tzinis, E., Wang, Z., & Smaragdis, P. (2020). Sudo rm -rf: Efficient networks for universal audio source separation. In Proc. 2020 IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP). DOI: https://doi.org/10.1109/MLSP49062.2020.9231900.
[12] Li, K., Chen, G., Sang, W., Luo, Y., Chen, Z., Wang, S., He, S., Wang, Z.-Q., Li, A., Wu, Z., & Hu, X. (2025). Advances in speech separation: Techniques, challenges, and future trends. ArXiv preprint arXiv:2508.10830. DOI: https://doi.org/10.48550/arXiv.2508.10830.
[13] Luo, Y., Chen, Z., & Yoshioka, T. (2020). Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2020. DOI: https://doi.org/10.1109/ICASSP40776.2020.9054266.
[14] Yang, L., Liu, W., & Wang, W. (2022). TFPSNet: Time-frequency domain path scanning network for speech separation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2022. DOI: https://doi.org/10.1109/ICASSP43922.2022.9747554.
[15] Xu, M., Li, K., Chen, G., & Hu, X. (2025). TIGER: Time-frequency interleaved gain extraction and reconstruction for efficient speech separation. Proc. Int. Conf. Learning Representations (ICLR) 2025. DOI: https://doi.org/10.48550/arXiv.2410.01469.
[16] Tsemko, A., Santra, A., Kapshii, O., & Pandey, A. (2025). Data-driven processing using parametric neural network for improved Bluetooth channel sounding distance estimation. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2025, 1–5.
[17] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) 2015, 5206–5210. DOI: https://doi.org/10.1109/ICASSP.2015.7178964.
[18] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learning Representations (ICLR) 2015, San Diego, CA. DOI: https://doi.org/10.48550/arXiv.1412.6980.
[19] Arm Ethos-U55. Arm Developer. [Online]. Available: https://developer.arm.com/Processors/Ethos-U55