IMPACT OF AUDIO SIGNAL DURATION ON THE ACCURACY OF SPEAKER VOICE IDENTIFICATION

1 Lviv Polytechnic National University
2 Lviv Polytechnic National University

This paper investigates the capability of a voice-embedding-based system to identify speakers. We use a set of audio recordings from five speakers and construct clips of durations ranging from 5 to 600 seconds. Embeddings are extracted with the pyannote-audio neural network, after which similarity coefficients are computed between embeddings of clips from the same speaker (intra-speaker similarity) and between clips from different speakers (inter-speaker similarity). We study how clip duration affects the protection zone when separating speakers into “own/other.” The experiments show that there exists a clip duration that yields a relatively wide protection zone, which increases the probability of accurate voice-based identification. The results may be used in future research on biometric verification.
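As an illustration of the pipeline described above, the following Python sketch shows how per-clip speaker embeddings can be extracted with pyannote-audio and compared via cosine similarity. The checkpoint name pyannote/embedding, the clip file names, and the protection-zone computation (gap between the lowest intra-speaker and the highest inter-speaker similarity) are illustrative assumptions, not the exact experimental setup of this paper.

```python
# Minimal sketch: extract speaker embeddings with pyannote-audio and
# estimate a "protection zone" between intra- and inter-speaker similarities.
# Assumptions: the pyannote/embedding checkpoint (a Hugging Face access token
# may be required) and hypothetical WAV clips of a fixed duration per speaker.
import itertools

import numpy as np
from pyannote.audio import Model, Inference
from scipy.spatial.distance import cosine

model = Model.from_pretrained("pyannote/embedding")
inference = Inference(model, window="whole")  # one embedding vector per whole clip

# Hypothetical clips: two clips per speaker at the same duration.
clips = {
    "speaker1": ["speaker1_clip1.wav", "speaker1_clip2.wav"],
    "speaker2": ["speaker2_clip1.wav", "speaker2_clip2.wav"],
}

# Extract one embedding per clip.
embeddings = {
    spk: [np.asarray(inference(path)) for path in paths]
    for spk, paths in clips.items()
}

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return 1.0 - cosine(a, b)

# Intra-speaker similarities: pairs of clips from the same speaker.
intra = [
    similarity(a, b)
    for embs in embeddings.values()
    for a, b in itertools.combinations(embs, 2)
]

# Inter-speaker similarities: pairs of clips from different speakers.
inter = [
    similarity(a, b)
    for embs1, embs2 in itertools.combinations(embeddings.values(), 2)
    for a in embs1
    for b in embs2
]

# Assumed definition of the protection zone: the margin between the lowest
# intra-speaker similarity and the highest inter-speaker similarity
# (a positive value means the two classes are separable by a threshold).
protection_zone = min(intra) - max(inter)
print(f"intra min = {min(intra):.3f}, inter max = {max(inter):.3f}, "
      f"protection zone = {protection_zone:.3f}")
```

Repeating this computation for clips of different durations (e.g., 5 to 600 seconds) would show how the width of the protection zone depends on clip length.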

1. B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” Proc. Interspeech 2020. [Online]. Available: https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1137.pdf
2. H. Bredin, A. Laurent, and A. Gillies, “Pyannote.audio: neural building blocks for speaker diarization,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020). [Online]. Available: https://ieeexplore.ieee.org/document/9053198
3. H. Ruda, D. Sabodashko, H. Mykytyn, M. Shved, S. Borduliak, and N. Korshun, “Specifics of creating and distributing phishing web resources,” Cybersecurity: Education, Science, Technique, vol. 2, no. 12, pp. 80–88, 2024. [Online]. Available: https://csecurity.kubg.edu.ua/index.php/journal/article/view/645/508
4. A. Levy, B. Riva Shalom, and M. Chalamish, “A guide to similarity measures and their data science applications,” Journal of Big Data, vol. 12, Art. no. 188, 2025. [Online]. Available: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-025-01227-1
5. Google Cloud, “Chirp 3 HD: High-quality, low-latency text-to-speech model,” Google Cloud Text-to-Speech, 2024. [Online]. Available: https://cloud.google.com/text-to-speech/docs/chirp3-hd
6. B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” arXiv preprint arXiv:1911.01255, 2019. [Online]. Available: https://arxiv.org/abs/1911.01255
7. H. Bredin, A. Laurent, and Y. Zhong, “End-to-end domain-adversarial voice activity detection,” Proc. Interspeech 2021, pp. 4658–4662, 2021. [Online]. Available: https://www.isca-archive.org/interspeech_2021/bredin21_interspeech.pdf
8. E. F. Krause, Taxicab Geometry: An Adventure in Non-Euclidean Geometry. New York, NY, USA: Dover Publications, 1986.
9. T. Souravlas, I. Roumeliotis, C. Roumeliotis, and C. Zissis, “Time series similarity measures and deep learning: State-of-the-art review,” arXiv preprint arXiv:2412.20574, 2024. [Online]. Available: https://arxiv.org/abs/2412.20574