similarity coefficient

1. Introduction

IMPACT OF AUDIO SIGNAL DURATION ON THE ACCURACY OF SPEAKER VOICE IDENTIFICATION

This paper investigates how well a system based on voice embeddings can identify speakers. We use a set of audio recordings from five speakers and construct clips with durations ranging from 5 to 600 seconds. An embedding is extracted from each clip with a pyannote-audio neural network, after which similarity coefficients are computed between embeddings of clips from the same speaker (intra-speaker similarity) and from different speakers (inter-speaker dissimilarity).
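The comparison described above can be illustrated with a minimal sketch. It assumes the similarity coefficient is cosine similarity, which the text does not specify, and it uses toy three-dimensional vectors in place of real embeddings. The names `cosine_similarity`, `mean_similarities`, and `toy` are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations, product


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarities(embeddings_by_speaker):
    """Return (mean intra-speaker, mean inter-speaker) cosine similarity.

    embeddings_by_speaker maps a speaker label to a list of clip embeddings.
    """
    intra, inter = [], []
    # Intra-speaker: every pair of clips belonging to the same speaker.
    for clips in embeddings_by_speaker.values():
        intra.extend(cosine_similarity(a, b) for a, b in combinations(clips, 2))
    # Inter-speaker: every pair of clips taken from two different speakers.
    for s1, s2 in combinations(embeddings_by_speaker, 2):
        inter.extend(
            cosine_similarity(a, b)
            for a, b in product(embeddings_by_speaker[s1], embeddings_by_speaker[s2])
        )
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy vectors standing in for real embeddings (assumption, illustrative only).
toy = {
    "speaker_a": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "speaker_b": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
intra_mean, inter_mean = mean_similarities(toy)
```

If the embeddings separate speakers well, `intra_mean` should be clearly higher than `inter_mean`. Repeating the computation for each clip duration would show how that gap changes with signal length.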