IoT system for real-time audio information processing

Oleh Osadchuk; Igor Olenych

This paper presents the development and investigation of a speech-to-text conversion and speaker identification system based on a Raspberry Pi microcomputer, designed for local audio data processing in environments with limited network connectivity. The system integrates Silero and WebRTC models for voice activity detection, SpeechBrain for speaker identification, and the Whisper family of models for speech recognition. In particular, a comparative analysis has been conducted on the efficiency of local speech processing using the Whisper Tiny and Whisper Large 2 models versus cloud-based processing through the Whisper-1 and Whisper-1-en APIs (the latter applied exclusively to English-language speech). The study evaluates the impact of sentence length, processing time, memory consumption, and recognition accuracy on system performance. The advantages and resource-related limitations of the models in local and cloud-based IoT environments have been analyzed, and the feasibility of their application in real-time and privacy-sensitive contexts has been determined. Performance metrics of the models under various conditions have been used for the analysis.

Raspberry Pi

IoT

speech-to-text conversion

speaker identification

Whisper models

SpeechBrain
