Pages: 216-223
Received: March 12, 2024
Revised: March 28, 2024
Accepted: April 01, 2024
Lviv Polytechnic National University
Lviv Polytechnic National University

Artificial intelligence has become part of everyday life. One of the most popular and rapidly advancing AI technologies is speech recognition, which is an integral part of the broader task of multimodal data handling. Multimodal data encompasses voice, audio, and text, so processing it requires a multifaceted approach to understanding information. This paper presents the development of a multimodal handling interface built on Google API technologies. The interface aims to integrate and manage diverse data modalities, including text, audio, and video, within a unified platform. By using Google API functionality for natural language processing, speech recognition, and video analysis, the interface supports processing, analysing, and interpreting multimodal data. The paper discusses the design and implementation of the interface and highlights its features. It also explores potential applications and future directions for the interface in domains such as healthcare, education, and multimedia content creation. Overall, the interface is a step towards advancing multimodal data processing and improving the user experience of working with diverse data sources.
