DATA PREPARATION STRATEGIES IN KUBEFLOW FOR CLOUD-NATIVE AI SYSTEMS

1 Lviv Polytechnic National University
2 Lviv Polytechnic National University, Ukraine

This article presents the main findings of a study of data preparation strategies using Kubeflow in cloud-native AI systems deployed on Azure Kubernetes Service. The results show that integrating Kubeflow Pipelines with Azure-native tools enables scalable, automated processing of large datasets and significantly improves training efficiency and model accuracy. TensorFlow Data Validation proved effective in detecting schema anomalies and data drift, which makes data more reliable across iterative ML workflows. In a case study, the implemented pipeline reduced data processing time by 35% and improved pipeline reproducibility through integrated metadata tracking and data versioning. These outcomes highlight Kubeflow’s practical value in supporting efficient, traceable, and production-ready AI pipelines in enterprise-grade cloud environments.
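The two data checks named above, schema anomaly detection and drift detection, can be illustrated with a minimal sketch. It is written in plain Python and is not the TensorFlow Data Validation API or the pipeline used in the study. The schema, the feature names, and the functions `find_schema_anomalies` and `linf_drift` are hypothetical and chosen only for illustration. The drift measure is the L-infinity distance between normalized value frequencies, which is the kind of distance TensorFlow Data Validation applies to categorical features.

```python
from collections import Counter

# Hypothetical schema for illustration: feature name -> expected Python type.
SCHEMA = {"age": int, "country": str}


def find_schema_anomalies(records, schema):
    """List missing features, wrongly typed features, and unexpected features."""
    anomalies = []
    for i, rec in enumerate(records):
        for name, expected in schema.items():
            if name not in rec:
                anomalies.append(f"record {i}: missing feature '{name}'")
            elif not isinstance(rec[name], expected):
                anomalies.append(
                    f"record {i}: feature '{name}' has type "
                    f"{type(rec[name]).__name__}, expected {expected.__name__}"
                )
        for name in rec:
            if name not in schema:
                anomalies.append(f"record {i}: unexpected feature '{name}'")
    return anomalies


def linf_drift(baseline, current):
    """L-infinity distance between the value frequencies of two categorical samples."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    base_freq = Counter(baseline)
    cur_freq = Counter(current)
    values = set(base_freq) | set(cur_freq)
    # Counter returns 0 for values that are absent from one of the samples.
    return max(
        abs(base_freq[v] / len(baseline) - cur_freq[v] / len(current))
        for v in values
    )


if __name__ == "__main__":
    batch = [
        {"age": 30, "country": "UA"},
        {"age": "41", "country": "PL"},     # wrong type for 'age'
        {"country": "UA", "zip": "79000"},  # 'age' missing, 'zip' unexpected
    ]
    for anomaly in find_schema_anomalies(batch, SCHEMA):
        print(anomaly)

    # A pipeline step could compare this value against a chosen threshold.
    print(linf_drift(["UA", "UA", "PL", "PL"], ["UA", "UA", "UA", "PL"]))
```

In a pipeline, checks of this kind would typically run as a step before training, so that a batch that violates the schema or drifts beyond a chosen threshold is flagged before it reaches the model.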
