Building scalable and reliable machine learning models is critical for cloud-native AI systems. Kubeflow provides a robust framework for orchestrating model development workflows. This article presents best practices for ML model development in Kubeflow-based cloud-native systems, with a focus on Azure Kubernetes Service environments. It explores strategies for optimizing cluster configuration, designing modular and reproducible training pipelines, and implementing effective model tracking and versioning processes. Real-world case studies illustrate practical applications of these techniques. The article then evaluates the results using system performance metrics and model development outcomes, and concludes with lessons learned and future trends in cloud-native ML system design.