Optimization of the Data Labeling Process in Weakly Supervised Environments

Kostiantyn Minkov

The article investigates how to increase the efficiency of data labeling in weakly supervised environments using Active Learning methods. The work is motivated by the rapid growth of unstructured and partially labeled data, the high cost of manual annotation, the shortage of qualified experts, and the negative impact of noisy labels on the quality of machine learning models. Traditional approaches to forming training samples do not provide an optimal balance between model quality, resource costs, and time, which necessitates adaptive mechanisms for controlling the labeling process. A formal model of the annotation process is proposed; it includes an annotation cost function and a budget constraint, which together allow the efficiency of sample selection to be assessed. A system architecture is developed that comprises an uncertainty assessment module, a budget management module, an expert interface, a label validation module, and an Active Learning cycle manager. This structure integrates algorithmic, resource, and expert components in a single controlled loop. The method is implemented with a modern technology stack (Python, PyTorch, FastAPI, PostgreSQL, Label Studio, MLflow) as a microservice architecture with a REST API and an MLOps pipeline. The proposed approach provides scalability, reproducibility of experiments, and the possibility of integration into industrial information systems. Experimental results indicate that the method is effective: under a limited budget, Accuracy and F1-score increased and labeling costs decreased compared with random selection and with classic Active Learning without budget optimization. It is shown that adaptive budget management and integrated label validation can reduce the negative impact of noisy annotations and increase the economic feasibility of the training process.
The results indicate that the developed method is practically suitable for weakly supervised information systems and provide a basis for further research on integrating reinforcement learning, multi-agent expert interaction models, and automated weak label generation.
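The abstract does not state the form of the annotation cost function or the selection rule, so the following is only a minimal sketch of how budget-constrained sample selection might look. It assumes entropy as the uncertainty measure, a fixed positive cost per sample, and a greedy rule that ranks candidates by uncertainty per unit cost. The names `Candidate`, `entropy`, and `select_batch` are hypothetical and are not taken from the article.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """An unlabeled sample with model probabilities and an annotation cost."""
    sample_id: str
    probs: List[float]  # predicted class probabilities (assumed model output)
    cost: float         # assumed cost of one expert label; must be > 0


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_batch(candidates: List[Candidate], budget: float) -> Tuple[List[str], float]:
    """Greedily pick samples by uncertainty per unit cost within the budget."""
    ranked = sorted(
        candidates,
        key=lambda c: entropy(c.probs) / c.cost,
        reverse=True,
    )
    chosen: List[str] = []
    spent = 0.0
    for c in ranked:
        # Skip any sample that would push total cost over the budget.
        if spent + c.cost <= budget:
            chosen.append(c.sample_id)
            spent += c.cost
    return chosen, spent


# Toy pool with made-up probabilities and costs, for illustration only.
pool = [
    Candidate("a", [0.50, 0.50], cost=2.0),
    Candidate("b", [0.90, 0.10], cost=1.0),
    Candidate("c", [0.60, 0.40], cost=1.0),
    Candidate("d", [0.99, 0.01], cost=0.5),
]
chosen, spent = select_batch(pool, budget=3.0)
print(chosen, spent)
```

In a full Active Learning cycle, a selection step of this kind would be repeated after each retraining round, with the remaining budget passed in each time. A greedy ratio rule is a common heuristic for knapsack-style constraints, but it is not guaranteed to be optimal and may differ from the article's method.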

cost-sensitive learning
