The problem of efficient, high-performance order scheduling is a common combinatorial optimization problem in various industrial contexts. Creating a model capable of generating schedules balanced in terms of quality and computation time poses a significant challenge due to the large action space. This study proposes a high-performance environment and a reinforcement learning model for allocating orders to resources using invalid action masking. The developed reinforcement learning solution overcomes the limitations of traditional heuristic and exact methods regarding computational performance and efficiency. The research included the design of a Gymnasium-compatible simulation environment, performance benchmarking, development of optimized environment state-updating procedures, feature generation strategies, and evaluation of PPO and MaskablePPO models. The environment implemented incremental updates of the feature state and action masks with extensive NumPy vectorization, significantly reducing computational overhead and improving compatibility with deep learning policies. Invalid action masking replaced penalty-based constraints with a hard restriction of the agent's action space to feasible decisions, enhancing the policy's accuracy by focusing learning on valid and more optimal choices. Datasets containing up to 500 orders were generated, on which PPO and MaskablePPO models from the Stable-Baselines3 library were trained. Each model was trained for 100,000 iterations, with training progress monitored in TensorBoard. The masked version required 1.49 minutes for training, while the unmasked model completed training in 1.2 minutes. For MaskablePPO, the average per-step penalty was 2.41, while for PPO it was 325,000. These results demonstrate that standard PPO frequently selected invalid actions and collected heavy penalties, whereas MaskablePPO accumulated only penalties related to schedule length. As a result, on the test dataset, MaskablePPO completed the schedule calculation in 0.18 seconds and produced a schedule with a total duration of 4,590 minutes, compared to 5.4 seconds and 5,127 minutes for standard PPO, which attempted invalid actions in 96% of cases. It was found that action masking significantly improved the model's accuracy and convergence despite a slightly longer training time. The results reveal the strong potential of reinforcement learning approaches for order scheduling and combinatorial optimization problems in general. The proposed MaskablePPO model reduced order allocation time compared to the traditional exact CP-SAT method while maintaining higher schedule quality than the SPT heuristic on problems with 50–500 jobs. This work establishes a foundation for future research involving more complex model architectures based on Set Transformers, Graph Neural Networks, and Pointer Networks, which enable effective generalization and allow trained policies to be applied to problem instances with higher input dimensions than those seen during training.
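For illustration, the sketch below shows one way the masking mechanism described above can be wired together: a Gymnasium-compatible environment that maintains a NumPy boolean action mask and a MaskablePPO agent (provided by the sb3-contrib extension of Stable-Baselines3) trained on it. The environment class, observation features, resource-assignment rule, and reward shaping here are illustrative assumptions, not the paper's implementation.

    # Minimal sketch, assuming a simplified scheduling task: each step the agent
    # picks one unscheduled order, which is placed on the least-loaded resource.
    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces
    from sb3_contrib import MaskablePPO
    from sb3_contrib.common.wrappers import ActionMasker

    class ToySchedulingEnv(gym.Env):
        """Hypothetical environment: actions are order indices; finished orders become invalid."""

        def __init__(self, n_orders=50, n_resources=5):
            super().__init__()
            self.n_orders, self.n_resources = n_orders, n_resources
            self.action_space = spaces.Discrete(n_orders)  # choose the next order to schedule
            self.observation_space = spaces.Box(0.0, np.inf, (n_orders + n_resources,), np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.durations = self.np_random.uniform(1.0, 10.0, self.n_orders).astype(np.float32)
            self.assigned = np.zeros(self.n_orders, dtype=bool)
            self.loads = np.zeros(self.n_resources, dtype=np.float32)
            return self._obs(), {}

        def _obs(self):
            # Remaining order durations plus current resource loads (illustrative features)
            return np.concatenate([self.durations * ~self.assigned, self.loads])

        def action_masks(self):
            # Vectorized mask update: only not-yet-scheduled orders are valid actions
            return ~self.assigned

        def step(self, action):
            self.assigned[action] = True
            resource = int(np.argmin(self.loads))        # place on the least-loaded resource
            self.loads[resource] += self.durations[action]
            terminated = bool(self.assigned.all())
            reward = -float(self.loads.max()) if terminated else 0.0  # penalize final makespan
            return self._obs(), reward, terminated, False, {}

    env = ActionMasker(ToySchedulingEnv(), lambda e: e.action_masks())
    model = MaskablePPO("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=10_000)

Because the mask is recomputed from a boolean array that is updated incrementally at each step, the policy only ever samples feasible actions, which is the behaviour the abstract contrasts with the penalty-based unmasked PPO baseline.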