A Comparative Study of Inference Frameworks for Node.js Microservices on Edge Devices

2025, pp. 233–238

1. Lviv Polytechnic National University, Ukraine
2. Lviv Polytechnic National University, Ukraine
3. M. S. Poliakov Institute of Geotechnical Mechanics of the National Academy of Sciences of Ukraine

Deploying small language models (SLMs) on edge devices has become increasingly viable thanks to advances in model compression and efficient inference frameworks. Running models on-device offers significant benefits, including privacy through local processing, reduced latency, and greater autonomy. This paper presents a comparative review and analysis of Node.js inference frameworks that operate on-device, evaluating them in terms of performance, memory consumption, isolation, and deployability. It concludes with a discussion and a decision matrix to guide developers toward an optimal choice. This approach moves microservices one step closer to becoming first-class intelligent services rather than clients of external AI.
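The decision matrix mentioned above can be sketched as a simple weighted-scoring routine. Note that the criteria weights, the framework list, and the 1–5 ratings below are illustrative assumptions chosen for demonstration, not measurements or conclusions from the paper:

```typescript
// Minimal decision-matrix sketch: rank Node.js inference frameworks by a
// weighted score over the paper's four evaluation criteria.
type Criterion = "performance" | "memory" | "isolation" | "deployability";

// Assumed weights (must sum to 1); adjust to your project's priorities.
const weights: Record<Criterion, number> = {
  performance: 0.4,
  memory: 0.3,
  isolation: 0.15,
  deployability: 0.15,
};

// Hypothetical 1–5 ratings, purely for illustration.
const frameworks: Record<string, Record<Criterion, number>> = {
  "node-llama-cpp":  { performance: 5, memory: 3, isolation: 2, deployability: 4 },
  "ONNX Runtime":    { performance: 4, memory: 4, isolation: 3, deployability: 4 },
  "Transformers.js": { performance: 3, memory: 3, isolation: 4, deployability: 5 },
  "WasmEdge":        { performance: 4, memory: 4, isolation: 5, deployability: 3 },
};

// Weighted sum of one framework's ratings.
function score(ratings: Record<Criterion, number>): number {
  return (Object.keys(weights) as Criterion[]).reduce(
    (sum, c) => sum + weights[c] * ratings[c],
    0,
  );
}

// All frameworks sorted from highest to lowest score.
function rank(): Array<[string, number]> {
  return Object.entries(frameworks)
    .map(([name, r]): [string, number] => [name, score(r)])
    .sort((a, b) => b[1] - a[1]);
}
```

With weights like these, the ranking rewards raw performance first; a deployment that prioritizes sandboxing on shared edge hardware would instead raise the `isolation` weight.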
