This study examines the capabilities and limitations of small, 4-bit quantized language models that run locally on Apple Silicon. Four models are benchmarked on a dataset of natural language prompts, measuring runtime metrics (inference time, memory usage, and token throughput) as well as output behavior. The study provides an empirical assessment of the feasibility of deploying language models on resource-constrained devices. The results highlight the trade-offs inherent to small language models and underscore the roles of model size, quantization, and prompt tuning in balancing performance, efficiency, and usability. Building on these insights, future work will extend the evaluation to multi-turn agentic dialogues, analyze the semantic quality of model output, and pursue further optimizations to improve local inference performance.
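To make the measurement setup concrete, the sketch below shows one way such runtime metrics could be collected. It is a minimal illustration, not the study's actual harness: `generate` is a hypothetical stand-in for a locally loaded 4-bit model's generation call (e.g. through a framework such as mlx-lm or llama.cpp, neither confirmed here), and whitespace splitting approximates a real tokenizer's token count.

```python
import time
import resource


def benchmark(generate, prompts):
    """Run each prompt through `generate` and collect runtime metrics.

    `generate` is any callable mapping a prompt string to generated
    text. Token counts are approximated by whitespace splitting, a
    stand-in for a real tokenizer.
    """
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(prompt)
        elapsed = time.perf_counter() - start
        # Peak resident set size for the process so far
        # (bytes on macOS, kilobytes on Linux).
        peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        tokens = len(output.split())
        results.append({
            "prompt": prompt,
            "seconds": elapsed,
            "peak_rss": peak_rss,
            "tokens_per_s": tokens / elapsed if elapsed > 0 else 0.0,
        })
    return results


if __name__ == "__main__":
    # Hypothetical generation callable standing in for a local model.
    def fake_generate(prompt: str) -> str:
        return "example output for: " + prompt

    for row in benchmark(fake_generate, ["Explain quantization briefly."]):
        print(row)
```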