Standardizing Arabic Dialects for NLP: A BERT-Based Transcoding Approach with a Focus on Moroccan Darija

Arabic dialects pose significant challenges for Natural Language Processing (NLP) because of their linguistic diversity and the lack of standardized resources. While Modern Standard Arabic (MSA) benefits from advanced NLP tools and extensive annotated datasets, dialects such as Moroccan Darija remain underrepresented. This study introduces a BERT-based transcoding framework that bridges the gap between dialectal Arabic and MSA, enabling the use of pre-trained models optimized for MSA, such as AraBERT. By integrating contextual multilingual embeddings, the approach preserves semantic accuracy while handling dialectal variation. In experiments on the MAC dataset, the proposed approach significantly outperforms existing models, including DarijaBERT and mBERT, across all key metrics. The findings suggest that the framework is scalable and applicable to other Arabic dialects and broader NLP tasks. This research advances Arabic language technology by providing a robust solution for dialectal NLP, particularly for sentiment analysis and similar downstream applications.
