Improving Amazigh POS tagging using machine learning

S. Amri; R. Bani; L. Zenkouar; Z. Guennoun

Tamazight, Berber, and Amazigh are different names for the same language. It is spoken across a wide geographical area that includes North Africa, the Sahara, and the Sahel, principally in Morocco, Algeria, Tunisia, and Mali. In terms of natural language processing, it is considered a low-resource language. This paper presents, for the first time, the application of several machine learning algorithms to Amazigh part-of-speech (POS) tagging. These algorithms include Trigrams'n'Tags (TnT), Brill tagging, the hidden Markov model (HMM), Unigram, Bigram, Unigram + Bigram, and conditional random fields (CRF). We also present a POS tagger based on CRF that uses our own feature extraction function for the Amazigh language. A well-performing POS tagger matters for Amazigh because it can be used to enrich the annotated corpus, which is a key step toward other NLP applications. In this research, we used an annotated Amazigh corpus of 60,000 tokens with 28 tags, and we carried out the preprocessing needed to put it in a suitable form for each model. A detailed comparison of the performance results is presented to identify the best model, and the results show that our application of the CRF model outperforms the other techniques.
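As an illustration of what a feature extraction function for a CRF tagger can look like, here is a minimal Python sketch. The abstract does not list the features used in this work, so everything in it is an assumption made for illustration: the function names `word2features` and `sent2features`, the chosen features (lowercased form, short prefixes and suffixes, digit check, neighbouring words), and the placeholder tokens, which are not real Amazigh words.

```python
from typing import Dict, List, Union

Feature = Union[str, bool, int]


def word2features(sentence: List[str], i: int) -> Dict[str, Feature]:
    """Build a feature dictionary for the token at position i.

    Hypothetical feature set for illustration only; it is not the
    feature set used by the authors.
    """
    word = sentence[i]
    features: Dict[str, Feature] = {
        "bias": 1,
        "word.lower": word.lower(),
        "prefix2": word[:2],    # short prefix as a rough morphological cue
        "suffix2": word[-2:],   # short suffixes as rough morphological cues
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "length": len(word),
    }
    # Context features: the previous and next tokens, or boundary markers.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sentence: List[str]) -> List[Dict[str, Feature]]:
    """Extract one feature dictionary per token of a sentence."""
    return [word2features(sentence, i) for i in range(len(sentence))]


if __name__ == "__main__":
    # Placeholder tokens, not real Amazigh words.
    for token_features in sent2features(["abc", "defgh", "12"]):
        print(token_features)
```

Per-token dictionaries of this kind are a common input format for CRF toolkits; a sequence of them, paired with the gold tag sequence of the same length, would form one training example.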
