Improving Amazigh POS tagging using machine learning

Tamazight, Berber, and Amazigh are different names for the same language. It is spoken across a large geographical area that includes North Africa, the Sahara, and the Sahel, principally in Morocco, Algeria, Tunisia, and Mali. In terms of natural language processing, it is considered a low-resource language. This paper applies, for the first time, several machine learning algorithms to part-of-speech (POS) tagging of Amazigh. These algorithms are Trigrams'n'Tags (TnT), Brill tagging, the hidden Markov model (HMM), Unigram, Bigram, Unigram + Bigram, and conditional random fields (CRF). We also present a POS tagger that uses CRF with our own feature-extraction function for the Amazigh language. An accurate POS tagger for Amazigh matters because it can be used to enrich the annotated corpus, which is a key step toward other NLP applications. In this research, we used an annotated Amazigh corpus of 60,000 tokens with 28 tags, and we preprocessed it into the input format that each model requires. A detailed comparison of the performance results is presented to identify the best model, and the results show that our CRF-based tagger outperforms the other techniques.
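To make the idea of a feature-extraction function for a CRF tagger concrete, the following is a minimal Python sketch. It is an illustration only: the function names (token_features, sentence_features) and the chosen features (affixes, neighbouring words, simple character tests) are assumptions and are not the feature set used in this work. The sample tokens in the usage note are placeholders, not Amazigh words.

```python
def token_features(sentence, i):
    """Build a feature dict for token i of a tokenized sentence (list of str).

    Hypothetical feature set for illustration; the paper's own features
    are not specified here.
    """
    word = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "prefix2": word[:2],           # first two characters
        "suffix2": word[-2:],          # last two characters
        "suffix3": word[-3:],          # last three characters
        "is_digit": word.isdigit(),
        "is_punct": not any(ch.isalnum() for ch in word),
        "length": len(word),
    }
    # Context features: previous and next token, or sentence-boundary flags.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(sentence):
    """Return one feature dict per token of the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]
```

Calling sentence_features(["abc", "defgh", "."]) returns three feature dicts, one per token. Sequences of dicts in this shape are the kind of input that CRF toolkits such as sklearn-crfsuite accept, paired with one tag sequence per sentence; whether that toolkit matches the one used here is an assumption.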

