POS Tagging without a Tagger: Using Aligned Corpora for Transferring Knowledge to Under-Resourced Languages

Ines Turki Khemakhem, Salma Jamoussi, Abdelmajid Ben Hamadou


Most languages lack sufficient resources and tools for developing Human Language Technologies (HLT). Such technologies have been developed mainly for languages with large resources and tools. In this paper, we show that under-resourced languages can benefit from these existing resources and tools to develop their own HLT, taking as an example part-of-speech (POS) tagging, one of the most fundamental Natural Language Processing (NLP) tasks. POS tagging assigns each word a tag that reflects its syntactic features, taking the word's context into account. The solution we propose uses an aligned parallel corpus as a bridge between a resource-rich language and an under-resourced language. This kind of corpus is usually available. The resource-rich side of the corpus is annotated first. These POS annotations are then used to predict the annotation of the under-resourced side through alignment training. After this training step, we obtain a matching table between the two languages, which is then used to annotate input text. We evaluated the proposed approach on one language pair: English as the resource-rich language and Arabic as the under-resourced language. We used the IWSLT10 training corpus and the English TreeTagger. On a test corpus extracted from IWSLT08, the approach achieved an F-score of 89%, and it can be extended to other NLP tasks.
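The projection idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the alignment is trained or how the matching table is consulted, so the sketch assumes that word alignments are already given as index pairs and that each target word simply receives the tag most often projected onto it. The function names, the tag labels (DET, NOUN, VERB), and the transliterated toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict

def build_matching_table(aligned_corpus):
    """Project source-side POS tags onto target words through alignment links.

    aligned_corpus: iterable of (source_tags, target_words, links), where
    links is a list of (source_index, target_index) pairs.
    Returns a dict mapping each target word to a Counter of projected tags.
    """
    table = defaultdict(Counter)
    for source_tags, target_words, links in aligned_corpus:
        for src_i, tgt_i in links:
            table[target_words[tgt_i]][source_tags[src_i]] += 1
    return table

def tag_sentence(words, table, unknown_tag="UNK"):
    """Tag each word with its most frequently projected tag (assumed rule)."""
    tagged = []
    for word in words:
        if word in table:
            tagged.append((word, table[word].most_common(1)[0][0]))
        else:
            tagged.append((word, unknown_tag))
    return tagged

# Hypothetical toy corpus: tagged English side, transliterated target side.
toy_corpus = [
    # "the boy reads" -> "yaqra al-walad"; "the" and "boy" both link to "al-walad"
    (["DET", "NOUN", "VERB"], ["yaqra", "al-walad"], [(0, 1), (1, 1), (2, 0)]),
    # "the boy sleeps" -> "yanam al-walad"; only "boy" links to "al-walad"
    (["DET", "NOUN", "VERB"], ["yanam", "al-walad"], [(1, 1), (2, 0)]),
]

table = build_matching_table(toy_corpus)
result = tag_sentence(["yanam", "al-walad", "kitab"], table)
print(result)
```

In this toy run, "al-walad" collects one DET vote and two NOUN votes, so it is tagged NOUN, while the unseen word "kitab" falls back to the placeholder tag. A realistic system would need a trained word aligner and a strategy for unaligned or ambiguous words, neither of which the abstract details.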


POS tagging; alignment; parallel corpus; under-resourced languages.
