Kabyle POS Tagger

Part-of-Speech tagger for Kabyle (kab), a Berber language spoken in Algeria.

Model Details

Attribute	Value
Base model	XLM-RoBERTa-base
Task	Token Classification (POS tagging)
Language	Kabyle (kab)
Training sentences	~1,200
Test F1	87.5%
Test Precision	87.3%
Test Recall	87.8%
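
As a quick consistency check on the figures above, the F1 score can be recomputed from precision and recall. This is a minimal sketch that assumes the reported numbers are aggregated so that F1 is the harmonic mean of precision and recall; if the scores were macro-averaged over tags, the identity need not hold exactly. The helper name `f1_from` is illustrative only.

```python
# Reported test metrics from the table above, expressed as fractions.
precision = 0.873
recall = 0.878


def f1_from(p: float, r: float) -> float:
    """Return the harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


f1 = f1_from(precision, recall)
print(f"F1 recomputed from precision and recall: {f1:.3f}")
```

The recomputed value rounds to the reported 87.5%, so the three figures are at least mutually consistent under that assumption.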

Dataset

The data is annotated with the Universal Dependencies POS tagset (UPOS): ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
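
The tagset can be kept as a constant to sanity-check labels coming out of the model or out of an annotation file. This is a small sketch; the names `UPOS_TAGS` and `unexpected_labels` are hypothetical helpers, not part of the model or of any library.

```python
# The 17 Universal Dependencies POS tags listed above.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def unexpected_labels(labels):
    """Return the sorted labels that are not in the UPOS tagset."""
    return sorted(set(labels) - UPOS_TAGS)


# Example: "B-PER" is an NER-style label, not a UPOS tag.
print(unexpected_labels(["NOUN", "VERB", "B-PER"]))
```

An empty result means every label belongs to the tagset.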

Source: Tatoeba parallel corpus, semi-manually annotated.

Usage

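from transformers import pipeline

# Load the tagger from the Hugging Face Hub.
# aggregation_strategy="simple" merges sub-word pieces into grouped predictions.
tagger = pipeline(
    "token-classification",
    model="boffire/kabyle-pos",
    aggregation_strategy="simple",
)

# Tag a Kabyle sentence; the result is a list of dicts, one per predicted group.
result = tagger("Aṭas n medden i yessen.")
for item in result:
    print(item["word"], item["entity_group"], round(float(item["score"]), 3))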
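
The pipeline output can be reduced to (word, tag, score) triples for downstream use. The sketch below assumes the usual output shape of the `transformers` token-classification pipeline: a list of dicts with `word` and `score` keys, and a tag under `entity_group` when an aggregation strategy is set (or under `entity` when it is not). The function name `to_word_tag_pairs` is hypothetical, and the `example` list is a hand-written stand-in so the snippet runs without downloading the model; its tags and scores are illustrative, not real model output.

```python
from typing import Any, Dict, List, Tuple


def to_word_tag_pairs(
    predictions: List[Dict[str, Any]],
) -> List[Tuple[str, str, float]]:
    """Convert pipeline-style dicts into (word, tag, score) triples."""
    pairs = []
    for item in predictions:
        # Aggregated output uses "entity_group"; raw output uses "entity".
        tag = item.get("entity_group", item.get("entity"))
        pairs.append((item["word"], tag, float(item["score"])))
    return pairs


# Hand-written stand-in for pipeline output (illustrative values only).
example = [
    {"entity_group": "ADP", "score": 0.97, "word": "n", "start": 5, "end": 6},
    {"entity_group": "NOUN", "score": 0.95, "word": "medden", "start": 7, "end": 13},
]

for word, tag, score in to_word_tag_pairs(example):
    print(f"{word}\t{tag}\t{score:.2f}")
```

Depending on how the label names are defined, the `simple` strategy may merge adjacent words that receive the same tag into a single group, so it is worth inspecting the output on a few sentences before relying on one-group-per-word.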

Limitations

The model has the following known limitations:

Training data size: The model was trained on only ~1,200 sentences. This is small compared to the training sets of POS taggers for high-resource languages, so the model may generalize poorly to unseen vocabulary and syntactic constructions.
Pre-annotation noise: Approximately 22% of tokens in the training data were pre-annotated by heuristic rules and not fully manually verified. This introduces some label noise, particularly for rare words and complex cliticized forms.
Cliticized forms: The model struggles with complex cliticized verb forms (e.g., k-id-xeẓẓren, m-d-awiɣ, iyi-d-yewwi) that were not well represented in the training data. These are often tagged as X (the UD tag for "other").
Domain bias: The training data comes from Tatoeba, which consists of short translated sentences. Performance may degrade on longer, more complex sentences from other domains (news, social media, literature).
Proper nouns: Capitalized words are heuristically tagged as PROPN, which may misclassify sentence-initial common nouns or adjectives.
Ambiguous tags: Some Kabyle words are genuinely ambiguous between tags (e.g., d as ADP vs. CCONJ, yella as AUX vs. VERB). The model may not consistently resolve these ambiguities.
No diacritic normalization: The model treats e and ɛ, and ḍ and d, as distinct characters, so misspelled or inconsistently diacritized text may be tagged poorly.
Single annotator bias: The annotations were created by a single annotator without inter-annotator agreement verification, which may introduce systematic tagging biases.
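
Some of these limitations can be partly mitigated around the model call. The sketch below shows two hedged ideas: applying Unicode NFC normalization so that composed and decomposed encodings of the same letter (for example ḍ as one code point versus d plus a combining dot below) become identical, and flagging predictions that are tagged X or have a low score for manual review. The function names and the `min_score` threshold of 0.6 are assumptions for illustration; whether NFC matches the encoding of the training data has not been verified here, and NFC does not restore missing diacritics or map e to ɛ.

```python
import unicodedata
from typing import Any, Dict, List


def normalize_text(text: str) -> str:
    """Apply Unicode NFC so equivalent character sequences share one encoding."""
    return unicodedata.normalize("NFC", text)


def flag_for_review(
    predictions: List[Dict[str, Any]], min_score: float = 0.6
) -> List[Dict[str, Any]]:
    """Return predictions tagged X or scored below min_score (an arbitrary cutoff)."""
    flagged = []
    for item in predictions:
        # Aggregated pipeline output uses "entity_group"; raw output uses "entity".
        tag = item.get("entity_group", item.get("entity"))
        if tag == "X" or item["score"] < min_score:
            flagged.append(item)
    return flagged


# "d" followed by U+0323 (combining dot below) composes to U+1E0D under NFC.
assert normalize_text("d\u0323") == "\u1e0d"

# Hand-written stand-in for pipeline output (illustrative values only).
sample = [
    {"entity_group": "X", "score": 0.91, "word": "iyi-d-yewwi"},
    {"entity_group": "NOUN", "score": 0.95, "word": "medden"},
    {"entity_group": "ADP", "score": 0.41, "word": "d"},
]
for item in flag_for_review(sample):
    print(item["word"], item["entity_group"], item["score"])
```

In use, `normalize_text` would be applied to the input string before calling the tagger, and `flag_for_review` to the list it returns.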

Citation

@inproceedings{dione-etal-2023-masakhapos,
  title = {MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages},
  author = {Dione, Cheikh M. Bamba and Adelani, David Ifeoluwa and Nabende, Peter and Alabi, Jesujoba and others},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages = {10883--10900},
  year = {2023},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.acl-long.609},
  doi = {10.18653/v1/2023.acl-long.609}
}

This model is a side contribution to the Masakhane initiative for African NLP.

Safetensors

Model size: 0.3B params
Tensor type: F32

Evaluation results

Self-reported on MasakhaPOS Kabyle.

Metric	Value
F1	0.875
Precision	0.873
Recall	0.878