Kabyle POS Tagger
Part-of-Speech tagger for Kabyle (kab), a Berber language spoken in Algeria.
Model Details
| Attribute | Value |
|---|---|
| Base model | XLM-RoBERTa-base |
| Task | Token Classification (POS tagging) |
| Language | Kabyle (kab) |
| Training sentences | ~1,200 |
| Test F1 | 87.5% |
| Test Precision | 87.3% |
| Test Recall | 87.8% |
Dataset
Annotated following the Universal Dependencies POS tagset:
ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
Source: Tatoeba parallel corpus, semi-manually annotated.
Usage
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="boffire/kabyle-pos",
    aggregation_strategy="simple",
)
result = tagger("Aṭas n medden i yessen.")
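The pipeline returns a list of dictionaries rather than plain tag pairs. The sketch below shows one way to turn that list into `(span, tag, score)` triples. It relies on the `entity_group`, `score`, `start`, and `end` keys that `transformers` emits for aggregated token-classification output. The helper names `to_tagged_spans` and `mock_groups` are hypothetical, and the tags and scores in the mock data are hand-written placeholders, not real model predictions.

```python
from typing import Dict, List, Tuple


def to_tagged_spans(text: str, groups: List[Dict]) -> List[Tuple[str, str, float]]:
    """Convert aggregated pipeline output into (span, tag, score) triples."""
    spans = []
    for group in groups:
        # Slice the original text so diacritics and casing are kept exactly.
        surface = text[group["start"]:group["end"]]
        spans.append((surface, group["entity_group"], round(float(group["score"]), 3)))
    return spans


def mock_groups(text: str, tagged_words: List[Tuple[str, str]]) -> List[Dict]:
    """Build stand-in pipeline output so the example runs without the model."""
    groups, cursor = [], 0
    for word, tag in tagged_words:
        start = text.index(word, cursor)
        end = start + len(word)
        groups.append(
            {"entity_group": tag, "score": 0.9, "word": word, "start": start, "end": end}
        )
        cursor = end
    return groups


sentence = "Aṭas n medden i yessen."
# Placeholder tags for illustration only; replace `fake_result` with tagger(sentence).
fake_result = mock_groups(
    sentence,
    [("Aṭas", "ADV"), ("n", "ADP"), ("medden", "NOUN"),
     ("i", "PART"), ("yessen", "VERB"), (".", "PUNCT")],
)

for surface, tag, score in to_tagged_spans(sentence, fake_result):
    print(f"{surface}\t{tag}\t{score}")
```

With `aggregation_strategy="simple"`, neighbouring tokens that receive the same tag may be merged into one group, so a span can cover more than one word. Reading the span from the `start` and `end` offsets keeps the output aligned with the input text in either case.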
Limitations
This model has several known limitations that users should be aware of:
- Training data size: The model was trained on only ~1,200 sentences. This is small compared to high-resource language POS taggers, which may affect generalization to unseen vocabulary and syntactic constructions.
- Pre-annotation noise: Approximately 22% of tokens in the training data were pre-annotated by heuristic rules and not fully manually verified. This introduces some label noise, particularly for rare words and complex cliticized forms.
- Cliticized forms: The model struggles with complex cliticized verb forms (e.g., k-id-xeẓẓren, m-d-awiɣ, iyi-d-yewwi) that were not well-represented in the training data. These are often tagged as X (unknown).
- Domain bias: The training data comes from Tatoeba, which consists of short translated sentences. Performance may degrade on longer, more complex sentences from other domains (news, social media, literature).
- Proper nouns: Capitalized words are heuristically tagged as PROPN, which may misclassify sentence-initial common nouns or adjectives.
- Ambiguous tags: Some Kabyle words are genuinely ambiguous between tags (e.g., d as ADP vs. CCONJ, yella as AUX vs. VERB). The model may not consistently resolve these ambiguities.
- No diacritic normalization: The model treats e and ɛ, ḍ and d as completely different characters. Misspelled or inconsistently diacritized text may perform poorly.
- Single annotator bias: The annotations were created by a single annotator without inter-annotator agreement verification, which may introduce systematic tagging biases.
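Because the model does no diacritic normalization of its own, one possible mitigation is to normalize input text before tagging. The sketch below applies Unicode NFC composition, so that a base letter followed by a combining dot below becomes the single precomposed character, and then maps two Greek look-alike letters to their Latin counterparts. The function name `normalize_kabyle` and the `LOOKALIKES` table are assumptions for illustration. The model card does not state which character forms the training data uses, so check this against your own data before relying on it.

```python
import unicodedata

# Assumed look-alike substitutions; extend or drop entries to match your data.
LOOKALIKES = {
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open e (ɛ)
    "\u03b3": "\u0263",  # Greek small gamma   -> Latin small gamma (ɣ)
}
_LOOKALIKE_TABLE = str.maketrans(LOOKALIKES)


def normalize_kabyle(text: str) -> str:
    """Compose combining marks (NFC), then replace assumed look-alike letters."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_LOOKALIKE_TABLE)


# "d" + combining dot below (U+0323) becomes the single character ḍ (U+1E0D).
decomposed = "d\u0323"
print(len(decomposed), len(normalize_kabyle(decomposed)))
```

This does not repair missing diacritics: a plain `d` typed in place of `ḍ` stays a plain `d`, so the limitation above still applies to genuinely misspelled text.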
Citation
@inproceedings{dione-etal-2023-masakhapos,
title = {MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages},
author = {Dione, Cheikh M. Bamba and Adelani, David Ifeoluwa and Nabende, Peter and Alabi, Jesujoba and others},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages = {10883--10900},
year = {2023},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2023.acl-long.609},
doi = {10.18653/v1/2023.acl-long.609}
}
Developed as a side project within the Masakhane initiative for African NLP.
Evaluation results
- F1 on MasakhaPOS Kabyle (self-reported): 0.875
- Precision on MasakhaPOS Kabyle (self-reported): 0.873
- Recall on MasakhaPOS Kabyle (self-reported): 0.878