Kabyle POS Tagger

Part-of-Speech tagger for Kabyle (kab), a Berber language spoken in Algeria.

Model Details

Attribute            Value
Base model           XLM-RoBERTa-base
Task                 Token Classification (POS tagging)
Language             Kabyle (kab)
Training sentences   ~1,200
Test F1              87.5%
Test Precision       87.3%
Test Recall          87.8%
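
As a quick consistency check on the scores above, F1 can be recomputed as the harmonic mean of precision and recall. This is only a sketch: it assumes the three reported numbers are micro-averaged over the same test set, which the card does not state, and the function name f1_from is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported test precision and recall, expressed as fractions.
# Assumption: both are micro-averaged over the same test set.
f1 = f1_from(0.873, 0.878)
print(f"{f1:.1%}")  # prints 87.5%
```

Under that assumption, the recomputed value agrees with the reported Test F1 after rounding to one decimal place.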

Dataset

Annotated following the Universal Dependencies POS tagset: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X

Source: Tatoeba parallel corpus, semi-manually annotated.

Usage

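from transformers import pipeline

# Load the tagger from the Hugging Face Hub.
# aggregation_strategy="simple" merges adjacent tokens that share a predicted label.
tagger = pipeline(
    "token-classification",
    model="boffire/kabyle-pos",
    aggregation_strategy="simple",
)

# Each item in the result is a dict holding the text span, its predicted tag, and a score.
result = tagger("Aṭas n medden i yessen.")
print(result)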
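
The raw pipeline output is a list of dictionaries, which can be awkward to read. The following is a minimal sketch of reducing it to (word, tag) pairs and picking out low-confidence predictions. The helper names and the 0.80 threshold are hypothetical, and the sample list is a hand-written placeholder in the shape the pipeline returns with aggregation enabled, not real output from this model.

```python
from typing import Dict, List, Tuple


def to_word_tag_pairs(entities: List[Dict]) -> List[Tuple[str, str]]:
    """Reduce token-classification pipeline output to (word, tag) pairs."""
    return [(e["word"], e["entity_group"]) for e in entities]


def low_confidence(entities: List[Dict], threshold: float = 0.80) -> List[str]:
    """Return the words whose predicted tag scored below the threshold."""
    return [e["word"] for e in entities if e["score"] < threshold]


# Placeholder data: tags and scores are invented to illustrate the structure only.
sample = [
    {"entity_group": "NOUN", "score": 0.99, "word": "medden", "start": 7, "end": 13},
    {"entity_group": "VERB", "score": 0.62, "word": "yessen", "start": 16, "end": 22},
]

pairs = to_word_tag_pairs(sample)
uncertain = low_confidence(sample)

for word, tag in pairs:
    print(f"{word}\t{tag}")
```

In real use, pass the result returned by tagger(...) instead of sample. The low-scoring words are reasonable candidates for manual review.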

Limitations

This model has several known limitations that users should be aware of:

  1. Training data size: The model was trained on only ~1,200 sentences. This is small compared with the training sets used for POS taggers in high-resource languages, and it may limit generalization to unseen vocabulary and syntactic constructions.

  2. Pre-annotation noise: Approximately 22% of tokens in the training data were pre-annotated by heuristic rules and not fully manually verified. This introduces some label noise, particularly for rare words and complex cliticized forms.

  3. Cliticized forms: The model struggles with complex cliticized verb forms (e.g., k-id-xeẓẓren, m-d-awiɣ, iyi-d-yewwi) that were not well represented in the training data. These are often tagged as X (the Universal Dependencies tag for "other").

  4. Domain bias: The training data comes from Tatoeba, which consists of short translated sentences. Performance may degrade on longer, more complex sentences from other domains (news, social media, literature).

  5. Proper nouns: Capitalized words are heuristically tagged as PROPN, which may misclassify sentence-initial common nouns or adjectives.

  6. Ambiguous tags: Some Kabyle words are genuinely ambiguous between tags (e.g., d as ADP vs. CCONJ, yella as AUX vs. VERB). The model may not consistently resolve these ambiguities.

  7. No diacritic normalization: The model treats similar-looking characters, such as e and ɛ, or d and ḍ, as completely different characters. It may perform poorly on misspelled or inconsistently diacritized text.

  8. Single annotator bias: The annotations were created by a single annotator without inter-annotator agreement verification, which may introduce systematic tagging biases.
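
Two of the limitations above can be partly mitigated around the model. The sketch below is one possible approach, not part of the released model: it applies Unicode NFC normalization before tagging, so that a letter typed as a base character plus a combining mark matches its precomposed form, and it collects tokens tagged X for manual review. The helper names are hypothetical and the sample list is hand-written placeholder data. NFC does not merge genuinely different letters such as e and ɛ.

```python
import unicodedata
from typing import Dict, List


def normalize_text(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


def flag_for_review(entities: List[Dict]) -> List[str]:
    """Return the words that the tagger labelled X, for manual checking."""
    return [e["word"] for e in entities if e["entity_group"] == "X"]


# 'd' followed by COMBINING DOT BELOW (U+0323) becomes the single character U+1E0D.
decomposed = "d\u0323"
composed = normalize_text(decomposed)
print(len(decomposed), len(composed))  # prints 2 1

# Placeholder data in the shape of the pipeline output; not real model predictions.
sample = [
    {"entity_group": "X", "score": 0.41, "word": "iyi-d-yewwi"},
    {"entity_group": "NOUN", "score": 0.97, "word": "medden"},
]
flagged = flag_for_review(sample)
print(flagged)  # prints ['iyi-d-yewwi']
```

Whether normalization helps depends on how the training text was encoded, which the card does not specify, so it is worth comparing predictions with and without it.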

Citation

@inproceedings{dione-etal-2023-masakhapos,
  title = {MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages},
  author = {Dione, Cheikh M. Bamba and Adelani, David Ifeoluwa and Nabende, Peter and Alabi, Jesujoba and others},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages = {10883--10900},
  year = {2023},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.acl-long.609},
  doi = {10.18653/v1/2023.acl-long.609}
}

A side project of the Masakhane initiative for African NLP.
