ZambiaSocialBERT
ZambiaSocialBERT is a multilingual text classification model for Zambian social-media language. It is based on multilingual BERT (google-bert/bert-base-multilingual-cased) and adapted using QLoRA, a parameter-efficient fine-tuning method.
Task
The model performs language identification on Zambian social-media text. It classifies an input text into one of five categories:
- english
- bemba_cibemba
- nyanja_cinyanja
- lusaka_slang
- noisy_social_media
The noisy_social_media category captures highly informal, emoji-laden, conversational content, so the model can separate noisy social-media text from the other four categories.
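The classification step described above can be sketched as follows, assuming the published checkpoint loads with the standard Hugging Face `transformers` classes and stores its label names in `config.id2label`. The `model_id` argument is a placeholder for the model's Hub repository ID or a local path, which this card does not state; `softmax`, `top_label`, and `classify` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: Sequence[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the highest-scoring label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify(texts: Sequence[str], model_id: str) -> List[Tuple[str, float]]:
    """Classify texts with a sequence-classification checkpoint.

    model_id is a placeholder: pass the Hub repository ID or a local path.
    """
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    # One (label, probability) pair per input text.
    return [top_label(row.tolist(), model.config.id2label) for row in logits]
```

Reading label names from `config.id2label` avoids hard-coding a class order, which this card does not specify.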
Base model
google-bert/bert-base-multilingual-cased
Training data
The model was trained on a rebalanced Zambian social-media corpus of approximately 141,000 examples: around 35,000 examples for each of the four main language categories and 1,000 for the noisy_social_media category.
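The stated composition can be checked with a few lines of arithmetic. The counts below are the rounded figures from this card, not exact corpus sizes, and `CLASS_COUNTS` is a name chosen for illustration.

```python
# Approximate per-class counts as stated in this card (rounded figures).
CLASS_COUNTS = {
    "english": 35_000,
    "bemba_cibemba": 35_000,
    "nyanja_cinyanja": 35_000,
    "lusaka_slang": 35_000,
    "noisy_social_media": 1_000,
}

total = sum(CLASS_COUNTS.values())
shares = {label: count / total for label, count in CLASS_COUNTS.items()}

print(f"total examples: {total}")
for label, share in shares.items():
    print(f"{label:>20}: {share:.2%}")
```

Under these figures the noisy_social_media class makes up less than 1% of the corpus, so a per-class average such as macro F1 gives it the same weight as each of the larger classes.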
Training procedure
The model was adapted using QLoRA. The base encoder was loaded in 4-bit quantized form and kept frozen; only the low-rank adapter layers added to it and the classifier head were trained.
Quantization: 4-bit, NF4 quantization type, double quantization enabled, bfloat16 compute dtype.
LoRA configuration: rank (r) 8, alpha 16, dropout 0.05, target modules query, key, value, and dense, no bias, task type SEQ_CLS.
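The quantization and LoRA settings above map onto the `BitsAndBytesConfig` class in `transformers` and the `LoraConfig` class in `peft` roughly as follows. This is a sketch under the assumption that those libraries were used; the card does not include the training script, and `QUANT_KWARGS`, `LORA_KWARGS`, and `build_qlora_model` are hypothetical names.

```python
# Hyperparameters taken from this card, kept as plain dicts so they can be
# inspected without the heavy dependencies installed.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
}

LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["query", "key", "value", "dense"],
    "bias": "none",
    "task_type": "SEQ_CLS",
}

BASE_MODEL = "google-bert/bert-base-multilingual-cased"
NUM_LABELS = 5  # the five categories listed under Task


def build_qlora_model():
    """Load the 4-bit base model and wrap it with LoRA adapters (assumed setup)."""
    # Lazy imports: needs transformers, peft, bitsandbytes, and a CUDA GPU.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_MODEL,
        num_labels=NUM_LABELS,
        quantization_config=BitsAndBytesConfig(**QUANT_KWARGS),
    )
    model = prepare_model_for_kbit_training(model)
    # For task_type SEQ_CLS, peft is expected to keep the classification
    # head trainable alongside the adapters.
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

With alpha 16 and rank 8, the adapter update is scaled by alpha / r = 2 under the default LoRA scaling rule.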
Training configuration: 6 epochs, learning rate 2e-5, cosine learning-rate scheduler, warmup ratio 0.1, per-device train batch size 24, gradient accumulation steps 6 (an effective batch size of 144 per device), weight decay 0.01, maximum gradient norm 0.5, paged AdamW 8-bit optimizer, bfloat16 mixed precision, gradient checkpointing enabled. The best checkpoint was selected by macro F1 score.
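The sketch below shows how these settings could be passed to the `transformers` `Trainer`, assuming that is the training loop used (the card does not say). Argument names follow `TrainingArguments` as commonly documented and may vary between library versions; `output_dir`, `TRAINING_KWARGS`, and the `macro_f1` helper are illustrative. Best-checkpoint selection would also need evaluation and save strategies, which are omitted because the card does not specify them.

```python
from typing import Sequence

TRAINING_KWARGS = {
    "num_train_epochs": 6,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "per_device_train_batch_size": 24,
    "gradient_accumulation_steps": 6,
    "weight_decay": 0.01,
    "max_grad_norm": 0.5,
    "optim": "paged_adamw_8bit",
    "bf16": True,
    "gradient_checkpointing": True,
}

# Examples contributing to each optimizer step on one device.
effective_batch_size = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def build_training_arguments(output_dir: str):
    """Create TrainingArguments from the card's settings (assumed usage)."""
    # Lazy import so the arithmetic above runs without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

Because macro F1 averages per-class scores without weighting by class size, errors on the small noisy_social_media class affect checkpoint selection as much as errors on any larger class.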
After training, the LoRA adapters were merged into the base encoder, so the published model has a standard mBERT architecture.
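Merging folds each low-rank update into the weight matrix it adapts, which is why no adapter code is needed at inference time. The toy example below illustrates the standard LoRA merge rule `W' = W + (alpha / r) * B @ A` with the card's alpha 16 and r 8. The matrices are made-up small values and `merge_lora` is a hypothetical helper; in practice a library such as `peft` performs this step (for example via `merge_and_unload`), and the card does not state which routine was used.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x n) and b (n x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, r: int) -> Matrix:
    """Return weight + (alpha / r) * lora_b @ lora_a."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


R, ALPHA = 8, 16          # rank and alpha from this card
D_OUT, D_IN = 2, 3        # toy layer size, chosen only for illustration

weight = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]                          # D_OUT x D_IN
lora_b = [[0.01 * (i + j) for j in range(R)] for i in range(D_OUT)]   # D_OUT x R
lora_a = [[0.1 * (i - j) for j in range(D_IN)] for i in range(R)]     # R x D_IN

merged = merge_lora(weight, lora_b, lora_a, ALPHA, R)

# The merged layer should reproduce base output plus scaled adapter output.
x = [[1.0], [2.0], [3.0]]
scale = ALPHA / R
base_out = matmul(weight, x)
low_rank_out = matmul(lora_b, matmul(lora_a, x))
adapter_out = [[b[0] + scale * l[0]] for b, l in zip(base_out, low_rank_out)]
merged_out = matmul(merged, x)
```

Here `merged_out` and `adapter_out` agree up to floating-point rounding, which is the property that lets a merged checkpoint be used like an ordinary mBERT classifier.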
Intended use
ZambiaSocialBERT was developed as part of a master's dissertation on multilingual Aspect-Based Sentiment Analysis for low-resource Zambian ride-hailing social-media discourse. Its adapted encoder is reused as the starting point for the ZambiaABSA model.
Limitations
The model was trained on synthetic Zambian social-media text. Performance on naturally occurring data may differ. It performs language identification only and does not predict sentiment.