ZambiaSocialBERT

ZambiaSocialBERT is a multilingual text classification model for Zambian social-media language. It is based on multilingual BERT (google-bert/bert-base-multilingual-cased) and adapted using QLoRA, a parameter-efficient fine-tuning method.

Task

The model performs language identification on Zambian social-media text. It classifies an input text into one of five categories:

english
bemba_cibemba
nyanja_cinyanja
lusaka_slang
noisy_social_media

The noisy_social_media category captures highly informal, emoji-laden conversational content, so the model can separate such text from the four main language categories.
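A minimal inference sketch, assuming the published checkpoint loads as a standard sequence-classification model. The repository id `MODEL_ID` is a hypothetical placeholder (the source does not state where the model is published), and `top_label` and `classify` are illustrative helper names. Label names are read from the checkpoint config rather than assuming an index order.

```python
import math

# Hypothetical repository id: replace with the actual published location.
MODEL_ID = "your-namespace/ZambiaSocialBERT"


def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    best = max(range(len(logits)), key=lambda i: exps[i])
    return id2label[best], exps[best] / total


def classify(text, model_id=MODEL_ID):
    """Classify one text. Requires torch and transformers to be installed."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Use the label mapping stored in the checkpoint, not a hard-coded order.
    return top_label(logits, model.config.id2label)
```

Calling `classify("some post text")` would return one of the five category names together with its softmax probability, provided the checkpoint stores those names in its config.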

Base model

google-bert/bert-base-multilingual-cased

Training data

The model was trained on a rebalanced Zambian social-media corpus of approximately 141,000 examples: around 35,000 examples for each of the four main language categories and 1,000 for the noisy_social_media category.
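The class balance implied by these figures can be checked with a few lines of arithmetic. The counts below are the approximate values stated above, not exact corpus statistics.

```python
# Approximate per-class counts as stated in the text above.
CLASS_COUNTS = {
    "english": 35_000,
    "bemba_cibemba": 35_000,
    "nyanja_cinyanja": 35_000,
    "lusaka_slang": 35_000,
    "noisy_social_media": 1_000,
}

total = sum(CLASS_COUNTS.values())
shares = {label: count / total for label, count in CLASS_COUNTS.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(CLASS_COUNTS.values()) / min(CLASS_COUNTS.values())

print(f"total examples: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"largest / smallest class: {imbalance_ratio:.0f}x")
```

On these figures the noisy_social_media class is under 1% of the corpus, so a plain accuracy score would barely register errors on it, whereas a macro-averaged metric weights all five classes equally.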

Training procedure

The model was adapted using QLoRA: the base encoder was loaded in 4-bit quantized form and kept frozen, while the low-rank adapter layers and the classifier head were trained.

Quantization: 4-bit, NF4 quantization type, double quantization enabled, bfloat16 compute dtype.
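These settings map onto the `BitsAndBytesConfig` class in Hugging Face Transformers. The following is a sketch of one way to express them, not the author's training script. The `build_quantized_model` helper and its `num_labels=5` default are assumptions for illustration.

```python
# Keyword arguments mirroring the quantization settings listed above.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}


def build_quantized_model(num_labels=5):
    """Load mBERT in 4-bit for sequence classification.

    Requires torch, transformers, and bitsandbytes, plus a supported GPU.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute dtype
        **QUANT_KWARGS,
    )
    return AutoModelForSequenceClassification.from_pretrained(
        "google-bert/bert-base-multilingual-cased",
        num_labels=num_labels,
        quantization_config=quant_config,
    )
```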

LoRA configuration: rank (r) 8; alpha 16; dropout 0.05; target modules query, key, value, and dense; bias none; task type SEQ_CLS.
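In the PEFT library, this configuration could be written roughly as follows. The `attach_adapters` helper is hypothetical, and the call to `prepare_model_for_kbit_training` is a common QLoRA step assumed here rather than confirmed by the source.

```python
# Keyword arguments mirroring the LoRA settings listed above.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["query", "key", "value", "dense"],
    "bias": "none",
    "task_type": "SEQ_CLS",
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def attach_adapters(quantized_model):
    """Wrap a 4-bit model with LoRA adapters. Requires the peft package."""
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Assumed preparation step; not stated in the source.
    model = prepare_model_for_kbit_training(quantized_model)
    model = get_peft_model(model, LoraConfig(**LORA_KWARGS))
    model.print_trainable_parameters()
    return model
```

With rank 8 and alpha 16, the adapter update is scaled by a factor of 2. In the Hugging Face BERT implementation, module names ending in `dense` include the attention output and feed-forward projections, among others, so this target list extends beyond the attention projections.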

Training configuration: 6 epochs; learning rate 2e-5 with a cosine learning-rate scheduler and warmup ratio 0.1; per-device train batch size 24 with 6 gradient accumulation steps (an effective batch of 144 examples per device); weight decay 0.01; maximum gradient norm 0.5; paged 8-bit AdamW optimizer; bfloat16 mixed precision; gradient checkpointing enabled. The best checkpoint was selected by macro F1 score.
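As a sketch, the hyperparameters above correspond to the following `TrainingArguments` keyword arguments in Hugging Face Transformers. Argument names can differ between library versions. The output directory and the helper functions are hypothetical. Settings for checkpoint selection are left out because the source does not give the evaluation schedule or metric key.

```python
import math

# Keyword arguments mirroring the training settings listed above.
TRAINING_KWARGS = {
    "num_train_epochs": 6,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "per_device_train_batch_size": 24,
    "gradient_accumulation_steps": 6,
    "weight_decay": 0.01,
    "max_grad_norm": 0.5,
    "optim": "paged_adamw_8bit",
    "bf16": True,
    "gradient_checkpointing": True,
}


def effective_batch_size(kwargs, num_devices=1):
    """Examples consumed per optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def approx_total_steps(num_examples, kwargs, num_devices=1):
    """Rough optimizer-step count; the Trainer's exact figure may differ."""
    per_epoch = math.ceil(num_examples / effective_batch_size(kwargs, num_devices))
    return per_epoch * kwargs["num_train_epochs"]


def build_training_arguments(output_dir="zambia-social-bert-qlora"):
    """Build TrainingArguments. Requires transformers (output_dir is hypothetical)."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

On a single device, `effective_batch_size(TRAINING_KWARGS)` gives 24 × 6 = 144 examples per optimizer step. The number of devices and the size of the training split are not stated in the source, so any step count from `approx_total_steps` is only an estimate.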

After training, the LoRA adapters were merged into the base encoder, so the published model has a standard mBERT architecture.
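The merge can be written out for a single adapted weight matrix, assuming the standard LoRA formulation with scaling alpha / r. The symbols W_0 (frozen base weight) and B, A (trained low-rank factors) are notation introduced here for illustration.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A
                  = W_0 + \frac{16}{8}\, B A
                  = W_0 + 2\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
       A \in \mathbb{R}^{r \times d_{\text{in}}},\quad r = 8 .
```

Because the merged matrix has the same shape as W_0, no adapter modules remain after merging, which is consistent with the published model having the standard mBERT architecture.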

Intended use

ZambiaSocialBERT was developed as part of a master's dissertation on multilingual Aspect-Based Sentiment Analysis for low-resource Zambian ride-hailing social-media discourse. Its adapted encoder is reused as the starting point for the ZambiaABSA model.