ZambiaSocialBERT

ZambiaSocialBERT is a multilingual text classification model for Zambian social-media language. It is based on multilingual BERT (google-bert/bert-base-multilingual-cased) and adapted using QLoRA, a parameter-efficient fine-tuning method.

Task

The model performs language identification on Zambian social-media text. It classifies an input text into one of five categories:

  • english
  • bemba_cibemba
  • nyanja_cinyanja
  • lusaka_slang
  • noisy_social_media

The noisy_social_media category captures highly informal, emoji-laden, conversational content, so the model can separate such text from the four language categories.
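A minimal inference sketch, assuming the published checkpoint loads through the Hugging Face transformers text-classification pipeline. The repository id kelvinmbewe/ZambiaSocialBERT and the label order in LABELS are assumptions; the checkpoint's own config.id2label is authoritative. The decode helper only illustrates how five logits map to one category.

```python
import math

# Assumed label order, for illustration only; check config.id2label.
LABELS = [
    "english",
    "bemba_cibemba",
    "nyanja_cinyanja",
    "lusaka_slang",
    "noisy_social_media",
]


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def classify(text, model_id="kelvinmbewe/ZambiaSocialBERT"):
    """Classify one text with the hosted checkpoint (needs transformers)."""
    # Imported lazily so the helpers above work without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(text)[0]  # dict with "label" and "score"
```

Calling classify("...") downloads the model on first use; the label strings it returns depend on how the checkpoint's config names its classes.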

Base model

google-bert/bert-base-multilingual-cased

Training data

The model was trained on a rebalanced Zambian social-media corpus of approximately 141,000 examples: around 35,000 examples for each of the four main language categories and 1,000 for the noisy_social_media category.
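The class balance can be checked with a few lines of arithmetic. The counts below are the approximate figures quoted above, not exact corpus statistics.

```python
# Approximate per-class counts as stated in the model card.
CLASS_COUNTS = {
    "english": 35_000,
    "bemba_cibemba": 35_000,
    "nyanja_cinyanja": 35_000,
    "lusaka_slang": 35_000,
    "noisy_social_media": 1_000,
}

total = sum(CLASS_COUNTS.values())
shares = {label: count / total for label, count in CLASS_COUNTS.items()}

for label, count in CLASS_COUNTS.items():
    print(f"{label:20s} {count:>7,d} {shares[label]:7.2%}")
print(f"{'total':20s} {total:>7,d}")
```

Under these figures the noisy_social_media class makes up well under 1% of the corpus, which is one reason a class-averaged metric such as macro F1 is a sensible selection criterion: it weights every class equally regardless of size.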

Training procedure

The model was adapted using QLoRA. The base encoder was loaded in 4-bit quantized form and frozen, with low-rank adapter layers trained on top and the classifier head unfrozen.

Quantization: 4-bit, NF4 quantization type, double quantization enabled, bfloat16 compute dtype.
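The quantization settings above map onto the BitsAndBytesConfig class in transformers roughly as follows. This is a sketch rather than the author's training script; the function name load_quantized_base and num_labels=5 (taken from the five categories) are assumptions. The settings are kept in a plain dict so they can be inspected without the libraries installed.

```python
# Settings copied from the model card.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",  # resolved to torch.bfloat16
}


def load_quantized_base(
    model_id="google-bert/bert-base-multilingual-cased",
    num_labels=5,
):
    """Load mBERT with a 4-bit encoder and a fresh classification head.

    Hypothetical helper; requires transformers, torch, bitsandbytes
    and a CUDA-capable GPU.
    """
    from transformers import (
        AutoModelForSequenceClassification,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(**QUANT_KWARGS)
    return AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,
        quantization_config=quant_config,
    )
```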

LoRA configuration: rank (r) 8; alpha 16; dropout 0.05; target modules query, key, value, and dense; no bias; task type SEQ_CLS.
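Expressed with the PEFT library, the LoRA settings above might look like the sketch below. The second half estimates how many adapter weights this adds, assuming the standard BERT-base dimensions (hidden size 768, feed-forward size 3072, 12 layers) and that "dense" matches the attention output and both feed-forward projections in each layer. The pooler and the classifier head are left out of the count.

```python
# Settings copied from the model card.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["query", "key", "value", "dense"],
    "bias": "none",
    "task_type": "SEQ_CLS",
}


def build_lora_config():
    """Build the PEFT config (requires the peft package)."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


def lora_params(in_features, out_features, r=LORA_KWARGS["r"]):
    """Adapter weights for one linear layer: A is r x in, B is out x r."""
    return r * (in_features + out_features)


# Standard BERT-base dimensions (assumed to apply to mBERT base).
HIDDEN, INTERMEDIATE, LAYERS = 768, 3072, 12

# Per layer: query, key, value, attention output (768 x 768 each),
# plus the two feed-forward projections (768 <-> 3072).
per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)
    + 2 * lora_params(HIDDEN, INTERMEDIATE)
)
encoder_adapter_params = LAYERS * per_layer

# LoRA scales the low-rank update by alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]

print(f"adapter weights per layer: {per_layer:,}")
print(f"adapter weights in encoder: {encoder_adapter_params:,}")
print(f"update scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters add on the order of 1.3 million trainable weights, a small fraction of the roughly 0.2B parameters in the base model.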

Training configuration: 6 epochs, learning rate 2e-5, cosine learning-rate scheduler, warmup ratio 0.1, per-device train batch size 24, gradient accumulation steps 6 (an effective batch size of 144 per device), weight decay 0.01, maximum gradient norm 0.5, paged AdamW 8-bit optimizer, bfloat16 mixed precision, gradient checkpointing enabled. The best checkpoint was selected by macro F1 score.
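The hyperparameters above correspond to transformers TrainingArguments fields roughly as sketched here. The output directory, the per-epoch evaluation and save strategy, and the metric key "macro_f1" are assumptions needed to make best-checkpoint selection work; they are not stated in this card. The macro F1 helper is a plain-Python illustration that averages over all five classes, not necessarily the implementation used in training.

```python
TRAINING_KWARGS = {
    "output_dir": "zambia-social-bert-qlora",  # hypothetical path
    "num_train_epochs": 6,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "per_device_train_batch_size": 24,
    "gradient_accumulation_steps": 6,
    "weight_decay": 0.01,
    "max_grad_norm": 0.5,
    "optim": "paged_adamw_8bit",
    "bf16": True,
    "gradient_checkpointing": True,
    # Assumptions: evaluate and save once per epoch so the best
    # checkpoint can be reloaded. Older transformers releases name
    # the first key "evaluation_strategy".
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "metric_for_best_model": "macro_f1",
}

# Examples contributing to each optimizer step, per device.
effective_batch_size = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of per-class F1; absent classes score 0."""
    scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def compute_metrics(eval_pred):
    """Trainer-style callback: (logits, labels) -> metric dict."""
    logits, labels = eval_pred
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return {"macro_f1": macro_f1(list(labels), preds)}


def build_training_args():
    """Build TrainingArguments (requires transformers and torch)."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```

Passing compute_metrics to a Trainer makes the "macro_f1" key available for checkpoint selection.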

After training, the LoRA adapters were merged into the base encoder, so the published model has a standard mBERT architecture.
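Merging works because a LoRA update is just a low-rank matrix added to the frozen weight: W_merged = W + (alpha / r) * B * A. The toy example below illustrates this with tiny hand-written matrices and a rank-1 adapter; the values are made up and the scaling of 2.0 comes from alpha 16 and rank 8. In PEFT the usual call for this step is merge_and_unload(), though this card does not say which tooling was used.

```python
SCALING = 16 / 8  # alpha / r from the LoRA configuration


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def merge(weight, lora_b, lora_a, scaling=SCALING):
    """Fold the low-rank update into the base weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2 x 2 base weight with a rank-1 adapter (illustrative values).
W = [[1, 0], [0, 1]]
B = [[1], [2]]   # out_features x r
A = [[3, 4]]     # r x in_features
x = [[1], [1]]   # one input column vector

merged = merge(W, B, A)

# Output with a separate adapter: W x + scaling * B (A x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [
    [b[0] + SCALING * a[0]] for b, a in zip(base_out, adapter_out)
]

# Output with the merged weight is identical.
assert matmul(merged, x) == separate
```

After the merge no adapter modules remain, so the result has the same layer shapes as the base model and loads without PEFT.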

Intended use

ZambiaSocialBERT was developed as part of a master's dissertation on multilingual Aspect-Based Sentiment Analysis for low-resource Zambian ride-hailing social-media discourse. Its adapted encoder is reused as the starting point for the ZambiaABSA model.

Limitations

The model was trained on synthetic Zambian social-media text. Performance on naturally occurring data may differ. It performs language identification only and does not predict sentiment.
