imomayiz/darija-english
Darija2Vec-SOTA-300D is a 300-dimensional Word2Vec embedding model built specifically for Moroccan Arabic (Darija). Positioned as a state-of-the-art (SOTA) resource, it addresses two challenges that are characteristic of Darija NLP: heavy code-switching and the coexistence of two writing systems, Arabic script and Latin script (Arabizi).
Standard embeddings treat the Arabic-script and Latin-script spellings of a word as unrelated tokens. This model instead implements a Script Unification Pipeline:
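- Latin-script (Arabizi) tokens are mapped to their Arabic-script equivalents (e.g., ana → أنا, ghadi → غادي). This pools the occurrences of both spellings into a single vector, roughly doubling the statistical density for core semantic concepts.

| Parameter | Configuration |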
|---|---|
| Model Type | Word2Vec Skip-gram (sg=1) |
| Vector Dimensions | 300 |
| Window Size | 7 (optimized for Darija syntax) |
| Corpus Size | ~317,141 unique sentences |
| Min Word Count | 5 |
| Training Epochs | 15 |
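
The script unification step and the configuration table can be sketched together as follows. This is a minimal illustration, not the model's actual training code: the two-entry lexicon `ARABIZI_TO_ARABIC` and the helpers `unify_script` and `train_word2vec` are hypothetical names, and the keyword arguments assume gensim 4.x (older releases use `size` and `iter` instead of `vector_size` and `epochs`).

```python
# Hypothetical two-entry lexicon built from the examples in this card;
# the mapping used by the real pipeline is not documented here.
ARABIZI_TO_ARABIC = {"ana": "أنا", "ghadi": "غادي"}


def unify_script(tokens, lexicon=ARABIZI_TO_ARABIC):
    """Replace known Arabizi tokens with their Arabic-script form; keep the rest."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


# Hyperparameters copied from the configuration table (gensim 4.x names).
W2V_PARAMS = {
    "sg": 1,             # skip-gram
    "vector_size": 300,  # vector dimensions
    "window": 7,         # context window
    "min_count": 5,      # drop words seen fewer than 5 times
    "epochs": 15,        # training epochs
}


def train_word2vec(sentences, **overrides):
    """Unify scripts, then train Word2Vec on tokenized sentences (lists of str)."""
    # Imported lazily so the helpers above work without gensim installed.
    from gensim.models import Word2Vec

    params = {**W2V_PARAMS, **overrides}
    unified = [unify_script(sentence) for sentence in sentences]
    return Word2Vec(sentences=unified, **params)
```

On a toy corpus, pass `min_count=1` as an override; with the table's value of 5, every word in a handful of sentences would be discarded before training.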
The model was trained on a consolidated corpus combining the best available public resources:
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download
# Download the SOTA vectors
repo_id = "halimbahae/Darija2Vec-SOTA-300D"
vector_file = hf_hub_download(repo_id=repo_id, filename="darija2vec_sota_vectors.txt")
# Load into Gensim
wv = KeyedVectors.load_word2vec_format(vector_file, binary=False)
# Explore similarities
print(wv.most_similar("مزيان", topn=5))    # "mzyan" (good)
print(wv.most_similar("طوموبيل", topn=5))  # "tomobil" (car)
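
Because the model was trained with a minimum word count of 5, rarer words are absent from the vocabulary, and gensim's `most_similar` raises a `KeyError` for unknown keys. A small guard avoids that. The helper name `similar_or_none` is hypothetical and not part of the released model or of gensim; it only relies on `KeyedVectors` supporting the `in` operator and `most_similar`.

```python
def similar_or_none(wv, word, topn=5):
    """Return the nearest neighbours of `word`, or None if it is out of vocabulary."""
    if word not in wv:  # KeyedVectors supports membership tests
        return None
    return wv.most_similar(word, topn=topn)
```

If queries arrive in Arabizi, they presumably need the same script mapping that was applied at training time before the lookup; that step depends on the model's own lexicon and is not shown here.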