Assamese Tokenizer
অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।
A tokenizer for the Assamese language (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from.
What is a tokenizer?
Computers and AI models process numerical data, not natural language. A tokenizer bridges this gap by converting text into numerical representations: it breaks sentences into smaller units called tokens and assigns each token a unique numeric identifier.
For example, "অসম এখন ধুনীয়া ৰাজ্য।" ("Assam is a beautiful state.") is split into 5 tokens:
অসম → এখন → ধুনীয়া → ৰাজ্য → ।
Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning.
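The idea of splitting text and assigning IDs can be sketched with a toy example. This is not how the actual tokenizer works internally (it uses a learned SentencePiece vocabulary); the whitespace splitting and the helper names `toy_tokenize` and `build_vocab` are assumptions made purely for illustration.

```python
def toy_tokenize(sentence):
    """Split on whitespace and treat the danda (।) as its own token."""
    return sentence.replace("।", " । ").split()


def build_vocab(tokens):
    """Give each distinct token the next free integer ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


sentence = "অসম এখন ধুনীয়া ৰাজ্য।"
tokens = toy_tokenize(sentence)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # five tokens, the last one being the danda
print(ids)     # one integer per token
```

The IDs produced here are arbitrary; in the real tokenizer they are fixed by the trained vocabulary.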
Why this tokenizer exists
Most tokenizers are designed for English or Hindi; Assamese support is limited and often inadequate. This tokenizer was built from scratch for the Assamese language: it understands the Assamese script, handles compound words, and covers the full character set.
- 32,000 tokens: common words remain intact; rare words are split into smaller subword pieces
- Zero unknown tokens: every Assamese character is recognized
- Lossless roundtrip: encoding and then decoding reproduces the original text
- Assamese digits are tokenized individually: ২০২৪ is split into separate digits rather than merged
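The properties listed above can be checked on your own text. The following is a minimal sketch; the helper names are hypothetical, and the functions take plain callables and lists so they do not depend on any particular tokenizer object.

```python
# Assamese digits share the Bengali-Assamese code points U+09E6..U+09EF.
ASSAMESE_DIGITS = {chr(cp) for cp in range(0x09E6, 0x09F0)}


def is_lossless(encode, decode, text):
    """True if decoding the encoded text gives back exactly the input."""
    return decode(encode(text)) == text


def has_unknown(ids, unk_id):
    """True if the unknown-token ID appears anywhere in the encoded IDs."""
    return unk_id in ids


def digits_are_separate(pieces):
    """True if no single piece contains more than one Assamese digit."""
    return all(
        sum(ch in ASSAMESE_DIGITS for ch in piece) <= 1
        for piece in pieces
    )
```

With a loaded SentencePiece processor `sp`, these could be called as `is_lossless(sp.EncodeAsIds, sp.DecodeIds, text)`, `has_unknown(sp.EncodeAsIds(text), sp.unk_id())`, and `digits_are_separate(sp.EncodeAsPieces("২০২৪"))`.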
Special tokens
These tokens are used for chat and instruction-following models:
<|system|> <|user|> <|assistant|> <|endoftext|>
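As an illustration of how these tokens might frame a conversation, here is a small sketch. The exact prompt layout is an assumption (this README does not specify a chat template), and `format_chat` is a hypothetical helper, not part of the tokenizer.

```python
ROLES = ("system", "user", "assistant")


def format_chat(messages):
    """Build a prompt string from (role, content) pairs.

    Assumed layout: each turn is '<|role|>content<|endoftext|>', and the
    prompt ends with '<|assistant|>' so the model continues as the assistant.
    """
    parts = []
    for role, content in messages:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<|{role}|>{content}<|endoftext|>")
    parts.append("<|assistant|>")
    return "".join(parts)


prompt = format_chat([
    ("system", "You are a helpful assistant."),
    ("user", "অসম এখন ধুনীয়া ৰাজ্য।"),
])
print(prompt)
```

Whatever layout is chosen, the same one should be used when preparing training data and when prompting the trained model.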
Training data
The tokenizer was trained on 12.5 million Assamese sentences collected from public sources, including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.
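The actual cleaning pipeline is not described here, so the following is only a rough sketch of what cleaning, filtering, and deduplication can look like for an Assamese corpus. The thresholds (`min_chars`, `min_ratio`) and function names are assumptions chosen for illustration.

```python
def is_mostly_assamese(line, min_ratio=0.5):
    """True if at least min_ratio of the non-space characters fall in the
    Bengali-Assamese Unicode block (U+0980..U+09FF)."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return False
    in_block = sum(0x0980 <= ord(ch) <= 0x09FF for ch in chars)
    return in_block / len(chars) >= min_ratio


def clean_corpus(lines, min_chars=10):
    """Collapse whitespace, drop short or non-Assamese lines, remove exact duplicates."""
    seen = set()
    kept = []
    for line in lines:
        line = " ".join(line.split())  # trim and collapse runs of whitespace
        if len(line) < min_chars or not is_mostly_assamese(line):
            continue
        if line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return kept


sample = "অসম এখন ধুনীয়া ৰাজ্য।"
raw = [sample, "  " + sample + "  ", "This line is English only.", sample[:3]]
print(clean_corpus(raw))  # only one copy of the sample sentence survives
```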
Usage
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")
text = "অসম এখন ধুনীয়া ৰাজ্য।"
ids = sp.EncodeAsIds(text)
pieces = sp.EncodeAsPieces(text)
decoded = sp.DecodeIds(ids)
print(f"Tokens: {len(pieces)}, IDs: {ids}")
print(f"Match: {decoded == text}")
Output:
Tokens: 5, IDs: [346, 344, 4628, 550, 282]
Match: True
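To see which piece corresponds to which ID, the two can be paired up. This is a small sketch; `piece_id_pairs` is a hypothetical helper, and it relies only on the `EncodeAsIds` and `IdToPiece` methods of the processor passed in.

```python
def piece_id_pairs(sp, text):
    """Return a list of (piece, id) tuples for the given text."""
    ids = sp.EncodeAsIds(text)
    return [(sp.IdToPiece(i), i) for i in ids]
```

With the processor loaded as shown above, `piece_id_pairs(sp, text)` yields one tuple per token, which makes it easy to inspect how a sentence was segmented.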
Training an Assamese language model
The tokenizer is the foundation. Here is how it fits into a complete training pipeline:
Step 1: Tokenize your data
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Read the training corpus as UTF-8 text
with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Convert the whole corpus into a flat list of token IDs
ids = sp.EncodeAsIds(text)
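Before training, the flat list of IDs is usually cut into fixed-length sequences, each paired with the same sequence shifted by one position as the prediction target. A minimal sketch, assuming non-overlapping blocks; `make_examples` and `block_size` are illustrative names, not part of this repository.

```python
def make_examples(ids, block_size):
    """Cut a flat ID list into (inputs, targets) pairs for next-token prediction.

    Each target sequence is the input sequence shifted left by one token.
    Leftover IDs that do not fill a complete block are dropped.
    """
    examples = []
    for start in range(0, len(ids) - block_size, block_size):
        chunk = ids[start:start + block_size + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples


demo_ids = list(range(10))
for inputs, targets in make_examples(demo_ids, block_size=4):
    print(inputs, "->", targets)
```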
Step 2: Train a model
Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.
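A full transformer is beyond the scope of a README, but the next-token objective itself can be shown with a much simpler stand-in. The sketch below is a bigram counter, not a transformer: it only records which token ID most often follows each ID. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(ids):
    """Count, for every token ID, how often each other ID follows it."""
    counts = defaultdict(Counter)
    for current, following in zip(ids, ids[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, token_id):
    """Return the most frequent follower of token_id, or None if it was never seen."""
    followers = counts.get(token_id)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# IDs taken from the usage example above, with the first two repeated
train_ids = [346, 344, 4628, 550, 282, 346, 344]
model = train_bigram(train_ids)
print(predict_next(model, 346))
```

A transformer replaces the lookup table with a neural network that conditions on the whole preceding context, but it is trained toward the same goal: given the tokens so far, predict the next one.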
Step 3: Generate text
prompt = "অসম এখন"
prompt_ids = sp.EncodeAsIds(prompt)
# The model predicts subsequent tokens one at a time
# generated_ids = model.generate(prompt_ids)
# Convert the output back to Assamese
# generated_text = sp.DecodeIds(generated_ids)
The tokenizer remains the same throughout: it is used for both training and inference.
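The generation step can be sketched as a simple greedy loop. This assumes a hypothetical `predict_next` callable that takes the IDs so far and returns the next ID (or `None` to stop); how that callable wraps a trained model is left open.

```python
def generate(prompt_ids, predict_next, max_new_tokens=20, eos_id=None):
    """Greedy generation: repeatedly append the predicted next token ID.

    Stops after max_new_tokens, when predict_next returns None,
    or when the end-of-sequence ID is produced.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = predict_next(ids)
        if next_id is None or next_id == eos_id:
            break
        ids.append(next_id)
    return ids
```

With the tokenizer loaded, the full loop would be `sp.DecodeIds(generate(sp.EncodeAsIds(prompt), predict_next))`: encode the prompt, extend it token by token, then decode the result back into Assamese text.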
Files
| File | Description |
|---|---|
| tokenizer.model | The trained tokenizer model |
| tokenizer.vocab | Vocabulary of 32,000 tokens with scores |
| demo.py | Example script demonstrating usage |
Author
Anand Dey
Email: ananddey.nic@gmail.com
License
MIT