Assamese Tokenizer

অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।

A tokenizer for the Assamese language (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from.

What is a tokenizer?

Computers and AI models process numerical data, not natural language. A tokenizer bridges this gap by converting text into numerical representations: it breaks sentences into smaller units called tokens and assigns each token a unique numeric identifier.

For example, "অসম এখন ধুনীয়া ৰাজ্য" is split into 4 tokens:

অসম | এখন | ধুনীয়া | ৰাজ্য

Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning.
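
To make the idea concrete, here is a toy sketch of the text-to-ID mapping. It is not this tokenizer's algorithm: it simply splits on whitespace and numbers tokens in order of first appearance, whereas the real model uses a learned SentencePiece vocabulary. The function names and the resulting IDs are illustrative only.

```python
def build_vocab(sentences):
    """Assign each distinct whitespace-separated token an integer ID."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map a sentence to the list of IDs of its tokens."""
    return [vocab[token] for token in sentence.split()]


def decode(ids, vocab):
    """Map IDs back to tokens and join them with single spaces."""
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


corpus = ["অসম এখন ধুনীয়া ৰাজ্য"]
vocab = build_vocab(corpus)
ids = encode(corpus[0], vocab)
print(ids)
print(decode(ids, vocab) == corpus[0])
```

A whitespace vocabulary like this cannot handle unseen words, which is one reason subword tokenizers are used in practice.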

Why this tokenizer exists

Most tokenizers are designed for English or Hindi, and their Assamese support is limited and often inadequate. This tokenizer was built from scratch for the Assamese language: it understands the Assamese script, handles compound words, and covers the full character set.

  • 32,000 tokens — common words remain intact; rare words split naturally
  • Zero unknown tokens — every Assamese character is recognized
  • Lossless roundtrip — encoding and decoding produces the original text
  • Assamese digits are tokenized individually: ২০২৪ is split into separate digits rather than merged
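
The properties listed above can be spot-checked on your own text. The helpers below are a minimal sketch, not part of this repository: the function names are hypothetical, and `sp` is assumed to be a loaded `SentencePieceProcessor` (or any object exposing the same `EncodeAsIds`, `EncodeAsPieces`, `DecodeIds`, and `unk_id` methods).

```python
# Assamese digits ০ through ৯ (U+09E6 to U+09EF).
ASSAMESE_DIGITS = set("০১২৩৪৫৬৭৮৯")


def roundtrip_ok(sp, text):
    """True if decoding the encoded IDs reproduces the text exactly."""
    return sp.DecodeIds(sp.EncodeAsIds(text)) == text


def digits_split(sp, text):
    """True if no single piece contains more than one Assamese digit."""
    return all(
        sum(ch in ASSAMESE_DIGITS for ch in piece) <= 1
        for piece in sp.EncodeAsPieces(text)
    )


def has_no_unknowns(sp, text):
    """True if the unknown-token ID never appears in the encoding."""
    return sp.unk_id() not in sp.EncodeAsIds(text)
```

With the model loaded as in the Usage section, `roundtrip_ok(sp, "অসম এখন ধুনীয়া ৰাজ্য।")` should return True if the claims above hold. Exact roundtrips may depend on the normalization settings the model was trained with, for example whether repeated spaces are collapsed.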

Special tokens

These tokens are used for chat and instruction-following models:

<|system|> <|user|> <|assistant|> <|endoftext|>
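
This README lists the special tokens but does not define a chat template. The sketch below shows one possible layout, purely as an assumption: each turn is written as its role token, the message text, and `<|endoftext|>`, and the prompt ends with an open `<|assistant|>` turn. The `build_chat_prompt` helper is hypothetical; whatever layout you choose, use the same one for training and inference.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|endoftext|>"


def build_chat_prompt(messages):
    """Format (role, text) pairs as one string, leaving the assistant turn open."""
    parts = []
    for role, text in messages:
        # Assumed layout: role token, message text, end token.
        parts.append(f"{ROLE_TOKENS[role]}{text}{END_TOKEN}")
    # End with the assistant token so the model continues from there.
    parts.append(ROLE_TOKENS["assistant"])
    return "".join(parts)


prompt = build_chat_prompt([("user", "অসম এখন ধুনীয়া ৰাজ্য।")])
print(prompt)
```

The resulting string can then be passed to `sp.EncodeAsIds` like any other text.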

Training data

The tokenizer was trained on 12.5 million Assamese sentences collected from public sources, including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.
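
The exact cleaning pipeline is not published here. As an illustration of what cleaning, filtering, and deduplication can look like, the sketch below normalizes whitespace, keeps lines that are mostly written in the Bengali-Assamese Unicode block, and drops exact duplicates. The function names and the 0.5 threshold are assumptions chosen for the example, not the settings used for this tokenizer.

```python
import re

# Bengali-Assamese Unicode block (U+0980 to U+09FF).
ASSAMESE_CHARS = re.compile(r"[\u0980-\u09FF]")


def clean(line):
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", line).strip()


def is_mostly_assamese(line, min_ratio=0.5):
    """True if at least min_ratio of non-space characters are in the script block."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return False
    hits = sum(1 for ch in chars if ASSAMESE_CHARS.match(ch))
    return hits / len(chars) >= min_ratio


def prepare(lines):
    """Clean, filter, and deduplicate while preserving first-seen order."""
    seen = set()
    kept = []
    for raw in lines:
        line = clean(raw)
        if is_mostly_assamese(line) and line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


sample = [
    "অসম  এখন ধুনীয়া ৰাজ্য।",   # extra space, becomes a duplicate after cleaning
    "অসম এখন ধুনীয়া ৰাজ্য।",
    "Click here to subscribe",    # not Assamese, filtered out
]
print(prepare(sample))
```

Real pipelines usually add near-duplicate detection and length filters on top of this.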

Usage

import sentencepiece as spm

# Load the trained tokenizer model from the current directory
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

text = "অসম এখন ধুনীয়া ৰাজ্য।"
ids = sp.EncodeAsIds(text)        # text -> list of token IDs
pieces = sp.EncodeAsPieces(text)  # text -> list of token strings
decoded = sp.DecodeIds(ids)       # token IDs -> text

print(f"Tokens: {len(pieces)}, IDs: {ids}")
print(f"Match: {decoded == text}")

Output:

Tokens: 5, IDs: [346, 344, 4628, 550, 282]
Match: True

Training an Assamese language model

The tokenizer is the foundation. Here is how it fits into a complete training pipeline:

Step 1 — Tokenize your data

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read()

ids = sp.EncodeAsIds(text)
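
For a large corpus, reading the whole file into one string can use a lot of memory. A possible alternative, sketched below, is to encode line by line and append a separator ID after each line. The `encode_lines` helper is hypothetical, and using `<|endoftext|>` as the separator is an assumption about how you want to mark boundaries.

```python
def encode_lines(lines, encode, sep_id):
    """Encode each non-empty line and append a separator ID after it."""
    ids = []
    for line in lines:
        line = line.strip()
        if line:
            ids.extend(encode(line))
            ids.append(sep_id)
    return ids


# Example with the real tokenizer (assumes `sp` is loaded as above):
# sep_id = sp.PieceToId("<|endoftext|>")
# with open("corpus.txt", "r", encoding="utf-8") as f:
#     ids = encode_lines(f, sp.EncodeAsIds, sep_id)
```

Passing the encoder as a function keeps the helper independent of any particular tokenizer object.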

Step 2 — Train a model

Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.
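
Next-token prediction means that the target sequence is the input sequence shifted by one position. The sketch below shows how a stream of token IDs could be sliced into such pairs; `make_examples` and `block_size` are hypothetical names, and a real training loop would batch these and convert them to tensors.

```python
def make_examples(ids, block_size):
    """Slice a token stream into (input, target) pairs of length block_size.

    Each target is its input shifted one position to the right.
    """
    examples = []
    for start in range(0, len(ids) - block_size, block_size):
        chunk = ids[start : start + block_size + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples


# IDs taken from the Usage output above.
ids = [346, 344, 4628, 550, 282]
for inputs, targets in make_examples(ids, block_size=2):
    print(inputs, "->", targets)
```

Consecutive chunks overlap by one token, so every position except the very first appears as a target exactly once.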

Step 3 — Generate text

prompt = "অসম এখন"
prompt_ids = sp.EncodeAsIds(prompt)

# The model predicts subsequent tokens one at a time
# generated_ids = model.generate(prompt_ids)

# Convert the output back to Assamese
# generated_text = sp.DecodeIds(generated_ids)
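
The "one token at a time" loop can be sketched independently of any model framework. In the example below, `generate` and `toy_next_token` are hypothetical: `next_token` stands in for whatever function returns your model's prediction for the next ID, and the toy version exists only so the loop can run on its own.

```python
def generate(next_token, prompt_ids, max_new_tokens, eos_id=None):
    """Append one predicted token at a time until the limit or eos_id is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        if token == eos_id:
            break
        ids.append(token)
    return ids


def toy_next_token(ids):
    """Stand-in for a model: always predicts the last ID plus one."""
    return ids[-1] + 1


print(generate(toy_next_token, [10, 11], max_new_tokens=3))
```

With a trained model, the returned IDs would be passed to `sp.DecodeIds` to obtain Assamese text.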

The tokenizer remains the same throughout: it is used for both training and inference.

Files

File             Description
tokenizer.model  The trained tokenizer model
tokenizer.vocab  Vocabulary of 32,000 tokens with scores
demo.py          Example script demonstrating usage
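
The vocabulary file can be inspected without loading the model. SentencePiece typically writes `.vocab` files as plain text with one piece per line, followed by a tab and its score, in ID order. The sketch below parses that layout; `parse_vocab` is a hypothetical helper, and the sample scores are made up for illustration.

```python
def parse_vocab(lines):
    """Return (piece, score) pairs; a piece's position in the list is its ID."""
    entries = []
    for line in lines:
        piece, score = line.rstrip("\n").rsplit("\t", 1)
        entries.append((piece, float(score)))
    return entries


# Illustrative lines only; real scores come from tokenizer.vocab.
sample_lines = ["<unk>\t0\n", "▁অসম\t-5.25\n"]
for token_id, (piece, score) in enumerate(parse_vocab(sample_lines)):
    print(token_id, piece, score)

# With the real file:
# with open("tokenizer.vocab", "r", encoding="utf-8") as f:
#     entries = parse_vocab(f)
```

The "▁" prefix on a piece marks a position where the original text had a space before the token.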

Author

Anand Dey

Email: ananddey.nic@gmail.com

License

MIT
