This is super helpful, thanks! I'll get up to speed on the literature and keep your use case in mind :)
Omar Kamali
So you basically still want ASR-style transcription before the LLM kicks in (perhaps to reduce hallucination, or another purpose?), but would like a richer representation so a downstream LLM can still reason about pronunciation, pauses, and so on?
Hah yeah that rendering bug is for sure a meta joke (played on me :D).
Speech is for sure something I'd like to address. This work is deeply grounded in phonetics, as you guessed (I wrote a paper on this topic because I love wordplay https://doi.org/10.14746/linpo.2025.67.1.8 and it's kind of a precursor to this method), so it must work with audio. I just have to figure out the right approach and objective.
What are the most critical gaps you see in voice AI that need improvement?
I knowww. Need to fix the video pipeline lol
Thanks @alfredo-ottomate! In principle, it should be faster than a conventional LLM at the same scale while also using less VRAM. Mostly because it removes the softmax layer, which is one of the more expensive operations in standard language models. It also removes the embedding table, which usually accounts for roughly 10-20% of the parameters. For example, in Qwen 3.5 4B, that's about 700M embedding parameters eliminated.
Raw performance-wise, I expect a ~10% per-token generation speedup, ~10% less VRAM usage, and better use of the context window, since each token is a full word rather than a subword piece.
The question then is how many parameters my replacement mechanism will ultimately need to stay competitive. The approach is already working surprisingly well at around 4M parameters, which is about 0.6% of the alternative at 4B total. Even if that number grows, the efficiency upside still looks very promising.
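To make the parameter savings concrete, here is a back-of-envelope calculation of the embedding table's share of a small LLM. The vocabulary size, hidden width, and total parameter count below are illustrative assumptions, not the actual figures for any specific model:

```python
# Rough arithmetic behind the "10-20% of parameters" claim.
# All numbers are assumed for illustration.
vocab_size = 150_000            # assumed subword vocabulary size
hidden_size = 2_560             # assumed model width
total_params = 4_000_000_000    # assumed 4B-parameter model

# One embedding table; an untied output projection doubles it.
embedding_params = vocab_size * hidden_size
untied_params = 2 * embedding_params

share = untied_params / total_params
print(f"{untied_params / 1e6:.0f}M embedding params, {share:.1%} of total")
# → 768M embedding params, 19.2% of total
```

With tied input/output embeddings the share halves, which is roughly where the lower end of the 10-20% range comes from.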
Fingers crossed!
Quick update, it seems to mostly work as intended 🤯
More details here:
https://x.com/OmarKamali/status/2036932984226320748
- 8 Evals
- 10 Models (GlotLID, OpenLID, our own Gherbal and others)
- 200+ Languages
- One Leaderboard To Rule Them All!
Come find your language and which LID model supports it best in this space:
omneity-labs/lid-benchmark
I added a decoding head to the LLM, so the MLP generates a latent word vector that gets decoded by a GRU into a valid word.
I'm using the same input representation and train a joint encoder-decoder which gets further fine-tuned as part of the "Next Latent Prediction"(?) objective and it seems to be pretty decent for a first shot. Still working out some of the kinks.
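The post shares no code, so here is a minimal sketch of what a GRU head that decodes a latent word vector into a character sequence could look like. The dimensions, character-set size, greedy unrolling, and all names are my assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class GRUWordDecoder(nn.Module):
    """Hypothetical decoding head: latent word vector -> character logits."""

    def __init__(self, latent_dim=512, hidden_dim=256, n_chars=128, max_len=24):
        super().__init__()
        self.init_h = nn.Linear(latent_dim, hidden_dim)  # latent -> initial GRU state
        self.gru = nn.GRU(n_chars, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)        # per-step character logits
        self.n_chars, self.max_len = n_chars, max_len

    def forward(self, latent):
        B = latent.size(0)
        h = torch.tanh(self.init_h(latent)).unsqueeze(0)  # (1, B, hidden_dim)
        prev = torch.zeros(B, 1, self.n_chars)            # BOS as an all-zero input
        logits = []
        for _ in range(self.max_len):
            out, h = self.gru(prev, h)                    # one decoding step
            step = self.out(out)                          # (B, 1, n_chars)
            logits.append(step)
            # Greedy feedback: one-hot of the argmax character.
            prev = torch.zeros_like(prev).scatter_(
                2, step.argmax(-1, keepdim=True), 1.0)
        return torch.cat(logits, dim=1)                   # (B, max_len, n_chars)

latent = torch.randn(2, 512)
logits = GRUWordDecoder()(latent)
print(logits.shape)  # torch.Size([2, 24, 128])
```

In training one would typically teacher-force the character inputs and add a stop symbol so decoded words have variable length; the greedy loop above is only the inference-time shape of the idea.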
That's planned! I'm just running a few more experiments before locking it in.
Will let you know first @unmodeled-tyler :)
I'm training a 22M-param LLM right now to test this "thing" and it's able to formulate coherent sentences 🤯
Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.
Check the explainer video to understand what's happening. Feedback welcome on this approach!
In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."
He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.
I could've shrugged and moved on. Instead I spent the next months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).
Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.
Here's the full story of how I built Wikipedia Monthly:
https://omarkamali.com/blog/wikipedia-monthly-pipeline
I just released omarkamali/wikipedia-labels, with all the structural labels and namespaces from Wikipedia in 300+ languages. A gift for the data preprocessors and cleaners among us.
Happy New Year 2026, everyone!
Thanks Louis!
That's a great idea, stay tuned for the next version of picomon!
- Supports all of AMD, Nvidia, and Apple Silicon
- Beautiful TUI with themes (who said monitoring should be boring?)
- Shareable Rig Cards! Boast to friends, family, and foes alike 🫨
Get it now!
uvx picomon or pip install picomon, then picomon.

picomon! AMD GPU Monitoring made easy

Just run uvx picomon and behold:

[TUI screenshot: one panel per GPU with sparkline graphs for GFX, PWR, and VRAM — e.g. GPU 0: GFX 42%, UMC 21%, PWR 135/250W (54%), VRAM 10.0/16.0GB (62%); GPU 1: GFX 78%, UMC 66%, PWR 210/250W (84%), VRAM 14.5/16.0GB (90%)]

Repo at https://github.com/omarkamali/picomon
Or pypi at https://pypi.org/project/picomon
・ Fixed a bug to remove infobox leftovers and other wiki markers such as __TOC__
・ New Python package https://pypi.org/project/wikisets: a dataset builder with efficient sampling so you can combine the languages you want seamlessly for any date (ideal for pretraining data, but works for any purpose)
・ Moved the pipeline to a large server. Much higher costs, but better reliability and predictability (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing for this month due to shenanigans with the migration, but should be back in December's update.
Check out the dataset:
omarkamali/wikipedia-monthly
Hey @MarcusLammers , this is a great idea! I will try to include it in a future iteration based on Wikipedia categories.
Highlights of October's edition:
· 🗣️ 341 languages
· 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)
We are now sampling a random subset of each language with a reservoir sampling method to produce 1000, 5000, and 10000 splits in addition to the existing train split that contains all the data.

Now you can load the English (or your favorite language) subset in seconds:

dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")

Happy data engineering! 🧰
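The new fixed-size splits rely on reservoir sampling, which draws a uniform sample from a stream of unknown length in one pass. Here is a minimal sketch of the classic Algorithm R; the function and variable names are mine, not the pipeline's actual code:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Sample 1000 "articles" from a million-item stream without holding it in memory.
articles = (f"article_{i}" for i in range(1_000_000))
subset = reservoir_sample(articles, 1000)
print(len(subset))  # 1000
```

Each article is seen exactly once and the reservoir never exceeds k items, which is what makes this practical for per-language dumps too large to shuffle in memory.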
omarkamali/wikipedia-monthly
May God protect you, brother @ayymen! (Moroccan Darija: "lay 7fdek a khouya")
Highlights of this edition:
· 🗣️ 341 languages
· 63.1M articles
· 📦 86.5GB of data
This update also solves upload issues in the August edition where some languages had missing parts. Happy data engineering!
omarkamali/wikipedia-monthly


