Stentor3-20M
🔬 Research Artifact & Base Language Model. Stentor3-20M is a next-token predictor — not a chat assistant. It has no safety tuning and should not be deployed in user-facing applications without fine-tuning first. It is, however, a high-quality open foundation: fine-tune it, quantize it, convert it, distill from it, run LoRA on it, adapt it to your domain, or build anything else you can imagine — and please publish your results! See Intended Uses for details.
💡 Built by a solo 14-year-old developer, on a laptop, for $0. Every model StentorLabs has released — including this one — was conceived, designed, coded, and trained without a budget, a team, a GPU cluster, or institutional support. The total cost of producing Stentor3-20M was zero dollars, using free Kaggle TPU quota and publicly available datasets. This is what democratized AI research looks like.
What Is This?
Stentor3-20M is the sibling model to Stentor3-50M in the Stentor3 family — a new generation of StentorLabs models trained on a full web/code/math curriculum with a 4096-token context window.
At ~20M parameters, Stentor3-20M is a base causal language model trained entirely from scratch on free-tier Kaggle compute using a Google Cloud TPU v5e-8 (eight chips). It was trained for a full 32,425 steps on approximately 8.5 billion tokens drawn from a web/code/math curriculum, completing epoch 0 in ~8.1 hours.
Like all StentorLabs models, this is a base next-token predictor, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best suited for research, prototyping, speculative decoding, and infrastructure experiments.
The key architectural features: Grouped Query Attention (10 query heads, 2 KV heads — 5:1 ratio), a compact TokenMonster vocabulary (4,096 tokens), and a 4096-token context window.
GQA at this scale worked cleanly: 10 query heads with 2 KV heads delivered an 80% KV cache reduction at inference time with no training instability observed.
The Stentor3 Name
Why "Stentor3"?
Stentor is a genus of large, trumpet-shaped single-celled organisms — simple in structure, surprisingly capable for their scale. These organisms are among the largest known unicellular life forms, reaching up to 2mm in length, yet they operate as a single cell with no specialized organs — a fitting metaphor for a compact language model that punches far above its weight class. Like the stentor, which can regenerate from fragments and adapt to changing environments, this model is designed to be adaptable and resilient, serving as a foundation that can be fine-tuned, quantized, distilled, and adapted across countless downstream tasks. The name has anchored nearly every StentorLabs model since the beginning: compact, functional, and punching above its weight.
Model Architecture
Stentor3-20M is a LlamaForCausalLM model with Grouped Query Attention (GQA), a 4,096-token TokenMonster vocabulary, and a 4096-token context window.
| Component | Value | Notes |
|---|---|---|
| Architecture | LlamaForCausalLM | Standard transformer decoder |
| Hidden Size | 320 | |
| Intermediate Size (FFN) | 1,280 | SwiGLU activation |
| Num Hidden Layers | 12 | |
| Num Attention Heads | 10 | |
| Num Key/Value Heads | 2 | GQA — 5:1 query-to-KV ratio |
| Context Length | 4,096 tokens | |
| Head Dimension | 32 | |
| Vocab Size | 4,096 | TokenMonster english-4096-strict-nocapcode-v1 |
| Total Parameters | 20,324,160 | |
| Positional Encoding | RoPE | rope_theta = 50,000.0 |
| Activation | SiLU | SwiGLU implementation |
| Tied Embeddings | False | Separate input/output embeddings |
| Implementation | SDPA | PyTorch Scaled Dot Product Attention |
GQA explanation & parameter count breakdown
Why GQA?
Grouped Query Attention (10Q, 2KV) reduces the KV cache memory footprint by 80% at inference time compared to standard Multi-Head Attention at the same hidden size. At a 4096-token context window this matters substantially: the KV cache for a single sequence is proportional to 2 × num_kv_heads × head_dim × num_layers × seq_len. With 2 KV heads instead of 10, the cache shrinks to one-fifth of its full-MHA equivalent, enabling longer generation on memory-constrained hardware.
GQA at this scale worked cleanly — the training was stable from start to finish. However, I should note that this was not the case for every Stentor model. During development of Stentor2-12M-Preview, GQA (at that smaller 12M scale) was a major contributor to training instability. For the Stentor2 series I switched to full MHA because the KV-head reductions kept collapsing. This experience shaped how I approached GQA for Stentor3: at 20M and especially at 50M (over 2× larger), the stability margin is substantially wider. GQA was a major factor in the 12M instability, but there were other contributing factors (training recipe, hardware precision handling) that compounded the problem. By Stentor3, the new TPU-based training regime provided stable training end-to-end, and GQA worked without issue.
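The cache saving follows directly from the head counts in the architecture table. A minimal sketch of that arithmetic, assuming the head dimension and layer count are held fixed between the MHA and GQA variants (the function name is illustrative, not part of any library):

```python
def kv_cache_reduction(num_attention_heads: int, num_key_value_heads: int) -> float:
    """Fraction of KV-cache memory saved by GQA relative to full MHA.

    Assumes head_dim, layer count, and sequence length are identical in
    both variants, so the cache size scales only with the KV head count.
    """
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    return 1.0 - num_key_value_heads / num_attention_heads


# Stentor3-20M head counts from the architecture table
reduction = kv_cache_reduction(num_attention_heads=10, num_key_value_heads=2)
print(f"KV cache reduction vs. MHA: {reduction:.0%}")
```

The divisibility check mirrors the usual GQA constraint that each KV head serves a whole group of query heads (here, 5 per KV head).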
Why TokenMonster-4k?
Two deliberate reasons — both unconventional for a model this size.
Reason 1 — Non-tied embeddings become affordable.
Most small language models use tied (shared) input and output embedding weights. The original motivation comes from Press & Wolf (2016): sharing the two matrices reduces parameter count significantly, acts as a form of regularization, and was shown to improve perplexity in early recurrent models. With a 32K vocabulary and a hidden size of 320, the two matrices together would consume ~21M parameters (2 × 32,768 × 320) on top of an ~18M transformer backbone, essentially doubling the model. Tying halves that cost, so at large vocabularies tying is practically necessary for small models.
With a 4K vocabulary the math changes completely. Each embedding matrix is only ~1.3M parameters. Keeping them separate adds just 1.3M parameters — 6.4% of Stentor3-20M's total. That's an affordable price to let the two matrices specialize.
Why does that specialization matter? The input embedding is asked to encode a token into a vector that represents its meaning in context. The output projection is asked to decode a hidden state into which token should come next. These are related but genuinely distinct tasks. A 2026 paper (Weight Tying Biases Token Embeddings Towards the Output Space, Lopardo et al., arXiv 2603.26663) provides mechanistic evidence that tied weights skew the shared matrix toward output prediction at the cost of input representation quality — and that this trade-off hurts more as model scale increases. The Qwen3 family explicitly ties embeddings for its smaller models and unties them for larger ones, acknowledging that the calculus shifts with scale.
My opinion, not tested: At 20M parameters with a 4K vocab, the untied design likely gives the model better input representations without meaningfully harming output quality. The 12.9% embedding share is high compared to 70B-parameter models (where vocab params are a rounding error), but it is tiny relative to typical models at this size — most comparable models spend 16–33% of their parameter budget on embeddings at 32K vocab. The extra 1.3M parameters are well spent. This is a design preference I believe in; I have not run an ablation to prove it.
Reason 2 — Each token type gets 8× more training signal.
With 8.5B training tokens spread over 4,096 token types, each type appears on average ~2.1 million times during training. With a 32K (32,768-entry) vocabulary, the same 8.5B tokens would give each type only ~259,000 training examples on average, an 8× reduction. Matching the per-type coverage with a 32K vocabulary would require approximately 68B tokens, which is far beyond what Kaggle's weekly free TPU quota allows.
The acknowledged downside: a 4K vocabulary tokenizes less efficiently than a 32K one — each document produces longer token sequences, which increases context usage and compute per sample. I consider this an acceptable trade-off at this scale. The training signal density gain outweighs the efficiency cost for a model training on free compute with a hard token budget.
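The per-type training signal argument is simple division, and it can be checked in a few lines. This is a rough sketch that assumes a 32,768-entry vocabulary for the "32K" case and treats token frequencies as uniform; real token distributions are heavily skewed, so these are averages only.

```python
def mean_occurrences_per_type(training_tokens: int, vocab_size: int) -> float:
    """Average number of times each vocabulary entry is seen during training."""
    return training_tokens / vocab_size


TRAINING_TOKENS = 8_500_000_000
SMALL_VOCAB = 4_096    # TokenMonster-4k
LARGE_VOCAB = 32_768   # assumed size of a "32K" vocabulary

small = mean_occurrences_per_type(TRAINING_TOKENS, SMALL_VOCAB)
large = mean_occurrences_per_type(TRAINING_TOKENS, LARGE_VOCAB)

# Tokens a 32K-vocab model would need to match the 4K model's per-type average
tokens_needed = TRAINING_TOKENS * LARGE_VOCAB // SMALL_VOCAB

print(f"4K vocab:  {small:,.0f} occurrences per type")
print(f"32K vocab: {large:,.0f} occurrences per type")
print(f"Ratio: {small / large:.0f}x, tokens needed at 32K: {tokens_needed / 1e9:.0f}B")
```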
Parameter Count Breakdown
```python
def estimate_llama_params_gqa(vocab_size, hidden_size, intermediate_size,
                              num_hidden_layers, num_attention_heads, num_key_value_heads):
    """Count parameters of a Llama-style decoder with GQA and untied embeddings."""
    # Width of the K and V projections: num_key_value_heads × head_dim
    kv_dim = hidden_size * num_key_value_heads // num_attention_heads
    q_proj = hidden_size * hidden_size
    k_proj = hidden_size * kv_dim
    v_proj = hidden_size * kv_dim
    o_proj = hidden_size * hidden_size
    attn = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * hidden_size * intermediate_size  # gate, up, down
    norm = 2 * hidden_size                     # input + post-attention RMSNorm
    # Input embedding + untied LM head, all decoder layers, final RMSNorm
    total = 2 * vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total
```
Plugging in Stentor3-20M values:
kv_dim = 320 × (2/10) = 64
q_proj = 320 × 320 = 102,400
k_proj = 320 × 64 = 20,480
v_proj = 320 × 64 = 20,480
o_proj = 320 × 320 = 102,400
attn/layer = 245,760
mlp/layer = 3 × 320 × 1,280 = 1,228,800
norm/layer = 2 × 320 = 640
per_layer = 1,475,200
embedding = 4,096 × 320 = 1,310,720
lm_head = 4,096 × 320 = 1,310,720 (untied)
layers = 12 × 1,475,200 = 17,702,400
final_norm = 320
total = 1,310,720 + 1,310,720 + 17,702,400 + 320 = 20,324,160 ✓
| Component | Parameters | % of Total |
|---|---|---|
| Embedding Table | 1,310,720 | 6.4% |
| LM Head (untied) | 1,310,720 | 6.4% |
| Transformer Layers × 12 | 17,702,400 | 87.1% |
| — Attention (per layer × 12) | 2,949,120 | 14.5% |
| — FFN/MLP (per layer × 12) | 14,745,600 | 72.5% |
| — Layer Norms (per layer × 12) | 7,680 | 0.04% |
| Final RMS Norm | 320 | 0.002% |
| Total | 20,324,160 | 100% |
With a compact 4K vocabulary, embedding takes only 12.9% of the parameter budget — leaving 87.1% for the transformer stack. This is an unusually efficient allocation, made possible by TokenMonster's English-optimized compression.
Architectural Foundations: The Depth Delusion & Width vs. Depth Tradeoffs
Stentor3-20M implements a "Width over Depth" philosophy informed by recent research into small model scaling.
- The Depth Delusion (Wu et al., 2026): Argues that adding depth to a model yields diminishing returns. Stentor3-20M uses a balanced 12-layer stack with a hidden size of 320 to maximize overall efficiency.
- The Depth Myth (Izumoto, 2026): Synthesizes findings from 30+ papers to argue that modern LLMs are systematically too deep and narrow.
Head-to-Head: StentorLabs Model Family
Comparison table across all generations
| Property | Stentor-12M (Gen 1) | Stentor-30M (Gen 1) | Stentor2-12M | Stentor2-30M | Stentor3-20M | Stentor3-50M | Portimbria-150M |
|---|---|---|---|---|---|---|---|
| Vocabulary | 32,768 (Mistral BPE-32k) | 32,768 (Mistral BPE-32k) | 8,064 (TokenMonster-8k) | 8,064 (TokenMonster-8k) | 4,096 (TokenMonster-4k) | 4,096 (TokenMonster-4k) | 32,768 (Mistral BPE-32k) |
| Hidden Size | 192 | 256 | 256 | 512 | 320 | 512 | 768 |
| Intermediate Size | 576 | 1,024 | 512 | 1,024 | 1,280 | 1,536 | 2,048 |
| Num Layers | 9 | 21 | 12 | 10 | 12 | 15 | 20 |
| Attention Heads | 3 | 4 | 4 | 8 | 10 | 8 | 6 |
| KV Heads | 3 (MHA) | 4 (MHA) | 4 (MHA) | 8 (MHA) | 2 (GQA) | 2 (GQA) | 2 (GQA) |
| Head Dimension | 64 | 64 | 64 | 64 | 32 | 64 | 128 |
| Context Length | 512 | 512 | 1,024 | 1,024 | 4,096 | 4,096 | 4,096 |
| Total Parameters | 12.0M | 30.4M | 12.3M | 30.4M | 20.3M | 49.4M | 151.0M |
| Embedding Share | 52.2% | 27.6% | 16.8% | 13.6% | 12.9% | 8.5% | 16.7% |
| Tied Embeddings | Yes | Yes | Yes | Yes | No | No | Yes |
| Training Tokens | ~200M | ~600M | ~480M | ~800M | ~8.5B | ~8.3B | ~6B |
| Training Hardware | 2× T4 | 1× T4 | 2× T4 | 2× T4 | TPU v5e-8 | TPU v5e-8 | TPU v5e-8 |
| Training Time | ~1.3h | ~7.88h | ~5h | ~6.75h | ~8.16h | ~8.74h | ~8h |
| Best Perplexity | 89.01 | 33.02 | 26.61 | 18.07 | 14.02 | 9.64 | 18.00 |
| Tokenizer | Mistral BPE-32k | Mistral BPE-32k | TokenMonster-8k | TokenMonster-8k | TokenMonster-4k | TokenMonster-4k | Mistral BPE-32k |
Cross-generation comparison guide:
- Gen 1 (Stentor-12M, Stentor-30M) and Portimbria-150M all use Mistral BPE-32k and are the closest cross-gen comparisons. Other architectural differences (context length, GQA, training data) still affect results, but the shared tokenizer makes comparisons more meaningful than across tokenizer families.
- Gen 2 (Stentor2-12M, Stentor2-30M) use TokenMonster-8k and are only directly comparable within Gen 2.
- Gen 3 (Stentor3-20M, Stentor3-50M) use TokenMonster-4k and are only directly comparable within Gen 3.
- For a controlled strength comparison between Stentor3 models, Portimbria, and other similar-scale models evaluated under identical conditions, see the Benchmark Results section.
PPL comparison caveat: Perplexity is not directly comparable across tokenizer families. A smaller vocabulary mechanically lowers raw perplexity. Within Gen 3, the comparison between Stentor3-20M (14.02) and Stentor3-50M (9.64) is valid — same tokenizer, same training setup.
Memory Requirements
How much VRAM you need depends on precision and whether you're generating (which activates the KV cache). The table below covers a single sequence at full 4096-token context — KV cache scales linearly, so at 1024 tokens it's roughly ¼ of the values shown.
| Precision | Weights | KV Cache (4096 ctx) | Total VRAM |
|---|---|---|---|
| FP32 | ~81 MB | ~25 MB | ~106 MB |
| FP16 / BF16 | ~41 MB | ~13 MB | ~54 MB |
| INT8 | ~21 MB | ~13 MB | ~34 MB |
| INT4 | ~11 MB | ~13 MB | ~24 MB |
KV cache note: GQA (2 KV heads) reduces the KV cache by 80% vs standard MHA at the same hidden size. Formula:
2 (K+V) × 2 (KV heads) × 32 (head_dim) × 12 (layers) × seq_len × bytes_per_element.
Weights note: Weights are saved as FP32 in safetensors. Cast on load with `torch_dtype=torch.float16` or `torch_dtype=torch.bfloat16` to halve weight memory. INT8/INT4 figures require bitsandbytes quantization as shown in the Quantization section.
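The KV cache formula above can be evaluated directly. The sketch below applies it for a single sequence and adds the raw weight footprint; the function names and the dtype lookup table are illustrative, and it ignores activations, framework overhead, and quantization metadata, so treat the output as a lower bound rather than a measured figure.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def kv_cache_bytes(seq_len, num_layers=12, num_kv_heads=2, head_dim=32, dtype="fp16"):
    """KV cache size for one sequence: 2 (K+V) x kv_heads x head_dim x layers x seq_len x bytes."""
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * BYTES_PER_ELEMENT[dtype]


def weight_bytes(num_params=20_324_160, dtype="fp16"):
    """Raw size of the model weights at the given precision."""
    return num_params * BYTES_PER_ELEMENT[dtype]


for dtype in ("fp32", "fp16"):
    weights_mb = weight_bytes(dtype=dtype) / 1e6
    cache_mb = kv_cache_bytes(4096, dtype=dtype) / 1e6
    print(f"{dtype}: weights {weights_mb:.1f} MB + KV cache {cache_mb:.1f} MB "
          f"= {weights_mb + cache_mb:.1f} MB")
```

Because the cache term is linear in `seq_len`, calling `kv_cache_bytes(1024)` returns exactly one quarter of the 4,096-token value.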
🚀 Quick Start
1. Install Dependencies
pip install transformers torch safetensors tokenmonster
2. Load the Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)
# TokenMonsterTokenizer is saved alongside the model and loaded via trust_remote_code
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
model = model.eval()
```
ℹ️ TokenMonster tokenizer. Stentor3-20M uses a compact 4,096-token TokenMonster vocabulary. `trust_remote_code=True` loads the bundled `tokenmonster_hf.py` wrapper automatically. You also need `pip install tokenmonster` for the underlying binary.
3. Generate Text
```python
prompt = "The history of computing began"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(next(model.parameters()).device)
attention_mask = torch.ones_like(input_ids)

with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.15,  # recommended minimum for this model
        pad_token_id=tokenizer.eos_token_id,
    )

# Drop the prompt tokens and decode only the newly generated continuation
generated = output[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
Pipeline usage & recommended generation settings
4. Using the Pipeline
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed to load the bundled TokenMonster tokenizer
)
result = pipe(
    "Neural networks are computational models",
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.15,  # recommended minimum for this model
    do_sample=True,
)
print(result[0]["generated_text"])
```
5. Recommended Generation Settings
These settings are based on manual testing and personal preference — not verified numerical benchmarks.
| Parameter | Recommended Range | Notes |
|---|---|---|
| `temperature` | 0.5 – 0.8 | Lower end for focused/factual output; upper end for more varied generation |
| `top_p` | 0.8 – 0.9 | Nucleus sampling range that works well in practice |
| `repetition_penalty` | 1.15 – 1.4 | Strongly recommended; prevents repetition loops common in small base LMs |
| `max_new_tokens` | 40 – 4,096 | Depends on task |
⚠️ Important Limitations
Not Instruction-Tuned: This is a base model. It will continue text, not follow instructions.
No Safety Tuning: No RLHF, no DPO, no content filtering.
Limited Factual Reliability: 20M parameters cannot store reliable world knowledge.
Context Window: Hard limit of 4,096 tokens.
English Only: TokenMonster english-4096 is optimized for English; other languages will tokenize poorly.
TokenMonster Required: Loading the tokenizer requires `pip install tokenmonster` and `trust_remote_code=True`.
Repetition Without Penalty: Always use `repetition_penalty ≥ 1.15`.
📋 Honest Notices
Candid first-hand observations about this model
These are candid observations based on direct generation testing.
It generates coherent sentences. The model reliably produces grammatical English. Even on confusing or unusual prompts, outputs are mostly real words in valid sentences — just often off-topic ones.
It struggles to stay on topic. The most consistent failure mode. A prompt about exercise leads to a rambling passage about muscle balance and pain management. A prompt about books triggers a response about programming language acronyms. The model grasps the subject area but veers into loosely related territory within a few sentences. Longer, more specific prompts help significantly.
It generates Unicode replacement characters (❑). A small number of responses contain ❑ (U+FFFD) characters. The model was not trained on StenCore-PDF and has no obvious clean source for this token; it likely appeared in code comments or data artifacts in the web/code corpus. Uncommon but reproducible.
It will generate code even when the prompt has nothing to do with code. Several average responses on non-code prompts contain code-like syntax, variable names, or programming terminology. The code training data bleeds into general text generation, especially on short or ambiguous prompts like `"Hi!!!"` or `"1 + 1 ="`.
Decent code structure for its size and training data, but not reliable for autocomplete. Given a Python code prompt, the model produces recognizable structure: for loops, print statements, function calls. The logic is wrong and the code would not run. It has clear potential if trained on significantly more code, but is not suitable for production use as-is.
Horrible at math. `"1 + 1 ="` produces a repetition loop about the value of a variable. The model knows that math prompts involve numbers and formulas but has no arithmetic ability.
Repeats a lot, especially under short or ambiguous prompts. `"Hi!!!"` collapses into a loop about variable values. `"1 + 1 ="` repeats the same phrase more than a dozen times before hitting the 200-token limit. A repetition penalty of at least 1.15 is essential; without it, short prompts frequently collapse into tight loops.
Training Infrastructure
Hardware, software stack & throughput details
Hardware
| Component | Specification |
|---|---|
| Accelerator | Google Cloud TPU v5e |
| Chip Configuration | 8-chip pod slice (v5e-8) |
| Active Training Processes | 8 (one per chip via torchrun + PJRT) |
| Global Batch Tokens/Step | 262,144 (8 × 4,096 × 8 processes) |
| Platform | Kaggle Notebooks (free tier) |
| Orchestration | HuggingFace Accelerate + torchrun |
| Process Group Init | env:// (XLA backend) |
Software Stack
| Package | Role |
|---|---|
| PyTorch 2.6 | Core tensor operations |
| torch_xla 2.6 | XLA/TPU backend |
| HuggingFace Transformers | Model architecture (LlamaForCausalLM) |
| HuggingFace Accelerate | Distributed training orchestration |
| HuggingFace Datasets | Data loading and streaming |
| safetensors | Model serialization |
Throughput
| Metric | Value |
|---|---|
| Average global tokens/sec | ~304,000 |
| Per-chip tokens/sec | ~38,000 |
| Total training tokens | ~8,500,000,000 |
| Total wall-clock time | 29,323s (~8.15h) |
| Optimizer steps completed | 32,425 (full run — no session deadline hit) |
Training Hyperparameters — Complete Reference
Full hyperparameter tables (optimizer, batch, schedule, checkpointing)
Core Training Parameters
| Hyperparameter | Value | Notes |
|---|---|---|
| `learning_rate` | 8e-4 | Peak AdamW LR |
| `weight_decay` | 0.01 | Applied to Linear weights only |
| `max_grad_norm` | 1.0 | Gradient clipping |
| `optimizer` | AdamW | betas=(0.9, 0.95), eps=1e-8 |
| `scheduler` | Cosine | With linear warmup |
| `warmup_steps` | 1,621 | 5% of max_train_steps |
| `stable_steps` | 25,940 | 80% of max_train_steps (WSD reference; cosine scheduler used) |
| `max_train_steps` | 32,425 | Full run completed |
| `token_budget` | 8,500,000,000 | Total training token cap |
| `source_token_budget` | 8,500,000,000 | Source data token cap |
| `seed` | 42 | |
| `mixed_precision` | bf16 | Native TPU BF16 |
Batch & Sequence Parameters
| Hyperparameter | Value | Notes |
|---|---|---|
| `per_device_train_batch_size` | 8 | Per TPU chip |
| `num_processes` | 8 | One per chip |
| `total_batch_size` | 64 | 8 × 8 |
| `block_size` | 4,096 | Sequence / context length |
| `tokens_per_optimizer_step` | 262,144 | total_batch_size × block_size |
| `gradient_accumulation_steps` | 1 | No accumulation |
| `num_train_epochs` | 1 | Token budget exhausted within epoch 0 |
| `pack` | True | Required for TPU static shapes |
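The batch figures in this table fit together arithmetically. The following sketch recomputes tokens per optimizer step and the number of steps needed to exhaust the token budget. It assumes the step count is the budget divided by tokens per step, rounded up; the training script itself is not shown here, so that derivation is an inference, and the function names are illustrative.

```python
def tokens_per_step(per_device_batch_size, num_processes, block_size, grad_accum_steps=1):
    """Tokens consumed by one optimizer step across all devices."""
    return per_device_batch_size * num_processes * grad_accum_steps * block_size


def steps_for_budget(token_budget, tokens_each_step):
    """Smallest number of optimizer steps that covers the token budget (integer ceiling)."""
    return (token_budget + tokens_each_step - 1) // tokens_each_step


step_tokens = tokens_per_step(per_device_batch_size=8, num_processes=8, block_size=4096)
steps = steps_for_budget(8_500_000_000, step_tokens)
print(f"{step_tokens:,} tokens/step, {steps:,} steps")
```

With the values from the table this yields 262,144 tokens per step and 32,425 steps, matching `tokens_per_optimizer_step` and `max_train_steps`.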
Evaluation & Checkpointing
| Hyperparameter | Value |
|---|---|
| `eval_steps` | 1,000 |
| `best_eval_steps` | 1,000 |
| `best_eval_start_step` | 1,000 |
| `max_eval_samples` | 5,000 |
AdamW Optimizer — Detailed
- Decay group: All `nn.Linear` weight matrices → `weight_decay = 0.01`
- No-decay group: Bias terms, normalization parameters, embedding parameters → `weight_decay = 0.0`
- Betas: `(0.9, 0.95)`
- Epsilon: `1e-8`
- Fused kernel: Enabled when CUDA available (not applicable on TPU)
Learning Rate Schedule
Phase 1 — Warmup (steps 0–1,621):
LR ramps linearly from 0 → 8e-4
Phase 2 — Cosine Decay (steps 1,621–32,425):
LR decays from 8e-4 → 0 following a cosine curve
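The two phases can be written as a single function of the step index. This is a minimal sketch of a linear-warmup, cosine-decay schedule with the values listed above; it has the same general shape as `get_cosine_schedule_with_warmup` in Transformers, but the exact scheduler code used for this run is not shown in this card, so treat the boundary handling as an assumption.

```python
import math


def lr_at_step(step, peak_lr=8e-4, warmup_steps=1_621, max_steps=32_425):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Phase 1: ramp linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Phase 2: cosine decay from peak_lr to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_621, 17_023, 32_425):
    print(f"step {step:>6}: lr = {lr_at_step(step):.2e}")
```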
Precision Stability Recipe
FP32 norm patching, critical layer wrapping & recipe summary
Training on TPU v5e in BF16 requires deliberate precision management to avoid gradient instabilities at 20M scale.
1. FP32 Normalization Layers (25 modules)
All RMSNorm modules are monkey-patched to compute in FP32:
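```python
import torch

def patch_norm_to_fp32(module):
    """Wrap a norm module's forward so it computes in FP32 (helper name is illustrative)."""
    original_forward = module.forward

    def _fp32_norm_forward(hidden_states, *args, _orig=original_forward, **kwargs):
        input_dtype = hidden_states.dtype
        # Upcast the input, run the original norm, then restore the caller's dtype
        output = _orig(hidden_states.float().contiguous(), *args, **kwargs)
        if torch.is_floating_point(output):
            output = output.to(input_dtype)
        return output

    module.forward = _fp32_norm_forward
```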
Count: 12 layers × 2 norms each + 1 final norm = 25 modules total.
2. FP32 Critical Layers (2 layers)
The first and last transformer layers run their entire forward pass in FP32:
- Weights remain in their training dtype; inputs are cast to `.float()` on entry
- `torch.amp.autocast("cuda", enabled=False)` prevents re-downcasting
Rationale: Boundary layers — where embeddings project in and logits project out — are most sensitive to numerical precision. Wrapping them in FP32 provides a stable floor at minimal compute cost.
3. FP32 Attention Softmax — Skipped
Not applied. PyTorch SDPA handles softmax numerical stability internally and requires FP16/BF16 inputs for its optimized code paths on both CUDA and XLA.
Recipe Summary
| Technique | Count | Scope |
|---|---|---|
| FP32 norm modules | 25 | All RMSNorm layers |
| FP32 critical layers | 2 | First + last transformer layers |
| FP32 softmax modules | 0 | Skipped — SDPA incompatible |
Data Pipeline
Training data sources, curriculum design & preprocessing details
Training used a web/code/math curriculum with the following source mix:
| Source | Dataset | Ratio |
|---|---|---|
| Web | epfml/FineWeb-HQ (CC-MAIN-2024-51) | 75% |
| Code | Cyrile/dataset-the-stack-v2-dedup-sub (Python) | 15% |
| Math | HuggingFaceTB/finemath (finemath-4plus) | 10% |
Total tokens processed: ~8,500,000,000 (single epoch over source data, full 32,425-step run completed)
Curriculum Design
Training used a curriculum anneal over the final 15% of the token budget, upweighting code and math relative to web text. This front-loads web generalization while ensuring the model sees a higher concentration of structured/formal content near the end of training.
Text Preprocessing
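```python
import unicodedata

def clean_text(text: str, preserve_linebreaks: bool = False) -> str:
    text = unicodedata.normalize("NFKC", text)
    # Normalize Windows (\r\n) and bare \r line endings to \n
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if preserve_linebreaks:
        # Code: keep the line structure, strip trailing whitespace per line
        lines = [line.rstrip() for line in text.splitlines()]
        text = "\n".join(lines).strip()
    else:
        # Web/math: drop empty lines and collapse all whitespace to single spaces
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = " ".join(lines)
        text = " ".join(text.split())
    return text
```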
- NFKC normalization maps visually-equivalent Unicode to canonical form
- Linebreak preservation for code samples (not applicable to web/math)
- Whitespace collapse for web/math text
Sequence Packing
Samples are packed into fixed 4,096-token blocks. Labels are identical to input_ids (causal LM objective). No cross-document attention masking is applied between packed samples — this is standard practice for web-text pretraining.
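Packing of this kind can be sketched with a running buffer. The example below is an illustration under stated assumptions, not the training pipeline: the optional `eos_id` separator and the choice to drop the final partial block are hypothetical details that the description above does not specify.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def pack_sequences(
    token_streams: Iterable[List[int]],
    block_size: int,
    eos_id: Optional[int] = None,  # hypothetical: separator between documents, if any
) -> Iterator[Dict[str, List[int]]]:
    """Concatenate tokenized documents and emit fixed-length blocks.

    Labels are a copy of input_ids (causal LM objective). No attention mask
    boundaries are produced, so tokens can attend across document borders.
    The trailing partial block is dropped to keep every block the same shape.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        if eos_id is not None:
            buffer.append(eos_id)
        while len(buffer) >= block_size:
            block = buffer[:block_size]
            buffer = buffer[block_size:]
            yield {"input_ids": block, "labels": list(block)}


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_sequences(docs, block_size=4))
for block in blocks:
    print(block["input_ids"])
```

Fixed-size blocks are what make this compatible with the static-shape requirement noted for TPU training (`pack = True`).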
Weight Initialization
Initialization scheme & residual scaling code
```python
import math
import torch.nn as nn

def initialize_weights(model, std=0.02, num_hidden_layers=12):
    # GPT-2-style scaling for residual-path projections (≈ 0.00408 for 12 layers)
    residual_std = std / math.sqrt(2.0 * num_hidden_layers)
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
        elif isinstance(module, nn.Linear):
            # Scaled-down std for output projections (residual path)
            proj_std = residual_std if name.endswith(("o_proj", "down_proj")) else std
            module.weight.data.normal_(mean=0.0, std=proj_std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif "rmsnorm" in type(module).__name__.lower():
            # RMSNorm scales start at identity
            if module.weight is not None:
                module.weight.data.fill_(1.0)
```
- Residual projections (`o_proj`, `down_proj`) use scaled-down std (0.02 / sqrt(2 × 12) ≈ 0.00408) to prevent residual stream explosion at initialization, following the GPT-2 convention.
- All other Linear layers use `std=0.02`.
- RMSNorm scales start at 1.0 (identity).
Evaluation & Results
Training loss & perplexity curves, family comparison, full checkpoint history
Training Loss Curve
Validation Perplexity Curve
Final result: best validation loss 2.6406 — perplexity 14.02.
Full Checkpoint History
| Step | Eval Loss | Perplexity | Notes |
|---|---|---|---|
| 1,000 | 4.8125 | ~122.9 | First best checkpoint |
| 2,000 | 4.2812 | ~72.4 | |
| 3,000 | 4.0625 | ~58.2 | |
| 4,000 | 3.6875 | ~39.9 | |
| 5,000 | 3.3906 | ~29.7 | |
| 6,000 | 3.2812 | ~26.6 | |
| 7,000 | 3.1719 | ~23.9 | |
| 8,000 | 3.0938 | ~22.1 | |
| 10,000 | 3.0156 | ~20.4 | |
| 11,000 | 2.9375 | ~18.9 | |
| 12,000 | 2.9219 | ~18.6 | |
| 13,000 | 2.8906 | ~18.0 | |
| 14,000 | 2.8750 | ~17.7 | |
| 15,000 | 2.8594 | ~17.4 | |
| 17,000 | 2.7812 | ~16.1 | |
| 18,000 | 2.7656 | ~15.9 | |
| 20,000 | 2.7031 | ~14.9 | |
| 23,000 | 2.6875 | ~14.7 | |
| 24,000 | 2.6719 | ~14.5 | |
| 29,000 | 2.6406 | 14.02 | Best checkpoint saved |
| Final (step 32,425) | 2.6406 | 14.02 | Full run completed |
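The perplexity column is the exponential of the eval loss, assuming the loss is the mean per-token negative log-likelihood in nats (the usual cross-entropy reported by causal LM training loops). A short sketch for converting any row of the table:

```python
import math


def perplexity(eval_loss: float) -> float:
    """Perplexity from a mean per-token cross-entropy loss measured in nats."""
    return math.exp(eval_loss)


# A few eval losses from the checkpoint history
for step, loss in [(5_000, 3.3906), (10_000, 3.0156), (29_000, 2.6406)]:
    print(f"step {step:>6}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

Because the mapping is exponential, small loss improvements late in training still translate into visible perplexity gains.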
Benchmark Results
All benchmarks are 0-shot unless otherwise noted. GSM8K was run at both 0-shot and 5-shot.
Stentor3-20M Benchmarks
Common Sense Reasoning
| Benchmark | Metric | Score |
|---|---|---|
| Winogrande | acc | 47.51% |
| HellaSwag | acc_norm | 26.17% |
| CommonsenseQA | acc | 19.57% |
World Knowledge & Science
| Benchmark | Metric | Score |
|---|---|---|
| ARC-Easy | acc_norm | 33.75% |
| ARC-Challenge | acc_norm | 22.53% |
| ARC Average | acc_norm | 28.14% |
| OpenBookQA | acc_norm | 28.00% |
| TruthfulQA (MC2) | acc | 49.00% |
Multitask Knowledge (MMLU)
| Benchmark | Metric | Score |
|---|---|---|
| MMLU Average | acc | 22.95% |
| MMLU — Humanities | acc | [placeholder] |
| MMLU — Other | acc | [placeholder] |
| MMLU — Social Sciences | acc | [placeholder] |
| MMLU — STEM | acc | [placeholder] |
Mathematics
| Benchmark | Metric | Score |
|---|---|---|
| GSM8K (5-Shot) | exact_match | 0.15% |
| GSM8K (0-Shot) | exact_match | 0.76% |
Comparison against peer models, analysis & evaluation script
Comparison Against Peer Models
All models were evaluated by me under identical conditions — same script, same hardware (Kaggle 2× T4), same lm-eval-harness version, same batch size, same seed. All Stentor3-20M scores are 0-shot except GSM8K 5-shot. All peer model scores were produced under the same conditions. Portimbria-150M was re-evaluated with this script rather than reusing numbers from its model card; if any scores match the card, that means the model produced identical results under both evaluation setups.
Bold = best in column. Underline = 2nd best. Avg↑Random = mean of (score − baseline) across all 10 benchmarks. TruthfulQA has no random-chance floor, so its baseline is the lowest model score in the table (DistilGPT2, 0.4447). The GSM8K columns have no listed baseline; the reported averages are consistent with treating it as 0.
| Model | OpenBookQA (acc_norm) | TruthfulQA (acc) | Winogrande (acc) | CommonsenseQA (acc) | HellaSwag (acc_norm) | ARC-Easy (acc_norm) | ARC-Challenge (acc_norm) | MMLU (acc) | GSM8K 0-sh (flexible) | GSM8K 5-sh (flexible) | Avg↑Random |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stentor3-50M | 0.2640 | 0.5000 | 0.5028 | 0.1957 | 0.2638 | 0.2950 | 0.2406 | 0.2295 | 0.0061 | 0.0197 | 0.0123 |
| Stentor3-20M | 0.2800 | 0.4900 | 0.4751 | 0.1957 | 0.2617 | 0.3375 | 0.2253 | 0.2295 | 0.0076 | 0.0015 | 0.0109 |
| DistilGPT2 | 0.2680 | 0.4447 | 0.5043 | 0.1957 | 0.2751 | 0.3674 | 0.2287 | 0.2290 | 0.0129 | 0.0159 | 0.0147 |
| Pythia-31M | 0.2660 | 0.4695 | 0.4925 | 0.1957 | 0.2726 | 0.3426 | 0.2133 | 0.2292 | 0.0106 | 0.0121 | 0.0110 |
| Pythia-70M | 0.2520 | 0.4701 | 0.5185 | 0.1966 | 0.2741 | 0.3472 | 0.2193 | 0.2296 | 0.0061 | 0.0076 | 0.0127 |
| TinyStories-33M | 0.2480 | 0.4764 | 0.5241 | 0.1884 | 0.2716 | 0.2681 | 0.2329 | 0.2448 | 0.0083 | 0.0068 | 0.0075 |
| Portimbria-150M | 0.2460 | 0.4694 | 0.5288 | 0.1990 | 0.2745 | 0.3375 | 0.2253 | 0.2317 | 0.0061 | 0.0197 | 0.0144 |
| Random / Baseline | 0.250 | 0.4447 | 0.500 | 0.200 | 0.250 | 0.250 | 0.250 | 0.250 | — | — | — |
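The Avg↑Random column can be recomputed from the table itself. The sketch below is a minimal check, not the script used to build the table: the names are illustrative, and the GSM8K baseline of 0.0 is an assumption (the baseline row lists none for those two columns).

```python
# Baselines from the "Random / Baseline" row, in column order.
# The two GSM8K columns have no listed baseline; 0.0 is assumed here.
BASELINES = [0.250, 0.4447, 0.500, 0.200, 0.250, 0.250, 0.250, 0.250, 0.0, 0.0]

# Scores copied from the table, in the same column order.
SCORES = {
    "Stentor3-20M": [0.2800, 0.4900, 0.4751, 0.1957, 0.2617,
                     0.3375, 0.2253, 0.2295, 0.0076, 0.0015],
    "DistilGPT2":   [0.2680, 0.4447, 0.5043, 0.1957, 0.2751,
                     0.3674, 0.2287, 0.2290, 0.0129, 0.0159],
}

def avg_above_random(scores, baselines=BASELINES):
    """Mean of (score - baseline) over all benchmark columns."""
    if len(scores) != len(baselines):
        raise ValueError("scores and baselines must have the same length")
    return sum(s - b for s, b in zip(scores, baselines)) / len(scores)

for model, scores in SCORES.items():
    print(f"{model}: {avg_above_random(scores):.4f}")
```

Because the table scores are already rounded to four decimals, a recomputed average can differ from the reported one in the last digit.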
Analysis
OpenBookQA: Stentor3-20M leads at 0.2800 — highest in the table, notably beating all larger models including Stentor3-50M (0.2640) and Portimbria-150M (0.2460).
ARC-Easy: Stentor3-20M scores 0.3375, tied with Portimbria-150M for 4th, behind DistilGPT2 (0.3674), Pythia-70M (0.3472), and Pythia-31M (0.3426). Competitive for a 20M model.
TruthfulQA MC2: 0.4900, second only to Stentor3-50M (0.5000) — a strong result for this scale.
GSM8K: Near-zero across all models at this scale. Math reasoning at 20M parameters is essentially absent.
Where Stentor3-20M trails: Winogrande, HellaSwag, and GSM8K 5-shot. These are data-hungry tasks where the larger token budgets of Pythia (300B) and more general web-text coverage pull ahead. At 20M parameters and ~8.5B tokens (of which only ~6.4B are English web text), the model has less capacity and fewer training examples than the Pythia series for these pattern-completion tasks.
Overall: Stentor3-20M competes effectively with models 1.5–3× its size on several benchmarks. The Avg↑Random of 0.0109 places it near Pythia-31M (0.0110) despite having fewer parameters, fewer total training tokens, and a much smaller vocabulary.
Evaluation Setup (for Stentor3-20M)
Benchmarks were run on Kaggle with 2× Tesla T4 GPUs using the script below. Blocks are checkpointed to run_summary.json so a single failure never stops the rest, and completed blocks are skipped on re-run.
import os, sys, subprocess, json, time, re, threading, random, glob
from pathlib import Path
from datetime import datetime
# ── Environment & Networking ──────────────────────────────────────────────────
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# ── Install Deps ──────────────────────────────────────────────────────────────
print(f"\n{'='*75}", flush=True)
print(" [Setup] Installing dependencies (this may take a minute)...", flush=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "accelerate", "transformers", "hf_transfer",
                "tokenmonster", "diskcache", "litellm[caching]",
                "einops", "sentencepiece", "tiktoken"],
               check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "git+https://github.com/EleutherAI/lm-evaluation-harness.git"],
               check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
print(" ✅ Dependencies installed.", flush=True)
# ── Patch lm_eval (TokenMonster fix) ─────────────────────────────────────────
get_path_cmd = [sys.executable, "-c", "import lm_eval.models.huggingface as hf; print(hf.__file__)"]
path_result = subprocess.run(get_path_cmd, capture_output=True, text=True)
if path_result.returncode == 0:
    hf_model_file = path_result.stdout.strip()
    try:
        with open(hf_model_file, "r") as f:
            code = f.read()
        safe_code = code.replace(
            "assert len(continuation_enc) > 0",
            "continuation_enc = continuation_enc if len(continuation_enc) > 0 else [0]; assert len(continuation_enc) > 0"
        )
        with open(hf_model_file, "w") as f:
            f.write(safe_code)
        print(" ✅ Patched lm_eval safely.", flush=True)
    except Exception as e:
        print(f" ⚠️ Could not patch lm_eval: {e}", flush=True)
else:
    print(" ⚠️ Could not locate lm_eval file.", flush=True)
# ── HF Token Auth ─────────────────────────────────────────────────────────────
hf_token = os.environ.get("HF_TOKEN")
if "KAGGLE_KERNEL_RUN_TYPE" in os.environ:
    try:
        from kaggle_secrets import UserSecretsClient
        hf_token = UserSecretsClient().get_secret("HF_TOKEN")
    except Exception:
        pass
if hf_token:
    os.environ["HF_TOKEN"] = hf_token
    subprocess.run(["huggingface-cli", "login", "--token", hf_token],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print(" ✅ HF authenticated.", flush=True)
else:
    print(" ⚠️ No HF_TOKEN found.", flush=True)
# ── Config ────────────────────────────────────────────────────────────────────
MODEL = "StentorLabs/Stentor3-20M"
DTYPE = "float16"
BATCH = "32"
SEED = 42
OUT = "./results"
MODEL_ARGS = f"pretrained={MODEL},dtype={DTYPE},trust_remote_code=True"
SUMMARY_PATH = f"{OUT}/run_summary.json"
# ── Pre-download Model ────────────────────────────────────────────────────────
print(f" [Setup] Pre-downloading {MODEL}...", flush=True)
pre_dl_cmd = [
    sys.executable, "-c",
    f"from transformers import AutoConfig, AutoTokenizer; "
    f"AutoConfig.from_pretrained('{MODEL}', trust_remote_code=True); "
    f"AutoTokenizer.from_pretrained('{MODEL}', trust_remote_code=True)"
]
subprocess.run(pre_dl_cmd, capture_output=True, text=True)
print(" ✅ Model ready.", flush=True)
print(f"{'='*75}\n", flush=True)
# ── Evaluation Blocks ─────────────────────────────────────────────────────────
BLOCKS = [
    ("block1", "OpenBookQA · TruthfulQA", "openbookqa,truthfulqa_mc2", 0, None),
    ("block2", "Winogrande · CommonsenseQA", "winogrande,commonsense_qa", 0, None),
    ("block3", "HellaSwag", "hellaswag", 0, None),
    ("block4", "ARC-Easy · ARC-Challenge", "arc_easy,arc_challenge", 0, None),
    ("block5", "MMLU", "mmlu", 0, None),
    ("block6", "GSM8K (0-shot)", "gsm8k", 0, None),
    ("block7", "GSM8K (5-shot)", "gsm8k", 5, None),
]
LAUNCH_BASE = [
    "accelerate", "launch",
    "--multi_gpu",
    "--num_processes=2",
    "--mixed_precision=fp16",
]
# ── Helpers ───────────────────────────────────────────────────────────────────
ANSI_ESCAPE = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
DEBUGGER_NOISE = re.compile(r"(Debugger warning|frozen modules|PYDEVD|make the debugger|pass -X|Note: Debugging)")
log_lock = threading.Lock()
log_state = {"last_print": 0.0}
def ts():
    return datetime.now().strftime("%H:%M:%S")
def stream(proc):
    def _read(pipe):
        for raw in iter(pipe.readline, ""):
            line = ANSI_ESCAPE.sub('', raw.strip())
            if not line or DEBUGGER_NOISE.search(line):
                continue
            is_progress = any(x in line for x in ["%|", "it/s", "s/it", "MB/s", "GB/s", "Block"])
            with log_lock:
                now = time.time()
                if is_progress:
                    if now - log_state["last_print"] > 2.0:
                        print(f" [{ts()}] {line}", flush=True)
                        log_state["last_print"] = now
                else:
                    print(f" [{ts()}] {line}", flush=True)
    t1 = threading.Thread(target=_read, args=(proc.stdout,), daemon=True)
    t2 = threading.Thread(target=_read, args=(proc.stderr,), daemon=True)
    t1.start(); t2.start()
    proc.wait()
    t1.join(); t2.join()
def extract_metrics(output_dir):
    extracted = {}
    for file_path in glob.glob(f"{output_dir}/**/*.json", recursive=True):
        try:
            with open(file_path) as f:
                data = json.load(f)
            results = data.get("results", data.get("metrics", {}))
            for task, metrics in results.items():
                if isinstance(metrics, dict):
                    score = (
                        metrics.get("acc_norm") or
                        metrics.get("acc") or
                        metrics.get("exact_match") or
                        metrics.get("word_perplexity")
                    )
                    if score is not None:
                        extracted[task] = round(score, 4)
        except Exception:
            pass
    return extracted
# ── Run ───────────────────────────────────────────────────────────────────────
Path(OUT).mkdir(parents=True, exist_ok=True)
summary = {}
if os.path.exists(SUMMARY_PATH):
    try:
        summary = json.load(open(SUMMARY_PATH))
    except Exception:
        summary = {}
for i, (name, title, tasks, fewshot, extra) in enumerate(BLOCKS, 1):
    if summary.get(name, {}).get("status") == "ok":
        print(f"⏭️ Skipping {title} (already done)")
        continue
    print(f"\n{'='*75}")
    print(f" BLOCK {i}/{len(BLOCKS)} — {title}")
    print(f"{'='*75}\n")
    port = str(29500 + random.randint(1, 999))
    cmd = LAUNCH_BASE + [
        "--main_process_port", port,
        "-m", "lm_eval",
        "--model", "hf",
        "--model_args", MODEL_ARGS,
        "--batch_size", BATCH,
        "--seed", str(SEED),
        "--tasks", tasks,
        "--num_fewshot", str(fewshot),
        "--output_path", f"{OUT}/{name}_lmeval",
    ]
    t0 = time.time()
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        stream(proc)
        elapsed = round((time.time() - t0) / 60, 1)
        if proc.returncode == 0:
            metrics = extract_metrics(f"{OUT}/{name}_lmeval")
            summary[name] = {"status": "ok", "elapsed_min": elapsed, "metrics": metrics}
            print(f"✅ Done in {elapsed} min")
        else:
            summary[name] = {"status": "failed", "exit_code": proc.returncode, "elapsed_min": elapsed}
            print(f"❌ Failed ({proc.returncode})")
    except Exception as e:
        summary[name] = {"status": "failed", "error": str(e)}
    with open(SUMMARY_PATH, "w") as f:
        json.dump(summary, f, indent=2)
    time.sleep(2)
# ── Final Summary ─────────────────────────────────────────────────────────────
passed = sum(1 for v in summary.values() if v.get("status") == "ok")
print(f"\n{'='*75}")
print(f" DONE — {passed}/{len(BLOCKS)} runs succeeded")
print(f"{'='*75}")
for k, v in summary.items():
    icon = "✅" if v.get("status") == "ok" else "❌"
    print(f" {icon} {k:<15} {v.get('elapsed_min', '—')} min")
print(f"\nSummary saved → {SUMMARY_PATH}\n")
if any(v.get("status") == "failed" for v in summary.values()):
    sys.exit(1)
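One caveat when reusing `extract_metrics`: depending on the lm-evaluation-harness version, metric keys in the results JSON may carry a filter suffix (for example `acc_norm,none` or `exact_match,flexible-extract`) rather than the bare names, and chaining lookups with `or` skips a legitimate score of 0.0. The helper below is a hypothetical sketch, not part of the script or of lm_eval, that accepts both key styles:

```python
PREFERRED = ("acc_norm", "acc", "exact_match", "word_perplexity")

def pick_metric(metrics: dict):
    """Return (key, value) for the first preferred metric present, else None.

    Matches bare keys ("acc") and suffixed keys ("acc,none"), and ignores
    standard-error entries such as "acc_stderr,none".
    """
    for name in PREFERRED:
        for key, value in metrics.items():
            base = key.split(",", 1)[0]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if base == name and is_number:
                return key, round(float(value), 4)
    return None

example = {
    "alias": "gsm8k",
    "exact_match,flexible-extract": 0.0,
    "exact_match_stderr,flexible-extract": 0.0,
}
print(pick_metric(example))
```

When a task reports the same metric under several filters, the first matching key wins, so pass only the filter you want to summarize.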
Model Outputs
10 real generation samples — 4 average, 3 cherry-picked good, 3 cherry-picked bad
All generations used top_p=0.85, temperature=0.6, max_new_tokens=200, and repetition_penalty=1.25. Outputs that stop mid-sentence hit the 200-token limit; that reflects the token budget, not a flaw in the model. Samples 1–4 are average responses (the first four collected, with no cherry-picking), samples 5–7 are cherry-picked good, and samples 8–10 are cherry-picked bad.
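For anyone reproducing these samples, the stated settings map onto `generate` keyword arguments roughly as follows. This is a sketch: `do_sample=True` is an assumption (top_p and temperature only take effect when sampling is enabled), and the dictionary name is illustrative.

```python
# Sampling settings as stated above. do_sample=True is an assumption, since
# top_p and temperature are ignored by greedy decoding.
SAMPLE_GENERATION_KWARGS = {
    "do_sample": True,
    "top_p": 0.85,
    "temperature": 0.6,
    "max_new_tokens": 200,
    "repetition_penalty": 1.25,
}

# Intended use, with `model` and `inputs` prepared as in the Inference Guide:
#   output = model.generate(**inputs, **SAMPLE_GENERATION_KWARGS)
for name, value in SAMPLE_GENERATION_KWARGS.items():
    print(f"{name} = {value}")
```

Sampling is stochastic, so rerunning with these settings will not reproduce the exact text below unless the seed and library versions also match.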
Average responses
1. Prompt: "Exercise can improve strength and energy levels. Many people walk, run, or play sports every week. Regular physical activity is often connected to better health. Exercise is important for"
"making, which helps in maintaining the balance of muscle and body temperature. The use of the physical activity in different parts of the body is very effective in preventing any other diseases. The use of the physical activity in different parts of the body is very effective in the treatment of pain. The technique involves the use of the physical activity, which is essential for the maintaining the balance of muscles and bones. It can be used as a tool for controlling muscles and muscles, as it helps in maintaining the balance of muscles and bones. This can be done by using a physical activity in different parts of the body. The physical activity is simple and can be done by doing a small amount of exercise. It can be used to control muscle mass, which also helps in maintaining muscle mass. The physical activity in these parts of the body is more effective in reducing muscle mass. The physical activity in these parts of the body"
✅ Good: Stays broadly in the health/exercise domain. Produces grammatical English throughout. Correctly associates exercise with muscles, balance, and pain management.
❌ Bad: Repetitive — "the physical activity in different parts of the body" appears four times. Begins mid-sentence ("making,"). Wanders from exercise to pain treatment to muscle control without meaningful progression.
2. Prompt: "The internet is used for communication, entertainment, and education. Many students read articles and watch videos online every day. Websites provide information on many different topics. People use the internet to"
"When you read your website, you are not aware of any content that comes from textual or other sources. The Internet is used to communicate with users. Some of the most common types of Internet are: - Today, the Internet has become a popular choice for people seeking information about their web sites. This type of Internet is used by people who can help them understand what they want them to do in their domains. This is why it is useful in searching for relevant information about the internet. - Today, it can be useful for people seeking information about their web sites. - Today, it is also useful for individuals to seek information about their web sites. For example, the Internet can help you find information about your web sites, and get information about your web sites. - Today, this type of Internet is used by people who know who they are and how they want to do things. - Today, it can be useful for those who have access"
✅ Good: Recognizes the internet domain. Produces a list format with - Today, bullets, showing awareness of document structure. Grammatically clean.
❌ Bad: "Today" appears five times in a near-identical structure — a repetition pattern rather than meaningful variation. "Web sites" used repeatedly in a narrow, circular way. The opening sentence ("When you read your website") redirects the prompt oddly.
3. Prompt: "Music is listened to in many countries around the world. Different styles of music are popular with different groups of people. Songs are often used for relaxation, celebration, or entertainment. Music is popular because"
"The use of the word 'song' is a powerful tool that can help you understand and appreciate the meaning of the words. The use of the word 'song' is not just a means to add a song to your musical style. It is also an effective tool that can help you to find a song for your musical style. It also allows you to choose from various forms of music that can be used to create your music. Here are a few tips to help you visualize these words: 1. Writing songs for musical instruments: When you are writing songs for a certain time, you will be able to read it in one place. You will need to have some time to write music. 2. Have a good reading plan. 3. Have a list of all the musical instruments that you use in your music. 4. Have a list of all the musical instruments that you use in your music. 5. Have a list of all the instruments that you use in your music. 6. Have a great review. 7. Have a"
✅ Good: Stays on the music topic throughout. Transitions into a numbered how-to list, which is a coherent document structure. Points 1–3 are at least related to music.
❌ Bad: Points 3 and 4 are identical ("Have a list of all the musical instruments that you use in your music"). The opening sentence repeats "The use of the word 'song'" twice in near-identical form.
4. Prompt: "Books have been used for learning and entertainment for many years. Libraries contain information about science, history, and literature. Reading can improve vocabulary and understanding. Books are useful because"
"lines can help to identify the words in the text. A variety of sources include: - Analysis of the words in the text - Analyze the word from ancient times - Recommendate the term in a sentence by using a combination of different languages such as ALTR, ASP, or TEMP. - Experiment with various languages such as TIA, IBM, etc. - Examples of a specific language that has multiple meanings in one language. - Explore the meaning of the word. - Description of phrases in other languages. The words are also used when speaking English. - Scala or Scala: A word that is often used to refer to a particular language. The word is usually used to describe the language. - Subject: A word that is often used to describe the language. - Outside: A word that is often used to describe the language. The"
✅ Good: Produces a consistent list format. Grammatically intact.
❌ Bad: Completely off-topic — the model pivoted from books to a glossary of programming/technical language acronyms (ALTR, ASP, TIA, IBM, Scala). "Recommendate" is not a word. The prompt is about books and reading; the output has nothing to do with either.
Cherry-picked good
5. Prompt: "Healthy food provides vitamins and nutrients for the body. Fruits, vegetables, and proteins are common parts of balanced meals. Drinking enough water is also important for daily health. A healthy diet usually includes"
"A food that contains essential nutrients, such as vitamin B6, minerals, omega-3 fatty acids, and nuts, is high in fiber, vitamin C, and minerals. It is important to keep in mind that the supplement is only as simple as possible. Consuming a healthy diet with low-calorie-rich foods can help you achieve your goals. However, it is recommended to maintain proper calories in your diet for healthy weight loss. It can be convenient to enhance the overall health of your body, reducing your risk of heart disease. It is also important to keep in mind that low-calorie foods are great sources of vitamins and nutrients for healthy weight loss. To ensure that your body has regular exercise, you should make sure that you have healthy fats and nutri"
✅ Good: Correctly names real nutrients (vitamin B6, omega-3, fiber, vitamin C). Stays on the diet/health topic throughout. Produces a more coherent flow than the average samples.
❌ Bad: The contradiction "low-calorie-rich foods" is semantically broken.
6. Prompt:
import random
import numpy as np
numbers = [1, 2, 3]
array.sort_values(by="r")
for i in range(len(numbers)):
print('Number of positive integers:', number[0] + 1)
print(numbers[i] - 1)
print(numbers[i] - 1)
print()
#print(numbers)
for i in range(len(numbers)):
#print(numbers[i])
print(numbers[i] - 1)
print(numbers[i] - 1)
#print(numbers[i] - 1)
print(numbers[i] - 1)
if __name__ == '__main__':
print("Search for "+ str(numbers)) transformation
✅ Good: The model correctly continues the code prompt with Python syntax. It uses the numbers variable from the prompt, writes a valid for loop structure, uses range(len(...)), includes # comments, and correctly formats an if __name__ == '__main__': block. This is recognizable, structured Python for a model trained on ~1.2B code tokens.
❌ Bad: array.sort_values(by="r") references a variable that doesn't exist. number[0] should be numbers[0]. The inner loop is a copy of the outer loop with no meaningful difference. transformation at the end is a stray token. The code would not run correctly, but the structure is there.
7. Prompt: ";kd.ks'xsdlkfa'jmclsgdf'z.s" (intentionally nonsensical to test gibberish continuation)
"7. Use data to determine the relationship between the variables and data. 8. Problem solving. 9. Use data to identify the relationship between two variables. 10. Problem solved."
✅ Good: Rather than continuing the gibberish, the model pivoted to producing a coherent numbered list — grammatically clean and internally consistent. It is not instruction-tuned so there was no expectation it would identify the input as gibberish; the goal was coherent output regardless of input, and it delivered that. The response is short and self-contained.
❌ Bad: The numbered list starts at 7, implying a continuation of something that doesn't exist. The content (data analysis steps) has nothing to do with the input and reflects the code/data corpus bleeding through, not actual understanding of the prompt.
Cherry-picked bad
8. Prompt: "Hi!!!"
"UnitValues. The value of the data can be calculated by dividing the number of values by the number of values and the number of values for the variable. For example, the data can be calculated using the following formula: The value of each value le is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable"
✅ Good: Grammatically coherent for the first two sentences.
❌ Bad: "Hi!!!" triggers a data-analysis response with no connection to the prompt. Enters a repetition loop ("The value of the variable is given by:" repeated six times). "value le" is a broken word. This is the clearest example of how short, ambiguous prompts destabilize the model.
9. Prompt: "1 + 1 ="
"# of variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable he"
✅ Good: Recognizes a math context and responds in a math/formula register.
❌ Bad: Pure repetition loop — "The value of the variable is given by:" repeated twelve times. Cannot do arithmetic. The response has no relationship to the answer 2.
10. Prompt: "Import"
"bed., a randomly_value=True) bed., a , ❑❑❑❑❑❑❑❑❑❑ > 0.1.85, ❑❑❑❑❑ [], ❑❑❑❑❑ @nyagridcore,❑❑❑❑❑ #####################"
✅ Good: The model correctly interprets "Import" as a code token and responds with code-like output. randomly_value=True and @nyagridcore are code-shaped, even if invalid.
❌ Bad: Contains multiple ❑ (Unicode replacement character U+FFFD) sequences — the most visible occurrence of this artifact in testing. The output is not valid code or valid text. "0.1.85" is not a valid version string. "bed." is a stray token. This is the model's worst-case output: token artifacts, broken syntax, and garbage characters all at once.
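The repetition loops called out in the samples above can be quantified with a simple n-gram count. This is a minimal sketch; the function name and the 8-word window are illustrative choices, not part of any evaluation used for this card.

```python
from collections import Counter

def most_repeated_ngram(text: str, n: int = 8):
    """Return (ngram, count) for the most frequent run of n whitespace-separated words.

    Returns None when the text has fewer than n words.
    """
    words = text.split()
    if len(words) < n:
        return None
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams.most_common(1)[0]

# Toy input modeled on the loop seen in samples 8 and 9.
sample = "The value of the variable is given by: " * 5 + "The value of the variable"
phrase, count = most_repeated_ngram(sample)
print(f"{count}x: {phrase}")
```

A high count relative to the output length is a cheap signal that a generation has collapsed into a loop and that a higher repetition_penalty or a shorter max_new_tokens may be worth trying.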
Training Dynamics
Step-by-step training phase breakdown & throughput details
The training run processed approximately 8.5 billion tokens across a single epoch (epoch 0), running for 32,425 optimizer steps — the full planned run, completed without hitting the session deadline.
Early training (steps 0–1,621, warmup phase): LR ramped linearly from 0 to peak. Loss dropped quickly from above 5.0. First best checkpoint recorded at step 1,000 (eval loss 4.8125).
Mid training (steps 1,621–32,425, cosine decay phase): Smooth and consistent loss reduction. Gradient norms stayed in the 0.34–0.60 range for most steps, with two notable spikes: step 13,200 (grad_norm 6.625), which recovered cleanly in the next logged step with no apparent impact on the loss trajectory, and step 29,700 (grad_norm 13.8125), which also recovered cleanly. Across the whole run, new best checkpoints were recorded at steps 1,000 / 2,000 / 3,000 / 4,000 / 5,000 / 6,000 / 7,000 / 8,000 / 10,000 / 11,000 / 12,000 / 13,000 / 14,000 / 15,000 / 17,000 / 18,000 / 20,000 / 23,000 / 24,000 / 29,000.
Run completion: The full 32,425-step run completed within the session window. The best model (step 29,000, eval loss 2.6406) had already been saved. Final eval loss 2.6406, perplexity 14.02.
Throughput: 304,000 global tokens/sec average (38,000 per chip), with a brief XLA warmup window reset at step 600.
Total wall-clock time: ~8.15 hours (29,323s total, including eval, save, and subprocess overhead).
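The figures above can be cross-checked with a little arithmetic, and the schedule shape can be written down explicitly. The sketch below uses only numbers stated in this section; the token total is approximate, and the cosine decay reaching zero at the final step is an assumption (the actual LR floor is not given here).

```python
import math

TOTAL_STEPS = 32_425
WARMUP_STEPS = 1_621
TOTAL_TOKENS = 8.5e9            # approximate, as stated above
GLOBAL_TOK_PER_SEC = 304_000
PER_CHIP_TOK_PER_SEC = 38_000
WALL_CLOCK_SEC = 29_323

tokens_per_step = TOTAL_TOKENS / TOTAL_STEPS
chips = GLOBAL_TOK_PER_SEC / PER_CHIP_TOK_PER_SEC
warmup_fraction = WARMUP_STEPS / TOTAL_STEPS
wall_clock_hours = WALL_CLOCK_SEC / 3600
compute_only_hours = TOTAL_TOKENS / GLOBAL_TOK_PER_SEC / 3600

def lr_scale(step: int) -> float:
    """Fraction of peak LR at `step`: linear warmup, then cosine decay.

    Assumes the decay ends at zero on the final step.
    """
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

print(f"tokens per optimizer step: ~{tokens_per_step:,.0f}")
print(f"chips implied by throughput: {chips:.0f}")
print(f"warmup fraction: {warmup_fraction:.1%}")
print(f"wall clock: {wall_clock_hours:.2f} h, compute only: ~{compute_only_hours:.2f} h")
print(f"LR scale at step 13,200: {lr_scale(13_200):.3f}")
```

The gap between the compute-only estimate and the wall-clock total is what the eval, save, and subprocess overhead accounts for.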
Use Cases & Intended Uses
| Use Case | Suitability | Notes |
|---|---|---|
| Studying transformer training dynamics at 20M scale | ✅ High | Full architecture, hyperparameters, and training curves published |
| Benchmarking 4K-context inference latency | ✅ High | Realistic long-context workload |
| Quantization / conversion pipeline testing | ✅ High | Standard architecture, no custom ops |
| Teaching material for LLM courses | ✅ High | Fully documented, reproducible from scratch |
| Edge deployment experiments | ✅ High | ~41MB in FP16; extremely feasible on modern edge hardware |
| Domain-specific fine-tuning research | ✅ High | Standard transformers; fine-tune like any LLaMA model |
| Speculative decoding draft model | ✅ Medium | Fast enough to draft; uses rare TokenMonster-4k vocab so finding a vocab-matched target is difficult — cross-vocabulary draft is possible but lowers acceptance rate |
| Text continuation / creative writing | ✅ Medium | Good fluency; limited thematic fidelity |
| Code completion prototyping | ❌ Not suitable | Generates code-shaped output but it doesn't work — see Honest Notices |
| Factual Q&A | ❌ Not suitable | Unreliable world knowledge at this scale |
| Production deployment | ❌ Not suitable | No safety tuning |
| Non-English text | ❌ Not suitable | Training data is English-heavy |
| Instruction following | ❌ Not suitable | Base model only |
Out-of-Scope Uses
Any user-facing application — No safety filtering, no alignment, no factual reliability.
Medical, legal, or financial advice — Cannot reason reliably over specialized knowledge.
Generating content about real people — Will fabricate.
Automated content pipelines — Output quality is insufficient for unreviewed publication.
Instruction following — This is a base next-token predictor.
Ethical Considerations & Societal Impact
Data biases, safety considerations & societal impact
Inherited Data Biases
Trained on FineWeb-HQ, The Stack v2 dedup subset, and FineMath-4+ — all derived from web-scraped data. The model inherits:
Western-centric perspective — English-language web text skews toward Western viewpoints and cultural contexts.
English monolingualism — TokenMonster english-4096 is optimized for English. Other languages will produce high fertility and poor quality.
Demographic underrepresentation — Groups underrepresented in English web text will be underrepresented in outputs.
Code ecosystem bias — The code training data covers Python only (Cyrile/dataset-the-stack-v2-dedup-sub). Other languages are not represented.
No Safety Tuning
No RLHF, DPO, constitutional AI, or content filtering of any kind has been applied.
Positive Aspects
Democratizing AI research — Trained entirely on free Kaggle TPU compute.
Full transparency — Complete training hyperparameters, architecture, and training curves published.
Minimal environmental footprint — ~8 hours of TPU compute is negligible versus large-scale pretraining runs.
Inference Guide
CPU inference (INT8) & GPU inference (FP16) code
CPU Inference (INT8 Dynamic Quantization)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor3-20M")
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
# Dynamically quantize for CPU
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(),
    {torch.nn.Linear},
    dtype=torch.qint8,
)
inputs = tokenizer("The laws of physics state that", return_tensors="pt")
with torch.inference_mode():
    output = model_int8.generate(**inputs, max_new_tokens=80, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
GPU Inference (FP16)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
def generate(prompt, max_new_tokens=100, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(generate("Once upon a time in a distant kingdom"))
🚀 Free Inference — Try It Now
No GPU, no setup, no API key required.
StentorLabs hosts a free demo space for all Stentor models:
🔗 https://huggingface.co/spaces/StentorLabs/StentorLabs-demo_space
1. Core Functional Features
⚡ Generate Tab The primary interface for standard text completion.
- Model Selection: Allows switching between the flagship Portimbria-150M, various Stentor3 and Stentor2 versions, and base Stentor models.
- Generation Presets: Includes quick-toggle chips (Creative, Balanced, Focused) that automatically adjust Temperature, Top-P, and Max Tokens.
- Multi-Response Mode: Users can generate up to 5 sequential completions for the same prompt to observe the variance in the model's logic.
- Live Metrics: Displays real-time stats including total tokens generated, elapsed time, and speed in tokens per second (t/s).
- Persistent History: Maintains a visual log of recent generations with timestamps and parameters.
🔬 Token Explorer Tab A unique diagnostic feature for understanding model internal confidence.
- Probability Heatmap: Generates text and color-codes each token based on the model's confidence level (e.g., Green for ≥80%, Red for <35%).
- Alternative Inspection: Hovering over any generated token reveals the top 8 alternatives the model was considering at that specific moment, along with their respective probabilities.
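As an illustration of what the Token Explorer displays, the sketch below turns a vector of next-token logits into the top-8 alternatives and a confidence band. It is not the demo space's actual code: the function names are hypothetical, the 80% and 35% thresholds come from the description above, and the color of the middle band is not specified there.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_alternatives(logits, k=8):
    """Return up to k (token_id, probability) pairs, most likely first."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def confidence_band(prob):
    """Map a token probability to the heatmap band described above."""
    if prob >= 0.80:
        return "green"
    if prob < 0.35:
        return "red"
    return "unspecified"  # middle band; its color is not given in the description

toy_logits = [2.0, 0.5, 0.1, -1.0, -3.0]  # made-up values for a 5-token vocabulary
for token_id, prob in top_alternatives(toy_logits, k=3):
    print(token_id, f"{prob:.3f}", confidence_band(prob))
```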
🌡 Temperature Sweep Tab A visualization tool to study how "creativity" parameters affect output.
- Parallel Generation: Runs the same prompt across 2 to 3 different temperature settings simultaneously.
- Divergence Analysis: Allows users to see exactly where a model stays rigid (low temp) vs. where it becomes chaotic or creative (high temp).
🏟 Arena Tab A competitive benchmarking environment for comparing Stentor models against popular baselines.
- Match Modes:
- Show: Open comparison where model names are visible.
- Blind: Identity is hidden until a vote is cast to prevent brand bias.
- Random: Pairs a random Stentor model against a baseline (like Pythia, OPT, or GPT-2).
- Fair Matchmaking: Implements logic to pair models of similar parameter counts (e.g., 20M vs 31M) to ensure a balanced comparison.
- Voting System: Users can vote for the better response, a tie, or "both bad," which reveals the hidden identities.
💬 Chat Tab An interface for multi-turn dialogue.
- Custom Prompting: Uses a tag-based formatting style to help the base models maintain conversational context despite not being explicitly instruction-tuned in many cases.
- Session State: Uses gr.State to track message history across the conversation.
Supported Models
The app maintains a library of:
- Internal Models: Portimbria-150M, Stentor3 (50M/20M), Stentor2 (30M/12M), and original Stentor (30M/12M + Instruct).
- Arena Baselines: DistilGPT2, Pythia (14M through 160M), gpt2 small, SmolLM2-135M, OPT-125M, and GPT-Neo.
Quantization
FP16, BF16 & 4-bit (bitsandbytes) quantization code
FP16 (GPU)
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)
BF16
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.bfloat16,
)
4-bit (bitsandbytes)
pip install bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 4-bit weights, with matrix multiplications computed in FP16
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    quantization_config=bnb_config,
    device_map="auto",
)
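For a rough sense of what each format saves, weight storage scales with bytes per parameter. The sketch below is a weights-only estimate; the parameter count is an assumption inferred from the "~41MB in FP16" figure in the use-case table (41 MB divided by 2 bytes per parameter), not an official count, and the names are illustrative.

```python
# Assumed parameter count, inferred from "~41MB in FP16" (41 MB / 2 bytes).
ASSUMED_PARAMS = 20_500_000

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weight_megabytes(params: int, fmt: str) -> float:
    """Weights-only size in MB (10^6 bytes) for a given storage format."""
    return params * BYTES_PER_PARAM[fmt] / 1e6

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>5}: ~{weight_megabytes(ASSUMED_PARAMS, fmt):.1f} MB")
```

Treat the INT8 and 4-bit figures as lower bounds: quantization schemes typically leave some tensors (such as embeddings) in higher precision and add scale metadata, and activations and the KV cache are not counted.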
🌍 Community Contributions — Build on This Model
Stentor3-20M is built by an independent solo researcher, not a large corporate AI lab. That means it doesn't have teams of engineers running downstream experiments — that's where you come in. This model is Apache 2.0 licensed and is explicitly intended to be modified, extended, and redistributed.
Here are things StentorLabs actively encourages the community to try:
- Fine-tune it on your domain: instruction tuning, domain adaptation, RLHF, DPO, anything goes
- Quantize it: 4-bit, 8-bit, GGUF, GPTQ, AWQ, ONNX, all highly encouraged
- Convert it to other formats: GGUF for llama.cpp, ONNX for deployment, CoreML for Apple Silicon
- Run LoRA or QLoRA to adapt it cheaply on consumer hardware
- Use it as a draft model for speculative decoding with a larger Llama-architecture target that shares its TokenMonster vocabulary
- Benchmark it on tasks not covered here, or re-run the evaluation script in a Kaggle 2×T4 environment to reproduce and verify the results
- Publish your work: fine-tunes, quantized versions, adapters, research findings, derivative models, anything
If you build something with Stentor3-20M, please share it on HuggingFace and tag or link back to the base model. Every community result makes this model more useful for everyone.
LoRA / QLoRA Starter Configuration
Starter config, recommended hyperparameters & QLoRA note
If you haven't fine-tuned a Llama-family model before, here's a reasonable starting point for Stentor3-20M:
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor3-20M")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # LoRA rank; try 32 if underfitting
    lora_alpha=32,   # alpha = 2× rank is a reliable default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → prints trainable params, all params, and trainable % (all params should be close to 20M for this model)
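The number of trainable parameters reported by `print_trainable_parameters()` can be estimated by hand: each adapted linear layer of shape (d_in, d_out) gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below shows the arithmetic. The layer dimensions and layer count in the example are purely hypothetical and are NOT the real Stentor3-20M shapes; read the actual values from the model's `config.json` before relying on the result.

```python
def lora_trainable_params(rank, module_shapes, num_layers):
    """Count LoRA parameters: A is (rank x d_in), B is (d_out x rank) per adapted module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Hypothetical dimensions for illustration only (with GQA, k_proj and v_proj
# project to a smaller dimension than q_proj and o_proj).
example_shapes = {
    "q_proj": (256, 256),
    "k_proj": (256, 128),
    "v_proj": (256, 128),
    "o_proj": (256, 256),
}

print(lora_trainable_params(rank=16, module_shapes=example_shapes, num_layers=8))
```

Doubling the rank doubles the adapter size, which is a quick way to judge how much capacity a change from `r=16` to `r=32` adds.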
Recommended fine-tuning hyperparameters:
| Hyperparameter | Value | Notes |
|---|---|---|
| Learning rate | 3e-5 – 1e-4 | Small models are sensitive to high LRs during SFT |
| Optimizer | AdamW | betas=(0.9, 0.999), eps=1e-8 |
| LR scheduler | Cosine with warmup | ~5% warmup steps |
| Batch size | 8–32 | Per device; scale with gradient accumulation |
| Epochs | 2–5 | Watch for overfitting after epoch 2 |
| Max sequence length | 512–2048 | Up to 4096 is supported |
For QLoRA (4-bit quantized base + LoRA adapters on top), pass quantization_config=BitsAndBytesConfig(load_in_4bit=True) to from_pretrained when loading the base model. The LoRA config and training hyperparameters above apply unchanged. This lets you fine-tune on a single consumer GPU with ~4–6 GB VRAM.
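To make the "cosine with warmup" row of the table above concrete, here is a small, dependency-free sketch of the schedule: a linear ramp over the first ~5% of steps, then a half-cosine decay to zero. The function name `lr_at_step` and the default peak of 1e-4 (the top of the recommended range) are choices made for this example. Training libraries ship their own implementations, which may differ in details such as a minimum learning rate.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.05):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 25, 50, 525, 1000):
        print(s, f"{lr_at_step(s, total_steps=1000):.2e}")
```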
Format Conversion
Convert to GGUF (llama.cpp) & ONNX
Convert to GGUF (llama.cpp)
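# Get llama.cpp and the Python dependencies for the conversion script
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt
# Build the CLI tools (recent llama.cpp versions build with CMake; adjust paths if your binaries land elsewhere)
cmake -B build && cmake --build build --config Release
# Download the checkpoint and convert it to an FP16 GGUF file
huggingface-cli download StentorLabs/Stentor3-20M --local-dir stentor3-20m
python convert_hf_to_gguf.py stentor3-20m/ \
    --outfile stentor3-20m.gguf \
    --outtype f16
# Quantize to Q4_K_M, then run a quick generation test
./build/bin/llama-quantize stentor3-20m.gguf stentor3-20m-q4_k_m.gguf q4_k_m
./build/bin/llama-cli -m stentor3-20m-q4_k_m.gguf -p "The history of computing" -n 100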
Convert to ONNX
pip install optimum[exporters]
optimum-cli export onnx \
    --model StentorLabs/Stentor3-20M \
    --task text-generation-with-past \
    stentor3-20m-onnx/
Speculative Decoding
Stentor3-20M is a compact draft model that is fast and small enough to meaningfully accelerate a larger target model's inference. The caveat is its tokenizer: TokenMonster-4k is not used by any mainstream LLM, so it is not vocabulary-compatible with Llama/Mistral/GPT-family models (which use 32K+ BPE vocabularies). Native speculative decoding (where the draft and target share token IDs) requires a vocabulary-compatible target.
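To show why shared token IDs matter, here is a toy sketch of the greedy form of speculative decoding. The draft proposes `k` tokens, the target checks them, and the longest matching prefix is accepted. Everything here is illustrative: the "models" are plain Python functions over integer token IDs, and the function name `speculative_greedy` is an assumption. This is neither the rejection-sampling algorithm from Leviathan et al. nor the Hugging Face implementation.

```python
def speculative_greedy(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding over token-ID lists.

    target_next / draft_next: functions mapping a token list to the next token ID.
    Returns (tokens, accepted_draft_count).
    """
    tokens = list(prompt)
    accepted = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks each position. A real implementation scores all
        #    k positions in a single forward pass, which is where the speedup comes from.
        for i, tok in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if tok != expected:
                # First mismatch: keep the agreed prefix plus the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                accepted += i
                break
        else:
            # Every draft token matched; the target adds one extra token for free.
            tokens.extend(proposal)
            accepted += k
            tokens.append(target_next(tokens))
    return tokens[: len(prompt) + max_new_tokens], accepted


# Toy models: the target counts upward mod 10; the draft agrees except after token 3.
toy_target = lambda t: (t[-1] + 1) % 10
toy_draft = lambda t: 0 if t[-1] == 3 else (t[-1] + 1) % 10

if __name__ == "__main__":
    print(speculative_greedy(toy_target, toy_draft, [0], max_new_tokens=12))
```

The comparison `tok != expected` is only meaningful when both models assign the same ID to the same token, which is the vocabulary-compatibility requirement described above. The output is identical to what the target would produce alone; the draft affects only speed.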
Speculative decoding options & vocabulary compatibility notes
Option A — Use with a vocabulary-matched target (recommended)
For maximum acceptance rates, you need a target model that either uses the same TokenMonster 4K vocab or is fine-tuned from Stentor3-20M itself (e.g. an instruction-tuned version). When both models share token IDs, HuggingFace assisted generation works directly:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
).to("cuda")

# Target must share the same TokenMonster 4K vocabulary
target_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/YourVocabCompatibleModel",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)

# Place the inputs on whichever device the target model was loaded onto
inputs = tokenizer("Explain the concept of recursion:", return_tensors="pt").to(target_model.device)
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # the draft proposes tokens, the target verifies them
    do_sample=True,
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Option B — Cross-vocabulary draft (lower acceptance rate)
You can still use Stentor3-20M as a fast heuristic draft model alongside a BPE-vocabulary target (e.g. Llama-3.2-1B) using custom speculative decoding logic, but HuggingFace's built-in assistant_model will not work out of the box because token IDs will not align. This requires a vocabulary remapping layer or a custom decoding loop and is an active research area.
Vocabulary compatibility summary: Stentor3-20M uses a compact 4,096-token TokenMonster vocabulary. This is fundamentally different from the much larger BPE vocabularies (32K tokens or more) used by Llama, Mistral, GPT-2, and most other public models. Built-in HuggingFace speculative decoding only works between models sharing an identical token-to-ID mapping.
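Before attempting assisted generation, it can help to check the "identical token-to-ID mapping" condition directly. The sketch below compares two vocabularies given as plain `{token: id}` dictionaries. The helper names are hypothetical, and the toy vocabularies exist only for the demonstration.

```python
def vocab_mismatches(draft_vocab, target_vocab, limit=5):
    """Return up to `limit` (token, draft_id, target_id) entries that disagree.

    A token missing from one vocabulary shows up with None as its ID.
    """
    mismatches = []
    for token in sorted(set(draft_vocab) | set(target_vocab)):
        draft_id, target_id = draft_vocab.get(token), target_vocab.get(token)
        if draft_id != target_id:
            mismatches.append((token, draft_id, target_id))
            if len(mismatches) >= limit:
                break
    return mismatches


def is_vocab_compatible(draft_vocab, target_vocab):
    """True only if both vocabularies map exactly the same tokens to the same IDs."""
    return not vocab_mismatches(draft_vocab, target_vocab, limit=1)


if __name__ == "__main__":
    toy_draft = {"the": 0, "cat": 1, "sat": 2}
    toy_target = {"the": 0, "cat": 2, "sat": 1, "mat": 3}
    print(is_vocab_compatible(toy_draft, toy_target))
    print(vocab_mismatches(toy_draft, toy_target))
```

With Transformers tokenizers, the dictionaries could be obtained from `tokenizer.get_vocab()` on each side, assuming both tokenizers load through `AutoTokenizer`.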
Related Work
Models benchmarked alongside Stentor3-20M & StentorLabs model generations
Comparable Models (Benchmarked Alongside This Model)
| Model | Parameters | Vocab | Context | Notes |
|---|---|---|---|---|
| Stentor3-20M (this model) | 20M | 4K TokenMonster | 4,096 | ~8.5B tokens, TPU v5e-8 |
| Stentor3-50M | 49M | 4K TokenMonster | 4,096 | Sibling model in Gen 3 lineage |
| DistilGPT2 | 82M | 50K BPE | 1,024 | Distilled GPT-2; 40GB WebText |
| Pythia-70M | 70M | 50K BPE | 2,048 | EleutherAI; 300B Pile tokens |
| Pythia-31M | 31M | 50K BPE | 2,048 | EleutherAI; 300B Pile tokens |
| TinyStories-33M | 33M | 50K BPE | 1,024 | Trained on synthetic stories; ~9.3B tokens |
| Portimbria-150M | 151M | 32K BPE | 4,096 | StentorLabs flagship |
| Stentor2-30M (Gen 2) | 30M | 8K TokenMonster | 1,024 | Second generation StentorLabs |
| Stentor2-12M (Gen 2) | 12M | 8K TokenMonster | 1,024 | Second generation StentorLabs |
Related Research Papers
| Paper | Relevance |
|---|---|
| Scaling Laws — Kaplan et al., 2020 | Informs token budget decisions |
| Chinchilla — Hoffmann et al., 2022 | 8.5B tokens for 20M params is ~425 tokens per parameter, roughly 21× the ~20 tokens/param Chinchilla-optimal ratio |
| GQA — Ainslie et al., 2023 | Grouped Query Attention used in this model |
| RoPE — Su et al., 2021 | Positional encoding |
| LLaMA — Touvron et al., 2023 | Architecture basis |
| Pythia — Biderman et al., 2023 | Comparable small-model scaling study |
| Speculative Decoding — Leviathan et al., 2023 | Primary deployment use case |
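The token-budget comparison in the Chinchilla row is simple arithmetic, reproduced here as a quick check. The figure of about 20 tokens per parameter is the commonly quoted rule of thumb derived from Hoffmann et al., and both the token and parameter counts are the rounded values reported in this card.

```python
TRAINING_TOKENS = 8.5e9           # ~8.5B tokens, as reported in this card
PARAMETERS = 20e6                 # nominal 20M parameters
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb ratio, an approximation

tokens_per_param = TRAINING_TOKENS / PARAMETERS
multiple_of_chinchilla = tokens_per_param / CHINCHILLA_TOKENS_PER_PARAM

print(f"Tokens per parameter: {tokens_per_param:.0f}")
print(f"Multiple of the Chinchilla ratio: {multiple_of_chinchilla:.1f}x")
```

Training far past the compute-optimal ratio is a deliberate trade for small models: it spends extra training compute to get a better model at a fixed inference cost.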
Environmental Impact
Hardware, duration & estimated carbon
| Factor | Value |
|---|---|
| Hardware | Google Cloud TPU v5e-8 |
| Active Training Duration | ~8.15 hours (29,323s total) |
| Cloud Provider | Google (via Kaggle free tier) |
| Compute Region | United States |
| Estimated Carbon | Minimal (< 1.0 kg CO₂e estimated) |
The TPU v5e is substantially more energy-efficient per FLOP than comparable GPU hardware. Running on Kaggle's free tier also means no dedicated data center allocation beyond what Kaggle already operates.
Citation
BibTeX
@misc{izumoto2026stentor3120m,
title = {Stentor3-20M},
author = {Kai Izumoto},
year = {2026},
publisher = {StentorLabs},
howpublished = {\url{https://huggingface.co/StentorLabs/Stentor3-20M}},
note = {~20M parameter LlamaForCausalLM base model with GQA trained from scratch
on ~8.5B tokens (FineWeb-HQ, Cyrile/dataset-the-stack-v2-dedup-sub, FineMath-4+)
using a Google Cloud TPU v5e-8 on Kaggle free compute. 4096-token context,
TokenMonster-4k vocabulary (non-tied embeddings). Apache 2.0 license.
Architecture informed by The Depth Delusion (Wu et al., 2026) and
The Depth Myth (Kai Izumoto, 2026).}
}
Model Card Contact
Questions, benchmarks, or feedback: StentorLabs@gmail.com or open a discussion.
Made with ❤️ by StentorLabs
Democratizing AI through accessible, efficient models — trained on free compute, shared with everyone.
Datasets used to train StentorLabs/Stentor3-20M
HuggingFaceTB/finemath
Cyrile/dataset-the-stack-v2-dedup-sub
Papers for StentorLabs/Stentor3-20M
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
LLaMA: Open and Efficient Foundation Language Models
Fast Inference from Transformers via Speculative Decoding
Training Compute-Optimal Large Language Models
Evaluation results
- Best Validation Loss on FineWeb-HQ (validation split): 2.641 (self-reported)
- Best Perplexity on FineWeb-HQ (validation split): 14.020 (self-reported)
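The two figures above are related: perplexity is the exponential of the mean per-token cross-entropy loss. The check below assumes the reported loss is measured in nats per token (the usual convention for cross-entropy in PyTorch-style training loops). The card does not state this explicitly.

```python
import math

reported_loss = 2.641         # best validation loss from this card (assumed nats per token)
reported_perplexity = 14.020  # best perplexity from this card

implied_perplexity = math.exp(reported_loss)
print(f"exp({reported_loss}) = {implied_perplexity:.3f} (card reports {reported_perplexity})")
```

The implied value agrees with the reported perplexity to within a few hundredths, which is about what rounding the loss to three decimals allows.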

