Instructions to use StentorLabs/Stentor3-20M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use StentorLabs/Stentor3-20M with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="StentorLabs/Stentor3-20M")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor3-20M")
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor3-20M")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use StentorLabs/Stentor3-20M with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "StentorLabs/Stentor3-20M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StentorLabs/Stentor3-20M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/StentorLabs/Stentor3-20M

SGLang

How to use StentorLabs/Stentor3-20M with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "StentorLabs/Stentor3-20M" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StentorLabs/Stentor3-20M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "StentorLabs/Stentor3-20M" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StentorLabs/Stentor3-20M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use StentorLabs/Stentor3-20M with Docker Model Runner:
```
docker model run hf.co/StentorLabs/Stentor3-20M
```

Stentor3-20M

🔬 Research Artifact & Base Language Model. Stentor3-20M is a next-token predictor — not a chat assistant. It has no safety tuning and should not be deployed in user-facing applications without fine-tuning first. It is, however, a high-quality open foundation: fine-tune it, quantize it, convert it, distill from it, run LoRA on it, adapt it to your domain, or build anything else you can imagine — and please publish your results! See Intended Uses for details.

💡 Built by a solo 14-year-old developer, on a laptop, for $0. Every model StentorLabs has released — including this one — was conceived, designed, coded, and trained without a budget, a team, a GPU cluster, or institutional support. The total cost of producing Stentor3-20M was zero dollars, using free Kaggle TPU quota and publicly available datasets. This is what democratized AI research looks like.

What Is This?

Stentor3-20M is the sibling model to Stentor3-50M in the Stentor3 family — a new generation of StentorLabs models trained on a full web/code/math curriculum with a 4096-token context window.

At ~20M parameters, Stentor3-20M is a base causal language model trained entirely from scratch on free-tier Kaggle compute using a Google Cloud TPU v5e-8 (eight chips). It was trained for a full 32,425 steps on approximately 8.5 billion tokens drawn from a web/code/math curriculum, completing epoch 0 in ~8.1 hours.

Like all StentorLabs models, this is a base next-token predictor, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best suited for research, prototyping, speculative decoding, and infrastructure experiments.

The key architectural features: Grouped Query Attention (10 query heads, 2 KV heads — 5:1 ratio), a compact TokenMonster vocabulary (4,096 tokens), and a 4096-token context window.

GQA at this scale worked cleanly: 10 query heads with 2 KV heads delivered an 80% KV cache reduction at inference time with no training instability observed.

The Stentor3 Name

Why "Stentor3"?

Stentor is a genus of large, trumpet-shaped single-celled organisms — simple in structure, surprisingly capable for their scale. These organisms are among the largest known unicellular life forms, reaching up to 2mm in length, yet they operate as a single cell with no specialized organs — a fitting metaphor for a compact language model that punches far above its weight class. Like the stentor, which can regenerate from fragments and adapt to changing environments, this model is designed to be adaptable and resilient, serving as a foundation that can be fine-tuned, quantized, distilled, and adapted across countless downstream tasks. The name has anchored nearly every StentorLabs model since the beginning: compact, functional, and punching above its weight.

📋 Table of Contents

What Is This?
The Stentor3 Name
Model Architecture
Architectural Foundations: The Depth Delusion & Width vs. Depth Tradeoffs
Head-to-Head: StentorLabs Model Family
Memory Requirements
Quick Start
Important Limitations
Honest Notices
Training Infrastructure
Training Hyperparameters — Complete Reference
Precision Stability Recipe
Data Pipeline
Weight Initialization
Evaluation & Results
Benchmark Results
Model Outputs
Training Dynamics
Use Cases & Intended Uses
Out-of-Scope Uses
Ethical Considerations & Societal Impact
Inference Guide
Free Inference — Try It Now
Quantization
Community Contributions
Format Conversion
Speculative Decoding
Related Work
Environmental Impact
Citation

Model Architecture

Stentor3-20M is a LlamaForCausalLM model with Grouped Query Attention (GQA), a 4,096-token TokenMonster vocabulary, and a 4096-token context window.

Component	Value	Notes
Architecture	`LlamaForCausalLM`	Standard transformer decoder
Hidden Size	320
Intermediate Size (FFN)	1,280	SwiGLU activation
Num Hidden Layers	12
Num Attention Heads	10
Num Key/Value Heads	2	GQA — 5:1 query-to-KV ratio
Context Length	4,096 tokens
Head Dimension	32
Vocab Size	4,096	TokenMonster english-4096-strict-nocapcode-v1
Total Parameters	20,324,160
Positional Encoding	RoPE	`rope_theta = 50,000.0`
Activation	SiLU	SwiGLU implementation
Tied Embeddings	False	Separate input/output embeddings
Implementation	SDPA	PyTorch Scaled Dot Product Attention

GQA explanation & parameter count breakdown

Why GQA?

Grouped Query Attention (10Q, 2KV) reduces the KV cache memory footprint by 80% at inference time compared to standard Multi-Head Attention at the same hidden size. At a 4096-token context window this matters substantially: the KV cache for a single sequence is proportional to 2 × num_kv_heads × head_dim × num_layers × seq_len. With 2 KV heads instead of 10, the cache shrinks to one-fifth of its full-MHA equivalent, enabling longer generation on memory-constrained hardware.

GQA at this scale worked cleanly — the training was stable from start to finish. However, I should note that this was not the case for every Stentor model. During development of Stentor2-12M-Preview, GQA (at that smaller 12M scale) was a major contributor to training instability. For the Stentor2 series I switched to full MHA because the KV-head reductions kept collapsing. This experience shaped how I approached GQA for Stentor3: at 20M and especially at 50M (over 2× larger), the stability margin is substantially wider. GQA was a major factor in the 12M instability, but there were other contributing factors (training recipe, hardware precision handling) that compounded the problem. By Stentor3, the new TPU-based training regime provided stable training end-to-end, and GQA worked without issue.

Why TokenMonster-4k?

Two deliberate reasons — both unconventional for a model this size.

Reason 1 — Non-tied embeddings become affordable.

Most small language models use tied (shared) input and output embedding weights. The original motivation comes from Press & Wolf (2016): sharing the two matrices reduces parameter count significantly, acts as a form of regularization, and was shown to improve perplexity in early recurrent models. With a 32K vocabulary and this model's hidden size of 320, the two matrices together would consume ~21M parameters (2 × 32,768 × 320) on top of a ~17.7M transformer backbone, more than doubling the model. Tying halves that cost, so at large vocabularies tying is practically necessary for small models.

With a 4K vocabulary the math changes completely. Each embedding matrix is only ~1.3M parameters. Keeping them separate adds just 1.3M parameters — 6.4% of Stentor3-20M's total. That's an affordable price to let the two matrices specialize.
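The embedding arithmetic above can be checked in a few lines. This is a minimal sketch, not part of the training code: the helper name `embedding_params` is hypothetical, the hidden size (320) and total parameter count (20,324,160) are taken from the architecture table, and 32,768 is assumed as the size of a "32K" vocabulary.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Parameters spent on the input embedding plus the output projection."""
    one_matrix = vocab_size * hidden_size
    # A tied model stores one shared matrix; an untied model stores two.
    return one_matrix if tied else 2 * one_matrix


HIDDEN_SIZE = 320          # from the architecture table
TOTAL_PARAMS = 20_324_160  # from the architecture table

untied_4k = embedding_params(4_096, HIDDEN_SIZE, tied=False)
tied_4k = embedding_params(4_096, HIDDEN_SIZE, tied=True)
untied_32k = embedding_params(32_768, HIDDEN_SIZE, tied=False)  # assumed 32K size

extra_cost = untied_4k - tied_4k
extra_share = extra_cost / TOTAL_PARAMS

print(f"Extra cost of untying at 4K vocab: {extra_cost:,} params ({extra_share:.1%})")
print(f"Untied embeddings at 32K vocab:    {untied_32k:,} params")
```

The same function makes it easy to try other vocabulary sizes or hidden sizes before committing to a tokenizer.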

Why does that specialization matter? The input embedding is asked to encode a token into a vector that represents its meaning in context. The output projection is asked to decode a hidden state into which token should come next. These are related but genuinely distinct tasks. A 2026 paper (Weight Tying Biases Token Embeddings Towards the Output Space, Lopardo et al., arXiv 2603.26663) provides mechanistic evidence that tied weights skew the shared matrix toward output prediction at the cost of input representation quality — and that this trade-off hurts more as model scale increases. The Qwen3 family explicitly ties embeddings for its smaller models and unties them for larger ones, acknowledging that the calculus shifts with scale.

My opinion, not tested: At 20M parameters with a 4K vocab, the untied design likely gives the model better input representations without meaningfully harming output quality. The 12.9% embedding share is high compared to 70B-parameter models (where vocab params are a rounding error), but it is tiny relative to typical models at this size — most comparable models spend 16–33% of their parameter budget on embeddings at 32K vocab. The extra 1.3M parameters are well spent. This is a design preference I believe in; I have not run an ablation to prove it.

Reason 2 — Each token type gets 8× more training signal.

With 8.5B training tokens spread over 4,096 token types, each type appears on average ~2.1 million times during training. With a 32K (32,768-entry) vocabulary, the same 8.5B tokens would give each type only ~259,000 training examples on average, an 8× reduction. Achieving the same per-type coverage with a 32K vocabulary would require approximately 68B tokens, which is far beyond what Kaggle's weekly free TPU quota allows.
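As a rough check of the per-type averages, the sketch below divides the token budget by the vocabulary size. It assumes a 32,768-entry vocabulary for the "32K" case and deliberately ignores two real effects: token frequencies are far from uniform, and a larger vocabulary would produce fewer tokens from the same text. The function name is hypothetical.

```python
TRAINING_TOKENS = 8_500_000_000  # token budget from the training tables


def mean_occurrences_per_type(total_tokens: int, vocab_size: int) -> float:
    """Average number of times each vocabulary entry is seen (uniform assumption)."""
    return total_tokens / vocab_size


per_type_4k = mean_occurrences_per_type(TRAINING_TOKENS, 4_096)
per_type_32k = mean_occurrences_per_type(TRAINING_TOKENS, 32_768)  # assumed 32K size

signal_ratio = per_type_4k / per_type_32k
# Tokens a 32K-vocab model would need to match the 4K per-type average.
tokens_needed_32k = TRAINING_TOKENS * 32_768 // 4_096

print(f"4K vocab:  {per_type_4k:,.0f} occurrences per type")
print(f"32K vocab: {per_type_32k:,.0f} occurrences per type")
print(f"Ratio: {signal_ratio:.1f}x, tokens needed at 32K: {tokens_needed_32k:,}")
```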

The acknowledged downside: a 4K vocabulary tokenizes less efficiently than a 32K one — each document produces longer token sequences, which increases context usage and compute per sample. I consider this an acceptable trade-off at this scale. The training signal density gain outweighs the efficiency cost for a model training on free compute with a hard token budget.

Parameter Count Breakdown

def estimate_llama_params_gqa(vocab_size, hidden_size, intermediate_size,
                              num_hidden_layers, num_attention_heads, num_key_value_heads):
    kv_dim = int(hidden_size * num_key_value_heads / num_attention_heads)
    q_proj = hidden_size * hidden_size
    k_proj = hidden_size * kv_dim
    v_proj = hidden_size * kv_dim
    o_proj = hidden_size * hidden_size
    attn = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * hidden_size * intermediate_size # gate, up, down
    norm = 2 * hidden_size # input + post-attention RMSNorm
    total = 2 * vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total

Plugging in Stentor3-20M values:

kv_dim = 320 × (2/10) = 64
q_proj = 320 × 320 = 102,400
k_proj = 320 × 64  = 20,480
v_proj = 320 × 64  = 20,480
o_proj = 320 × 320 = 102,400
attn/layer = 245,760
mlp/layer = 3 × 320 × 1,280 = 1,228,800
norm/layer = 2 × 320 = 640
per_layer = 1,475,200
embedding = 4,096 × 320 = 1,310,720
lm_head = 4,096 × 320 = 1,310,720  (untied)
layers = 12 × 1,475,200 = 17,702,400
final_norm = 320
total = 1,310,720 + 1,310,720 + 17,702,400 + 320 = 20,324,160 ✓

Component	Parameters	% of Total
Embedding Table	1,310,720	6.4%
LM Head (untied)	1,310,720	6.4%
Transformer Layers × 12	17,702,400	87.1%
— Attention (per layer × 12)	2,949,120	14.5%
— FFN/MLP (per layer × 12)	14,745,600	72.5%
— Layer Norms (per layer × 12)	7,680	0.04%
Final RMS Norm	320	0.002%
Total	20,324,160	100%

With a compact 4K vocabulary, embedding takes only 12.9% of the parameter budget — leaving 87.1% for the transformer stack. This is an unusually efficient allocation, made possible by TokenMonster's English-optimized compression.

Architectural Foundations: The Depth Delusion & Width vs. Depth Tradeoffs

Stentor3-20M implements a "Width over Depth" philosophy informed by recent research into small model scaling.

The Depth Delusion (Wu et al., 2026): Argues that for AI models, increasing depth has diminishing returns. Stentor3-20M uses a balanced 12-layer stack with a hidden size of 320 to maximize overall efficiency.
The Depth Myth (Izumoto, 2026): Synthesizes findings from 30+ papers to argue that modern LLMs are systematically too deep and narrow.

Head-to-Head: StentorLabs Model Family

Comparison table across all generations

Property	Stentor-12M (Gen 1)	Stentor-30M (Gen 1)	Stentor2-12M	Stentor2-30M	Stentor3-20M	Stentor3-50M	Portimbria-150M
Vocabulary	32,768 (Mistral BPE-32k)	32,768 (Mistral BPE-32k)	8,064 (TokenMonster-8k)	8,064 (TokenMonster-8k)	4,096 (TokenMonster-4k)	4,096 (TokenMonster-4k)	32,768 (Mistral BPE-32k)
Hidden Size	192	256	256	512	320	512	768
Intermediate Size	576	1,024	512	1,024	1,280	1,536	2,048
Num Layers	9	21	12	10	12	15	20
Attention Heads	3	4	4	8	10	8	6
KV Heads	3 (MHA)	4 (MHA)	4 (MHA)	8 (MHA)	2 (GQA)	2 (GQA)	2 (GQA)
Head Dimension	64	64	64	64	32	64	128
Context Length	512	512	1,024	1,024	4,096	4,096	4,096
Total Parameters	12.0M	30.4M	12.3M	30.4M	20.3M	49.4M	151.0M
Embedding Share	52.2%	27.6%	16.8%	13.6%	12.9%	8.5%	16.7%
Tied Embeddings	Yes	Yes	Yes	Yes	No	No	Yes
Training Tokens	~200M	~600M	~480M	~800M	~8.5B	~8.3B	~6B
Training Hardware	2× T4	1× T4	2× T4	2× T4	TPU v5e-8	TPU v5e-8	TPU v5e-8
Training Time	~1.3h	~7.88h	~5h	~6.75h	~8.15h	~8.74h	~8h
Best Perplexity	89.01	33.02	26.61	18.07	14.02	9.64	18.00
Tokenizer	Mistral BPE-32k	Mistral BPE-32k	TokenMonster-8k	TokenMonster-8k	TokenMonster-4k	TokenMonster-4k	Mistral BPE-32k

Cross-generation comparison guide:

Gen 1 (Stentor-12M, Stentor-30M) and Portimbria-150M all use Mistral BPE-32k and are the closest cross-gen comparisons. Other architectural differences (context length, GQA, training data) still affect results, but the shared tokenizer makes comparisons more meaningful than across tokenizer families.

Gen 2 (Stentor2-12M, Stentor2-30M) use TokenMonster-8k and are only directly comparable within Gen 2.

Gen 3 (Stentor3-20M, Stentor3-50M) use TokenMonster-4k and are only directly comparable within Gen 3.

For a controlled strength comparison between Stentor3 models, Portimbria, and other similar-scale models evaluated under identical conditions, see the Benchmark Results section.

PPL comparison caveat: Perplexity is not directly comparable across tokenizer families. A smaller vocabulary mechanically lowers raw perplexity. Within Gen 3, the comparison between Stentor3-20M (14.02) and Stentor3-50M (9.64) is valid — same tokenizer, same training setup.

Memory Requirements

How much VRAM you need depends on precision and whether you're generating (which activates the KV cache). The table below covers a single sequence at full 4096-token context — KV cache scales linearly, so at 1024 tokens it's roughly ¼ of the values shown.

Precision	Weights	KV Cache (4096 ctx)	Total VRAM
FP32	~81 MB	~25 MB	~106 MB
FP16 / BF16	~41 MB	~13 MB	~54 MB
INT8	~21 MB	~13 MB	~34 MB
INT4	~11 MB	~13 MB	~24 MB

KV cache note: GQA (2 KV heads) reduces the KV cache by 80% vs standard MHA at the same hidden size. Formula: 2 (K+V) × 2 (KV heads) × 32 (head_dim) × 12 (layers) × seq_len × bytes_per_element.

Weights note: Weights are saved as FP32 in safetensors. Cast on load with torch_dtype=torch.float16 or torch_dtype=torch.bfloat16 to halve weight memory. INT8/INT4 figures require bitsandbytes quantization as shown in the Quantization section.
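The formula in the KV cache note can be evaluated directly. The sketch below is an estimate only: the function names are hypothetical, megabytes are decimal (10^6 bytes), and activation memory and framework overhead are not counted. The architecture values come from the Model Architecture table.

```python
def kv_cache_bytes(num_kv_heads: int, head_dim: int, num_layers: int,
                   seq_len: int, bytes_per_element: int) -> int:
    """KV cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_element


def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory needed to hold the weights at a given precision."""
    return num_params * bytes_per_param


NUM_PARAMS = 20_324_160
HEAD_DIM, NUM_LAYERS, SEQ_LEN = 32, 12, 4_096

# 2 bytes per element corresponds to FP16 / BF16.
gqa_cache = kv_cache_bytes(2, HEAD_DIM, NUM_LAYERS, SEQ_LEN, 2)
mha_cache = kv_cache_bytes(10, HEAD_DIM, NUM_LAYERS, SEQ_LEN, 2)  # hypothetical MHA variant
reduction = 1 - gqa_cache / mha_cache

fp16_weights = weight_bytes(NUM_PARAMS, 2)

print(f"FP16 weights:        {fp16_weights / 1e6:.1f} MB")
print(f"FP16 KV cache (GQA): {gqa_cache / 1e6:.1f} MB")
print(f"FP16 KV cache (MHA): {mha_cache / 1e6:.1f} MB")
print(f"KV cache reduction:  {reduction:.0%}")
```

Changing `SEQ_LEN` shows the linear scaling with context length described above.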

🚀 Quick Start

1. Install Dependencies

pip install transformers torch safetensors tokenmonster

2. Load the Model

from transformers import AutoModelForCausalLM
import torch

# TokenMonsterTokenizer is saved alongside the model and loaded via trust_remote_code
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
model = model.eval()

ℹ️ TokenMonster tokenizer. Stentor3-20M uses a compact 4,096-token TokenMonster vocabulary. trust_remote_code=True loads the bundled tokenmonster_hf.py wrapper automatically. You also need pip install tokenmonster for the underlying binary.

3. Generate Text

prompt = "The history of computing began"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(next(model.parameters()).device)
attention_mask = torch.ones_like(input_ids)
with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.15,  # this card recommends at least 1.15
        pad_token_id=tokenizer.eos_token_id,
    )
generated = output[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))

Pipeline usage & recommended generation settings

4. Using the Pipeline

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed to load the bundled TokenMonster tokenizer
)
result = pipe(
    "Neural networks are computational models",
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.15,  # this card recommends at least 1.15
    do_sample=True,
)
print(result[0]["generated_text"])

5. Recommended Generation Settings

These settings are based on manual testing and personal preference — not verified numerical benchmarks.

Parameter	Recommended Range	Notes
`temperature`	0.5 – 0.8	Lower end for focused/factual output; upper end for more varied generation
`top_p`	0.8 – 0.9	Nucleus sampling range that works well in practice
`repetition_penalty`	1.15 – 1.4	Strongly recommended; prevents repetition loops common in small base LMs
`max_new_tokens`	40 – 4,096	Depends on task

⚠️ Important Limitations

Not Instruction-Tuned: This is a base model. It will continue text, not follow instructions.
No Safety Tuning: No RLHF, no DPO, no content filtering.
Limited Factual Reliability: 20M parameters cannot store reliable world knowledge.
Context Window: Hard limit of 4,096 tokens.
English Only: TokenMonster english-4096 is optimized for English; other languages will tokenize poorly.
TokenMonster Required: Loading the tokenizer requires pip install tokenmonster and trust_remote_code=True.
Repetition Without Penalty: Always use repetition_penalty ≥ 1.15.

📋 Honest Notices

Candid first-hand observations about this model

These are candid observations based on direct generation testing.

It generates coherent sentences. The model reliably produces grammatical English. Even on confusing or unusual prompts, outputs are mostly real words in valid sentences — just often off-topic ones.
It struggles to stay on topic. The most consistent failure mode. A prompt about exercise leads to a rambling passage about muscle balance and pain management. A prompt about books triggers a response about programming language acronyms. The model grasps the subject area but veers into loosely related territory within a few sentences. Longer, more specific prompts help significantly.
It generates Unicode replacement characters (❑). A small number of responses contain ❑ (U+FFFD) characters. The model was not trained on StenCore-PDF and has no obvious clean source for this token — it likely appeared in code comments or data artifacts in the web/code corpus. Uncommon but reproducible.
It will generate code even when the prompt has nothing to do with code. Several average responses on non-code prompts contain code-like syntax, variable names, or programming terminology. The code training data bleeds into general text generation, especially on short or ambiguous prompts like "Hi!!!" or "1 + 1 =".
Decent code structure for its size and training data, but not reliable for autocomplete. Given a Python code prompt, the model produces recognizable structure — for loops, print statements, function calls. The logic is wrong and the code would not run. It has clear potential if trained on significantly more code, but is not suitable for production use as-is.
Horrible at math. "1 + 1 =" produces a repetition loop about the value of a variable. The model knows that math prompts involve numbers and formulas but has no arithmetic ability.
Repeats a lot, especially under short or ambiguous prompts. "Hi!!!" collapses into a loop about variable values. "1 + 1 =" repeats the same phrase more than a dozen times before hitting the 200-token limit. A repetition penalty of at least 1.15 is essential — without it, short prompts frequently collapse into tight loops.

Training Infrastructure

Hardware, software stack & throughput details

Hardware

Component	Specification
Accelerator	Google Cloud TPU v5e
Chip Configuration	8-chip pod slice (v5e-8)
Active Training Processes	8 (one per chip via torchrun + PJRT)
Global Batch Tokens/Step	262,144 (8 × 4,096 × 8 processes)
Platform	Kaggle Notebooks (free tier)
Orchestration	HuggingFace Accelerate + torchrun
Process Group Init	`env://` (XLA backend)

Software Stack

Package	Role
PyTorch 2.6	Core tensor operations
torch_xla 2.6	XLA/TPU backend
HuggingFace Transformers	Model architecture (LlamaForCausalLM)
HuggingFace Accelerate	Distributed training orchestration
HuggingFace Datasets	Data loading and streaming
safetensors	Model serialization

Throughput

Metric	Value
Average global tokens/sec	~304,000
Per-chip tokens/sec	~38,000
Total training tokens	~8,500,000,000
Total wall-clock time	29,323s (~8.15h)
Optimizer steps completed	32,425 (full run — no session deadline hit)

Training Hyperparameters — Complete Reference

Full hyperparameter tables (optimizer, batch, schedule, checkpointing)

Core Training Parameters

Hyperparameter	Value	Notes
`learning_rate`	8e-4	Peak AdamW LR
`weight_decay`	0.01	Applied to Linear weights only
`max_grad_norm`	1.0	Gradient clipping
`optimizer`	AdamW	`betas=(0.9, 0.95)`, `eps=1e-8`
`scheduler`	Cosine	With linear warmup
`warmup_steps`	1,621	5% of max_train_steps
`stable_steps`	25,940	80% of max_train_steps (WSD reference; cosine scheduler used)
`max_train_steps`	32,425	Full run completed
`token_budget`	8,500,000,000	Total training token cap
`source_token_budget`	8,500,000,000	Source data token cap
`seed`	42
`mixed_precision`	bf16	Native TPU BF16

Batch & Sequence Parameters

Hyperparameter	Value	Notes
`per_device_train_batch_size`	8	Per TPU chip
`num_processes`	8	One per chip
`total_batch_size`	64	8 × 8
`block_size`	4,096	Sequence / context length
`tokens_per_optimizer_step`	262,144	`total_batch_size × block_size`
`gradient_accumulation_steps`	1	No accumulation
`num_train_epochs`	1	Token budget exhausted within epoch 0
`pack`	True	Required for TPU static shapes
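The batch figures in the table fit together as simple products. The sketch below recomputes them from the listed hyperparameters. Treating `max_train_steps` as the ceiling of the token budget divided by tokens per step is an assumption about how the value was chosen, although the numbers agree.

```python
# Values copied from the hyperparameter tables.
per_device_train_batch_size = 8
num_processes = 8
block_size = 4_096
token_budget = 8_500_000_000
max_train_steps = 32_425

total_batch_size = per_device_train_batch_size * num_processes
tokens_per_optimizer_step = total_batch_size * block_size

# Ceiling division: the smallest step count that covers the whole budget.
steps_needed = -(-token_budget // tokens_per_optimizer_step)
tokens_seen = tokens_per_optimizer_step * max_train_steps

print(f"total_batch_size:          {total_batch_size}")
print(f"tokens_per_optimizer_step: {tokens_per_optimizer_step:,}")
print(f"steps needed for budget:   {steps_needed:,}")
print(f"tokens seen in full run:   {tokens_seen:,}")
```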

Evaluation & Checkpointing

Hyperparameter	Value
`eval_steps`	1,000
`best_eval_steps`	1,000
`best_eval_start_step`	1,000
`max_eval_samples`	5,000

AdamW Optimizer — Detailed

Decay group: All nn.Linear weight matrices → weight_decay = 0.01
No-decay group: Bias terms, normalization parameters, embedding parameters → weight_decay = 0.0
Betas: (0.9, 0.95)
Epsilon: 1e-8
Fused kernel: Enabled when CUDA available (not applicable on TPU)

Learning Rate Schedule

Phase 1 — Warmup (steps 0–1,621):
LR ramps linearly from 0 → 8e-4
Phase 2 — Cosine Decay (steps 1,621–32,425):
LR decays from 8e-4 → 0 following a cosine curve
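The two phases can be written as one function of the step index. This is a sketch of the schedule shape described above, not the training script itself: the function name `learning_rate` is hypothetical, and the exact scheduler implementation used in the run is not shown in this card.

```python
import math

PEAK_LR = 8e-4
WARMUP_STEPS = 1_621
MAX_TRAIN_STEPS = 32_425


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        # Phase 1: ramp linearly from 0 to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Phase 2: progress runs from 0.0 at the end of warmup to 1.0 at the final step.
    progress = (step - WARMUP_STEPS) / (MAX_TRAIN_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 800, 1_621, 17_023, 32_425):
    print(f"step {step:>6,}: lr = {learning_rate(step):.6f}")
```

Halfway through the decay phase (step 17,023) the rate is half the peak, which is a quick sanity check when plotting a logged learning-rate curve.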

Precision Stability Recipe

FP32 norm patching, critical layer wrapping & recipe summary

Training on TPU v5e in BF16 requires deliberate precision management to avoid gradient instabilities at 20M scale.

1. FP32 Normalization Layers (25 modules)

All RMSNorm modules are monkey-patched to compute in FP32:

def _fp32_norm_forward(hidden_states, *args, _orig=original_forward, **kwargs):
    input_dtype = hidden_states.dtype
    # Run the original norm in FP32, then cast back to the caller's dtype.
    output = _orig(hidden_states.float().contiguous(), *args, **kwargs)
    if torch.is_floating_point(output):
        output = output.to(input_dtype)
    return output

Count: 12 layers × 2 norms each + 1 final norm = 25 modules total.

2. FP32 Critical Layers (2 layers)

The first and last transformer layers run their entire forward pass in FP32:

Weights remain in their training dtype; inputs are cast to .float() on entry
torch.amp.autocast("cuda", enabled=False) prevents re-downcasting

Rationale: Boundary layers — where embeddings project in and logits project out — are most sensitive to numerical precision. Wrapping them in FP32 provides a stable floor at minimal compute cost.

3. FP32 Attention Softmax — Skipped

Not applied. PyTorch SDPA handles softmax numerical stability internally and requires FP16/BF16 inputs for its optimized code paths on both CUDA and XLA.

Recipe Summary

Technique	Count	Scope
FP32 norm modules	25	All RMSNorm layers
FP32 critical layers	2	First + last transformer layers
FP32 softmax modules	0	Skipped — SDPA incompatible

Data Pipeline

Training data sources, curriculum design & preprocessing details

Training used a web/code/math curriculum with the following source mix:

Source	Dataset	Ratio
Web	`epfml/FineWeb-HQ` (CC-MAIN-2024-51)	75%
Code	`Cyrile/dataset-the-stack-v2-dedup-sub` (Python)	15%
Math	`HuggingFaceTB/finemath` (finemath-4plus)	10%

Total tokens processed: ~8,500,000,000 (single epoch over source data, full 32,425-step run completed)

Curriculum Design

Training used a curriculum anneal over the final 15% of the token budget, upweighting code and math relative to web text. This front-loads web generalization while ensuring the model sees a higher concentration of structured/formal content near the end of training.

Text Preprocessing

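import unicodedata

def clean_text(text: str, preserve_linebreaks: bool = False) -> str:
    # Map visually equivalent Unicode forms to one canonical form.
    text = unicodedata.normalize("NFKC", text)
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if preserve_linebreaks:
        # Code: keep line structure, drop trailing whitespace only.
        lines = [line.rstrip() for line in text.splitlines()]
        text = "\n".join(lines).strip()
    else:
        # Web/math: drop empty lines and collapse all whitespace runs.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = " ".join(lines)
        text = " ".join(text.split())
    return text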

NFKC normalization maps visually-equivalent Unicode to canonical form
Linebreak preservation for code samples (not applicable to web/math)
Whitespace collapse for web/math text

Sequence Packing

Samples are packed into fixed 4,096-token blocks. Labels are identical to input_ids (causal LM objective). No cross-document attention masking is applied between packed samples — this is standard practice for web-text pretraining.
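A minimal sketch of this kind of packing is shown below. It is not the training pipeline's code: the function name `pack_sequences` is hypothetical, and two details are assumptions that the card does not state, namely that an EOS token separates documents and that a trailing partial block is dropped.

```python
from typing import Dict, Iterable, Iterator, List


def pack_sequences(tokenized_docs: Iterable[List[int]], block_size: int,
                   eos_token_id: int) -> Iterator[Dict[str, List[int]]]:
    """Concatenate documents and cut the stream into fixed-size blocks."""
    buffer: List[int] = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_token_id)  # assumed document separator
        while len(buffer) >= block_size:
            block, buffer = buffer[:block_size], buffer[block_size:]
            # Labels equal input_ids; the causal shift is applied inside the model.
            yield {"input_ids": block, "labels": list(block)}
    # Any leftover tokens shorter than block_size are dropped (assumption).


# Toy example with block_size=4 instead of 4,096.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_sequences(docs, block_size=4, eos_token_id=0))
for block in blocks:
    print(block["input_ids"])
```

Every block has the same length, which is what gives the TPU the static shapes mentioned in the hyperparameter table. The second block in the toy example spans two documents, illustrating that no attention boundary is inserted between them.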

Weight Initialization

Initialization scheme & residual scaling code

import math
import torch.nn as nn

def initialize_weights(model, std=0.02, num_hidden_layers=12):
    # Residual-path projections get a smaller std so the residual stream
    # does not grow with depth at initialization.
    residual_std = std / math.sqrt(2.0 * num_hidden_layers)  # ≈ 0.00408 for 12 layers
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
        elif isinstance(module, nn.Linear):
            # Scaled-down std for output projections (residual path)
            proj_std = residual_std if name.endswith(("o_proj", "down_proj")) else std
            module.weight.data.normal_(mean=0.0, std=proj_std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif "rmsnorm" in type(module).__name__.lower():
            # RMSNorm scales start at identity.
            if module.weight is not None:
                module.weight.data.fill_(1.0)

Residual projections (o_proj, down_proj) use scaled-down std (0.02 / sqrt(2 × 12) ≈ 0.00408) to prevent residual stream explosion at initialization, following the GPT-2 convention.
All other Linear layers use std=0.02.
RMSNorm scales start at 1.0 (identity).

Evaluation & Results

Training loss & perplexity curves, family comparison, full checkpoint history

Training Loss Curve

Validation Perplexity Curve

Final result: best validation loss 2.6406 — perplexity 14.02.
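Perplexity here is the exponential of the mean cross-entropy loss in nats, which is how the two numbers above relate. A minimal check (the function name is hypothetical):

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)


best_eval_loss = 2.6406
best_ppl = perplexity(best_eval_loss)
print(f"loss {best_eval_loss} -> perplexity {best_ppl:.2f}")
```

Because the loss is averaged per token, this perplexity is per TokenMonster-4k token and is not comparable to values computed with a different tokenizer.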

Full Checkpoint History

Step	Eval Loss	Perplexity	Notes
1,000	4.8125	~122.9	First best checkpoint
2,000	4.2812	~72.4
3,000	4.0625	~58.2
4,000	3.6875	~39.9
5,000	3.3906	~29.7
6,000	3.2812	~26.6
7,000	3.1719	~23.9
8,000	3.0938	~22.1
10,000	3.0156	~20.4
11,000	2.9375	~18.9
12,000	2.9219	~18.6
13,000	2.8906	~18.0
14,000	2.8750	~17.7
15,000	2.8594	~17.4
17,000	2.7812	~16.1
18,000	2.7656	~15.9
20,000	2.7031	~14.9
23,000	2.6875	~14.7
24,000	2.6719	~14.5
29,000	2.6406	14.02	Best checkpoint saved
Final (step 32,425)	2.6406	14.02	Full run completed

Benchmark Results

All benchmarks are 0-shot unless otherwise noted. GSM8K was run at both 0-shot and 5-shot.

Stentor3-20M Benchmarks

Common Sense Reasoning

Benchmark	Metric	Score
Winogrande	acc	47.51%
HellaSwag	acc_norm	26.17%
CommonsenseQA	acc	19.57%

World Knowledge & Science

Benchmark	Metric	Score
ARC-Easy	acc_norm	33.75%
ARC-Challenge	acc_norm	22.53%
ARC Average		28.14%
OpenBookQA	acc_norm	28.00%
TruthfulQA (MC2)	acc	49.00%

Multitask Knowledge (MMLU)

Benchmark	Metric	Score
MMLU Average	acc	22.95%
MMLU — Humanities	acc	[placeholder]
MMLU — Other	acc	[placeholder]
MMLU — Social Sciences	acc	[placeholder]
MMLU — STEM	acc	[placeholder]

Mathematics

Benchmark	Metric	Score
GSM8K (5-Shot)	exact_match	0.15%
GSM8K (0-Shot)	exact_match	0.76%

Comparison against peer models, analysis & evaluation script

Comparison Against Peer Models

All models were evaluated by me under identical conditions — same script, same hardware (Kaggle 2× T4), same lm-eval-harness version, same batch size, same seed. All Stentor3-20M scores are 0-shot except GSM8K 5-shot. All peer model scores were produced under the same conditions. Portimbria-150M was re-evaluated with this script rather than reusing numbers from its model card; if any scores match the card, that means the model produced identical results under both evaluation setups.

Bold = best in column. Underline = 2nd best. Avg↑Random = mean of (score − baseline) across all 10 benchmarks; the TruthfulQA baseline is the lowest model score in the table (DistilGPT2, 0.4447), since the task has no random-chance floor.

Model	OpenBookQA (acc_norm)	TruthfulQA (acc)	Winogrande (acc)	CommonsenseQA (acc)	HellaSwag (acc_norm)	ARC-Easy (acc_norm)	ARC-Challenge (acc_norm)	MMLU (acc)	GSM8K 0-sh (flexible)	GSM8K 5-sh (flexible)	Avg↑Random
Stentor3-50M	0.2640	0.5000	0.5028	0.1957	0.2638	0.2950	0.2406	0.2295	0.0061	0.0197	0.0123
Stentor3-20M	0.2800	0.4900	0.4751	0.1957	0.2617	0.3375	0.2253	0.2295	0.0076	0.0015	0.0109
DistilGPT2	0.2680	0.4447	0.5043	0.1957	0.2751	0.3674	0.2287	0.2290	0.0129	0.0159	0.0147
Pythia-31M	0.2660	0.4695	0.4925	0.1957	0.2726	0.3426	0.2133	0.2292	0.0106	0.0121	0.0110
Pythia-70M	0.2520	0.4701	0.5185	0.1966	0.2741	0.3472	0.2193	0.2296	0.0061	0.0076	0.0127
TinyStories-33M	0.2480	0.4764	0.5241	0.1884	0.2716	0.2681	0.2329	0.2448	0.0083	0.0068	0.0075
Portimbria-150M	0.2460	0.4694	0.5288	0.1990	0.2745	0.3375	0.2253	0.2317	0.0061	0.0197	0.0144
Random / Baseline	0.250	0.4447	0.500	0.200	0.250	0.250	0.250	0.250	—	—	—
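
As a check on how Avg↑Random is built, the sketch below recomputes it from two rows of the table. It assumes a baseline of 0 for the two GSM8K columns, since the table lists no value for them; `avg_over_random` is an illustrative name, not part of the evaluation script.

```python
# Baselines from the "Random / Baseline" row, in column order.
# ASSUMPTION: the two GSM8K columns use a baseline of 0.0 (the table lists no value).
BASELINES = [0.250, 0.4447, 0.500, 0.200, 0.250, 0.250, 0.250, 0.250, 0.0, 0.0]

# Scores copied from the comparison table, same column order.
SCORES = {
    "Stentor3-20M": [0.2800, 0.4900, 0.4751, 0.1957, 0.2617, 0.3375, 0.2253, 0.2295, 0.0076, 0.0015],
    "DistilGPT2":   [0.2680, 0.4447, 0.5043, 0.1957, 0.2751, 0.3674, 0.2287, 0.2290, 0.0129, 0.0159],
}

def avg_over_random(scores, baselines=BASELINES):
    """Mean of (score - baseline) over all benchmark columns."""
    return sum(s - b for s, b in zip(scores, baselines)) / len(scores)

for model, scores in SCORES.items():
    print(f"{model}: {avg_over_random(scores):.4f}")
```

Under that assumption the two rows come out at 0.0109 and 0.0147, matching the Avg↑Random column.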

Analysis

OpenBookQA: Stentor3-20M leads at 0.2800 — highest in the table, notably beating all larger models including Stentor3-50M (0.2640) and Portimbria-150M (0.2460).

ARC-Easy: Stentor3-20M scores 0.3375, tied with Portimbria-150M for 4th, behind DistilGPT2 (0.3674), Pythia-70M (0.3472), and Pythia-31M (0.3426). Competitive for a 20M model.

TruthfulQA MC2: 0.4900, second only to Stentor3-50M (0.5000) — a strong result for this scale.

GSM8K: Near-zero across all models at this scale. Math reasoning at 20M parameters is essentially absent.

Where Stentor3-20M trails: Winogrande, HellaSwag, and GSM8K 5-shot. These are data-hungry tasks where the larger token budgets of Pythia (300B) and more general web-text coverage pull ahead. At 20M parameters and ~8.5B tokens (of which only ~6.4B are English web text), the model has less capacity and fewer training examples than the Pythia series for these pattern-completion tasks.

Overall: Stentor3-20M competes effectively with models 1.5–3× its size on several benchmarks. The Avg↑Random of 0.0109 places it near Pythia-31M (0.0110) despite having fewer parameters, fewer total training tokens, and a much smaller vocabulary.

Evaluation Setup (for Stentor3-20M)

Benchmarks were run on Kaggle with 2× Tesla T4 GPUs using the script below. Blocks are checkpointed to run_summary.json so a single failure never stops the rest, and completed blocks are skipped on re-run.

import os, sys, subprocess, json, time, re, threading, random, glob
from pathlib import Path
from datetime import datetime

# ── Environment & Networking ──────────────────────────────────────────────────
os.environ["TOKENIZERS_PARALLELISM"]      = "false"
os.environ["CUDA_LAUNCH_BLOCKING"]        = "0"
os.environ["NCCL_P2P_DISABLE"]            = "1"
os.environ["NCCL_IB_DISABLE"]             = "1"
os.environ["NCCL_SHM_DISABLE"]            = "1"
os.environ["NCCL_SOCKET_IFNAME"]          = "eth0"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"]   = "1"

# ── Install Deps ──────────────────────────────────────────────────────────────
print(f"\n{'='*75}", flush=True)
print("  [Setup] Installing dependencies (this may take a minute)...", flush=True)

subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "accelerate", "transformers", "hf_transfer",
                "tokenmonster", "diskcache", "litellm[caching]",
                "einops", "sentencepiece", "tiktoken"],
               check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "git+https://github.com/EleutherAI/lm-evaluation-harness.git"],
               check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

print("  ✅  Dependencies installed.", flush=True)

# ── Patch lm_eval (TokenMonster fix) ─────────────────────────────────────────
get_path_cmd = [sys.executable, "-c", "import lm_eval.models.huggingface as hf; print(hf.__file__)"]
path_result = subprocess.run(get_path_cmd, capture_output=True, text=True)

if path_result.returncode == 0:
    hf_model_file = path_result.stdout.strip()
    try:
        with open(hf_model_file, "r") as f:
            code = f.read()

        safe_code = code.replace(
            "assert len(continuation_enc) > 0",
            "continuation_enc = continuation_enc if len(continuation_enc) > 0 else [0]; assert len(continuation_enc) > 0"
        )

        with open(hf_model_file, "w") as f:
            f.write(safe_code)

        print("  ✅  Patched lm_eval safely.", flush=True)

    except Exception as e:
        print(f"  ⚠️  Could not patch lm_eval: {e}", flush=True)
else:
    print("  ⚠️  Could not locate lm_eval file.", flush=True)

# ── HF Token Auth ─────────────────────────────────────────────────────────────
hf_token = os.environ.get("HF_TOKEN")

if "KAGGLE_KERNEL_RUN_TYPE" in os.environ:
    try:
        from kaggle_secrets import UserSecretsClient
        hf_token = UserSecretsClient().get_secret("HF_TOKEN")
    except Exception:
        pass

if hf_token:
    os.environ["HF_TOKEN"] = hf_token
    subprocess.run(["huggingface-cli", "login", "--token", hf_token],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print("  ✅  HF authenticated.", flush=True)
else:
    print("  ⚠️  No HF_TOKEN found.", flush=True)

# ── Config ────────────────────────────────────────────────────────────────────
MODEL = "StentorLabs/Stentor3-20M"
DTYPE = "float16"
BATCH = "32"
SEED  = 42
OUT   = "./results"

MODEL_ARGS = f"pretrained={MODEL},dtype={DTYPE},trust_remote_code=True"
SUMMARY_PATH = f"{OUT}/run_summary.json"

# ── Pre-download Model ────────────────────────────────────────────────────────
print(f"  [Setup] Pre-downloading {MODEL}...", flush=True)

pre_dl_cmd = [
    sys.executable, "-c",
    f"from transformers import AutoConfig, AutoTokenizer; "
    f"AutoConfig.from_pretrained('{MODEL}', trust_remote_code=True); "
    f"AutoTokenizer.from_pretrained('{MODEL}', trust_remote_code=True)"
]

subprocess.run(pre_dl_cmd, capture_output=True, text=True)

print("  ✅  Model ready.", flush=True)
print(f"{'='*75}\n", flush=True)

# ── Evaluation Blocks ─────────────────────────────────────────────────────────
BLOCKS = [
    ("block1", "OpenBookQA · TruthfulQA", "openbookqa,truthfulqa_mc2", 0, None),
    ("block2", "Winogrande · CommonsenseQA",     "winogrande,commonsense_qa",      0, None),
    ("block3", "HellaSwag",                      "hellaswag",                      0, None),
    ("block4", "ARC-Easy · ARC-Challenge",       "arc_easy,arc_challenge",         0, None),
    ("block5", "MMLU",                           "mmlu",                           0, None),
    ("block6", "GSM8K (0-shot)",                 "gsm8k",                          0, None),
    ("block7", "GSM8K (5-shot)",                 "gsm8k",                          5, None),
]

LAUNCH_BASE = [
    "accelerate", "launch",
    "--multi_gpu",
    "--num_processes=2",
    "--mixed_precision=fp16",
]

# ── Helpers ───────────────────────────────────────────────────────────────────
ANSI_ESCAPE   = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
DEBUGGER_NOISE = re.compile(r"(Debugger warning|frozen modules|PYDEVD|make the debugger|pass -X|Note: Debugging)")

log_lock  = threading.Lock()
log_state = {"last_print": 0.0}

def ts():
    return datetime.now().strftime("%H:%M:%S")

def stream(proc):
    def _read(pipe):
        for raw in iter(pipe.readline, ""):
            line = ANSI_ESCAPE.sub('', raw.strip())
            if not line or DEBUGGER_NOISE.search(line):
                continue
            is_progress = any(x in line for x in ["%|", "it/s", "s/it", "MB/s", "GB/s", "Block"])
            with log_lock:
                now = time.time()
                if is_progress:
                    if now - log_state["last_print"] > 2.0:
                        print(f"  [{ts()}] {line}", flush=True)
                        log_state["last_print"] = now
                else:
                    print(f"  [{ts()}] {line}", flush=True)

    t1 = threading.Thread(target=_read, args=(proc.stdout,), daemon=True)
    t2 = threading.Thread(target=_read, args=(proc.stderr,), daemon=True)
    t1.start(); t2.start()
    proc.wait()
    t1.join(); t2.join()

def extract_metrics(output_dir):
    extracted = {}
    for file_path in glob.glob(f"{output_dir}/**/*.json", recursive=True):
        try:
            with open(file_path) as f:
                data = json.load(f)
            results = data.get("results", data.get("metrics", {}))
            for task, metrics in results.items():
                if isinstance(metrics, dict):
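                    # Take the first metric that is present. Comparing against
                    # None (instead of chaining `or`) keeps a legitimate 0.0 score.
                    score = next(
                        (metrics[k]
                         for k in ("acc_norm", "acc", "exact_match", "word_perplexity")
                         if metrics.get(k) is not None),
                        None,
                    )
                    if score is not None:
                        extracted[task] = round(score, 4)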
        except Exception:
            pass
    return extracted

# ── Run ───────────────────────────────────────────────────────────────────────
Path(OUT).mkdir(parents=True, exist_ok=True)

summary = {}
if os.path.exists(SUMMARY_PATH):
    try:
        with open(SUMMARY_PATH) as f:
            summary = json.load(f)
    except Exception:
        summary = {}

for i, (name, title, tasks, fewshot, extra) in enumerate(BLOCKS, 1):

    if summary.get(name, {}).get("status") == "ok":
        print(f"⏭️  Skipping {title} (already done)")
        continue

    print(f"\n{'='*75}")
    print(f"  BLOCK {i}/{len(BLOCKS)} — {title}")
    print(f"{'='*75}\n")

    port = str(29500 + random.randint(1, 999))
    cmd = LAUNCH_BASE + [
        "--main_process_port", port,
        "-m", "lm_eval",
        "--model", "hf",
        "--model_args", MODEL_ARGS,
        "--batch_size", BATCH,
        "--seed", str(SEED),
        "--tasks", tasks,
        "--num_fewshot", str(fewshot),
        "--output_path", f"{OUT}/{name}_lmeval",
    ]

    t0 = time.time()
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        stream(proc)
        elapsed = round((time.time() - t0) / 60, 1)

        if proc.returncode == 0:
            metrics = extract_metrics(f"{OUT}/{name}_lmeval")
            summary[name] = {"status": "ok", "elapsed_min": elapsed, "metrics": metrics}
            print(f"✅ Done in {elapsed} min")
        else:
            summary[name] = {"status": "failed", "exit_code": proc.returncode, "elapsed_min": elapsed}
            print(f"❌ Failed ({proc.returncode})")

    except Exception as e:
        summary[name] = {"status": "failed", "error": str(e)}

    with open(SUMMARY_PATH, "w") as f:
        json.dump(summary, f, indent=2)

    time.sleep(2)

# ── Final Summary ─────────────────────────────────────────────────────────────
passed = sum(1 for v in summary.values() if v.get("status") == "ok")
print(f"\n{'='*75}")
print(f" DONE — {passed}/{len(BLOCKS)} runs succeeded")
print(f"{'='*75}")

for k, v in summary.items():
    icon = "✅" if v.get("status") == "ok" else "❌"
    print(f"  {icon} {k:<15} {v.get('elapsed_min', '—')} min")

print(f"\nSummary saved → {SUMMARY_PATH}\n")

if any(v.get("status") == "failed" for v in summary.values()):
    sys.exit(1)
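
One caveat when reading the per-block metrics: recent lm-evaluation-harness releases write result keys with a filter suffix (for example `acc_norm,none` or `exact_match,flexible-extract`) rather than bare metric names. The helper below is a hedged sketch and not part of the script above; `pick_score` and `PREFERRED` are illustrative names. It matches on the metric name before the comma and checks types rather than truthiness, so a score of 0.0 is kept.

```python
PREFERRED = ("acc_norm", "acc", "exact_match", "word_perplexity")

def pick_score(metrics, preferred=PREFERRED):
    """Return (key, value) for the first preferred metric found, else None.

    Accepts both bare keys ("acc") and suffixed keys ("acc,none").
    Standard-error entries such as "acc_stderr,none" are skipped because
    their base name is not in `preferred`.
    """
    for name in preferred:
        for key, value in metrics.items():
            base = key.split(",", 1)[0]
            if base == name and isinstance(value, (int, float)):
                return key, round(value, 4)
    return None

# Toy result entry shaped like a suffixed lm-eval task result.
example = {
    "alias": "gsm8k",
    "exact_match,flexible-extract": 0.0,
    "exact_match_stderr,flexible-extract": 0.0,
}
print(pick_score(example))
```

When a task reports the same metric under several filters, this returns whichever appears first in the JSON, so it is worth checking that the chosen key is the one the tables report.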

Model Outputs

10 real generation samples — 4 average, 3 cherry-picked good, 3 cherry-picked bad

All generations: top_p=0.85, temperature=0.6, max_new_tokens=200, repetition_penalty=1.25. Outputs that stop mid-sentence were cut off at the 200-token limit; this is expected behavior and a consequence of the token budget, not a flaw of the model. The first 4 are average responses (the first 4 collected, with no cherry-picking), the next 3 are cherry-picked good, and the last 3 are cherry-picked bad.

Average responses

1. Prompt: "Exercise can improve strength and energy levels. Many people walk, run, or play sports every week. Regular physical activity is often connected to better health. Exercise is important for"

"making, which helps in maintaining the balance of muscle and body temperature. The use of the physical activity in different parts of the body is very effective in preventing any other diseases. The use of the physical activity in different parts of the body is very effective in the treatment of pain. The technique involves the use of the physical activity, which is essential for the maintaining the balance of muscles and bones. It can be used as a tool for controlling muscles and muscles, as it helps in maintaining the balance of muscles and bones. This can be done by using a physical activity in different parts of the body. The physical activity is simple and can be done by doing a small amount of exercise. It can be used to control muscle mass, which also helps in maintaining muscle mass. The physical activity in these parts of the body is more effective in reducing muscle mass. The physical activity in these parts of the body"

✅ Good: Stays broadly in the health/exercise domain. Produces grammatical English throughout. Correctly associates exercise with muscles, balance, and pain management.

❌ Bad: Repetitive: "physical activity in different parts of the body" appears three times, and the near-identical "physical activity in these parts of the body" twice more. Begins mid-sentence ("making,"). Wanders from exercise to pain treatment to muscle control without meaningful progression.

2. Prompt: "The internet is used for communication, entertainment, and education. Many students read articles and watch videos online every day. Websites provide information on many different topics. People use the internet to"

"When you read your website, you are not aware of any content that comes from textual or other sources. The Internet is used to communicate with users. Some of the most common types of Internet are: - Today, the Internet has become a popular choice for people seeking information about their web sites. This type of Internet is used by people who can help them understand what they want them to do in their domains. This is why it is useful in searching for relevant information about the internet. - Today, it can be useful for people seeking information about their web sites. - Today, it is also useful for individuals to seek information about their web sites. For example, the Internet can help you find information about your web sites, and get information about your web sites. - Today, this type of Internet is used by people who know who they are and how they want to do things. - Today, it can be useful for those who have access"

✅ Good: Recognizes the internet domain. Produces a list format with - Today, bullets, showing awareness of document structure. Grammatically clean.

❌ Bad: "Today" appears five times in a near-identical structure, a repetition pattern rather than meaningful variation. "Web sites" used repeatedly in a narrow, circular way. The opening sentence ("When you read your website") redirects the prompt oddly.

3. Prompt: "Music is listened to in many countries around the world. Different styles of music are popular with different groups of people. Songs are often used for relaxation, celebration, or entertainment. Music is popular because"

"The use of the word 'song' is a powerful tool that can help you understand and appreciate the meaning of the words. The use of the word 'song' is not just a means to add a song to your musical style. It is also an effective tool that can help you to find a song for your musical style. It also allows you to choose from various forms of music that can be used to create your music. Here are a few tips to help you visualize these words: 1. Writing songs for musical instruments: When you are writing songs for a certain time, you will be able to read it in one place. You will need to have some time to write music. 2. Have a good reading plan. 3. Have a list of all the musical instruments that you use in your music. 4. Have a list of all the musical instruments that you use in your music. 5. Have a list of all the instruments that you use in your music. 6. Have a great review. 7. Have a"

✅ Good: Stays on the music topic throughout. Transitions into a numbered how-to list, which is a coherent document structure. Points 1–3 are at least related to music.

❌ Bad: Points 3 and 4 are identical ("Have a list of all the musical instruments that you use in your music"), and point 5 differs by only one word. The first two sentences both open with "The use of the word 'song'".

4. Prompt: "Books have been used for learning and entertainment for many years. Libraries contain information about science, history, and literature. Reading can improve vocabulary and understanding. Books are useful because"

"lines can help to identify the words in the text. A variety of sources include: - Analysis of the words in the text - Analyze the word from ancient times - Recommendate the term in a sentence by using a combination of different languages such as ALTR, ASP, or TEMP. - Experiment with various languages such as TIA, IBM, etc. - Examples of a specific language that has multiple meanings in one language. - Explore the meaning of the word. - Description of phrases in other languages. The words are also used when speaking English. - Scala or Scala: A word that is often used to refer to a particular language. The word is usually used to describe the language. - Subject: A word that is often used to describe the language. - Outside: A word that is often used to describe the language. The"

✅ Good: Produces a consistent list format. Grammatically intact.

❌ Bad: Completely off-topic — the model pivoted from books to a glossary of programming/technical language acronyms (ALTR, ASP, TIA, IBM, Scala). "Recommendate" is not a word. The prompt is about books and reading; the output has nothing to do with either.

Cherry-picked good

5. Prompt: "Healthy food provides vitamins and nutrients for the body. Fruits, vegetables, and proteins are common parts of balanced meals. Drinking enough water is also important for daily health. A healthy diet usually includes"

"A food that contains essential nutrients, such as vitamin B6, minerals, omega-3 fatty acids, and nuts, is high in fiber, vitamin C, and minerals. It is important to keep in mind that the supplement is only as simple as possible. Consuming a healthy diet with low-calorie-rich foods can help you achieve your goals. However, it is recommended to maintain proper calories in your diet for healthy weight loss. It can be convenient to enhance the overall health of your body, reducing your risk of heart disease. It is also important to keep in mind that low-calorie foods are great sources of vitamins and nutrients for healthy weight loss. To ensure that your body has regular exercise, you should make sure that you have healthy fats and nutri"

✅ Good: Correctly names real nutrients (vitamin B6, omega-3, fiber, vitamin C). Stays on the diet/health topic throughout. Produces a more coherent flow than the average samples.

❌ Bad: The contradiction "low-calorie-rich foods" is semantically broken.

6. Prompt:

import random
import numpy as np
numbers = [1, 2, 3]

array.sort_values(by="r")
for i in range(len(numbers)):
    print('Number of positive integers:', number[0] + 1)
    print(numbers[i] - 1)
    print(numbers[i] - 1)
    print()
#print(numbers)
 for i in range(len(numbers)):
    #print(numbers[i])
    print(numbers[i] - 1)
    print(numbers[i] - 1)
    #print(numbers[i] - 1)
    print(numbers[i] - 1)
if __name__ == '__main__':
    print("Search for "+ str(numbers)) transformation

✅ Good: The model correctly continues the code prompt with Python syntax. It uses the numbers variable from the prompt, writes a valid for loop structure, uses range(len(...)), includes # comments, and correctly formats an if __name__ == '__main__': block. This is recognizable, structured Python for a model trained on ~1.2B code tokens.

❌ Bad: array.sort_values(by="r") references a variable that doesn't exist. number[0] should be numbers[0]. The second loop is a near copy of the first with no meaningful difference. transformation at the end is a stray token. The code would not run correctly, but the structure is there.

7. Prompt: ";kd.ks'xsdlkfa'jmclsgdf'z.s" (intentionally nonsensical to test gibberish continuation)

"7. Use data to determine the relationship between the variables and data. 8. Problem solving. 9. Use data to identify the relationship between two variables. 10. Problem solved."

✅ Good: Rather than continuing the gibberish, the model pivoted to producing a coherent numbered list — grammatically clean and internally consistent. It is not instruction-tuned so there was no expectation it would identify the input as gibberish; the goal was coherent output regardless of input, and it delivered that. The response is short and self-contained.

❌ Bad: The numbered list starts at 7, implying a continuation of something that doesn't exist. The content (data analysis steps) has nothing to do with the input and reflects the code/data corpus bleeding through, not actual understanding of the prompt.

Cherry-picked bad

8. Prompt: "Hi!!!"

"UnitValues. The value of the data can be calculated by dividing the number of values by the number of values and the number of values for the variable. For example, the data can be calculated using the following formula: The value of each value le is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable"

✅ Good: Grammatically coherent for the first two sentences.

❌ Bad: "Hi!!!" triggers a data-analysis response with no connection to the prompt. Enters a repetition loop ("The value of the variable is given by:" repeated six times). "value le" is a broken word. This is the clearest example of how short, ambiguous prompts destabilize the model.

9. Prompt: "1 + 1 ="

"# of variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable he"

✅ Good: Recognizes a math context and responds in a math/formula register.

❌ Bad: Pure repetition loop — "The value of the variable is given by:" repeated twelve times. Cannot do arithmetic. The response has no relationship to the answer 2.

10. Prompt: "Import"

"bed., a randomly_value=True) bed., a , ❑❑❑❑❑❑❑❑❑❑ > 0.1.85, ❑❑❑❑❑ [], ❑❑❑❑❑ @nyagridcore,❑❑❑❑❑ #####################"

✅ Good: The model correctly interprets "Import" as a code token and responds with code-like output. randomly_value=True and @nyagridcore are code-shaped, even if invalid.

❌ Bad: Contains multiple ❑ (Unicode replacement character U+FFFD) sequences — the most visible occurrence of this artifact in testing. The output is not valid code or valid text. "0.1.85" is not a valid version string. "bed." is a stray token. This is the model's worst-case output: token artifacts, broken syntax, and garbage characters all at once.
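
The repetition loops noted in these samples can be quantified rather than counted by eye. The following is a minimal sketch, assuming simple whitespace tokenization; `repeated_ngrams` is a hypothetical helper and was not used to produce the commentary above.

```python
from collections import Counter

def repeated_ngrams(text, n=6, min_count=2):
    """Count word n-grams that occur at least `min_count` times in `text`."""
    words = text.split()
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {g: c for g, c in grams.items() if c >= min_count}

# Synthetic example: one 8-word phrase repeated five times.
sample = "The value of the variable is given by: " * 5
top = max(repeated_ngrams(sample, n=8).items(), key=lambda kv: kv[1])
print(top)
```

Running it over a real generation with `n` set to the length of the suspected phrase gives an exact repeat count for that phrase.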

Training Dynamics

Step-by-step training phase breakdown & throughput details

The training run processed approximately 8.5 billion tokens across a single epoch (epoch 0), running for 32,425 optimizer steps — the full planned run, completed without hitting the session deadline.

Early training (steps 0–1,621, warmup phase): LR ramped linearly from 0 to peak. Loss dropped quickly from above 5.0. First best checkpoint recorded at step 1,000 (eval loss 4.8125).

Mid training (steps 1,621–32,425, cosine decay phase): Smooth and consistent loss reduction. Gradient norms were well-behaved in the 0.34–0.60 range for most steps, with one notable spike (step 13,200: grad_norm 6.625 — recovered cleanly in the next logged step with no apparent impact on the loss trajectory) and a late spike at step 29,700 (grad_norm 13.8125 — also recovered cleanly). New best checkpoints recorded at steps 1,000 / 2,000 / 3,000 / 4,000 / 5,000 / 6,000 / 7,000 / 8,000 / 10,000 / 11,000 / 12,000 / 13,000 / 14,000 / 15,000 / 17,000 / 18,000 / 20,000 / 23,000 / 24,000 / 29,000.
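
The two phases described above correspond to a standard warmup-plus-cosine schedule. The sketch below shows its shape under stated assumptions: the peak learning rate is not given in this section, so `peak_lr` is a parameter, and the decay floor `min_lr` defaulting to 0 is an assumption. `lr_at` is an illustrative name, not the training code.

```python
import math

WARMUP_STEPS = 1_621    # from the run description
TOTAL_STEPS = 32_425    # from the run description

def lr_at(step, peak_lr, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr.

    ASSUMPTION: min_lr defaults to 0.0; the floor used in the actual run
    is not stated in this section.
    """
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# peak_lr=1.0 expresses the schedule as a fraction of the peak learning rate.
for step in (0, 1_621, 17_023, 32_425):
    print(step, round(lr_at(step, peak_lr=1.0), 4))
```

Step 17,023 is the midpoint of the decay phase, where this schedule sits at half of the peak.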

Run completion: The full 32,425-step run completed within the session window. The best model (step 29,000, eval loss 2.6406) had already been saved. Final eval loss 2.6406, perplexity 14.02.

Throughput: ~304,000 global tokens/sec average (~38,000 per chip), with a brief XLA warmup window reset at step 600.

Total wall-clock time: ~8.15 hours (29,323s total, including eval, save, and subprocess overhead).
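
The figures in this section can be cross-checked against each other with a little arithmetic. The sketch below uses only the numbers quoted above; the inputs are approximate, so the derived values are too, and the perplexity line assumes the eval loss is a natural-log cross-entropy.

```python
import math

total_tokens = 8.5e9        # approximate, from the run description
steps = 32_425
global_tps = 304_000        # approximate average tokens/sec
per_chip_tps = 38_000       # approximate
wall_clock_s = 29_323
final_eval_loss = 2.6406

tokens_per_step = total_tokens / steps          # tokens per optimizer step
chips = global_tps / per_chip_tps               # implied number of chips
compute_hours = total_tokens / global_tps / 3600
wall_hours = wall_clock_s / 3600
perplexity = math.exp(final_eval_loss)          # perplexity = exp(cross-entropy loss)

print(f"tokens/step   ~ {tokens_per_step:,.0f}")
print(f"chips         ~ {chips:.0f}")
print(f"compute hours ~ {compute_hours:.2f} of {wall_hours:.2f} wall-clock")
print(f"perplexity    ~ {perplexity:.2f}")
```

This works out to roughly 262k tokens per step, 8 chips, and a perplexity of 14.02. The gap between compute hours and wall-clock hours is consistent with the eval, save, and subprocess overhead mentioned above.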

Use Cases & Intended Uses

Use Case	Suitability	Notes
Studying transformer training dynamics at 20M scale	✅ High	Full architecture, hyperparameters, and training curves published
Benchmarking 4K-context inference latency	✅ High	Realistic long-context workload
Quantization / conversion pipeline testing	✅ High	Standard architecture, no custom ops
Teaching material for LLM courses	✅ High	Fully documented, reproducible from scratch
Edge deployment experiments	✅ High	~41MB in FP16; extremely feasible on modern edge hardware
Domain-specific fine-tuning research	✅ High	Standard transformers; fine-tune like any LLaMA model
Speculative decoding draft model	✅ Medium	Fast enough to draft; uses rare TokenMonster-4k vocab so finding a vocab-matched target is difficult — cross-vocabulary draft is possible but lowers acceptance rate
Text continuation / creative writing	✅ Medium	Good fluency; limited thematic fidelity
Code completion prototyping	❌ Not suitable	Generates code-shaped output but it doesn't work — see Honest Notices
Factual Q&A	❌ Not suitable	Unreliable world knowledge at this scale
Production deployment	❌ Not suitable	No safety tuning
Non-English text	❌ Not suitable	Training data is English-heavy
Instruction following	❌ Not suitable	Base model only
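
For the edge-deployment row, a rough weight-size estimate at other precisions follows from the ~41MB FP16 figure. This is a back-of-the-envelope sketch: it assumes MB means 10^6 bytes, derives the parameter count from the FP16 size, and ignores quantization scales, activations, and the KV cache.

```python
FP16_SIZE_MB = 41  # approximate FP16 size quoted in the table (MB = 10**6 bytes assumed)

# Implied parameter count at 2 bytes per weight.
params = FP16_SIZE_MB * 1e6 / 2

BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weight_megabytes(n_params, bytes_per_weight):
    """Raw weight storage in MB, ignoring quantization scales and runtime buffers."""
    return n_params * bytes_per_weight / 1e6

for name, width in BYTES_PER_WEIGHT.items():
    print(f"{name:>9}: ~{weight_megabytes(params, width):.1f} MB")
```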

Out-of-Scope Uses

Any user-facing application — No safety filtering, no alignment, no factual reliability.
Medical, legal, or financial advice — Cannot reason reliably over specialized knowledge.
Generating content about real people — Will fabricate.
Automated content pipelines — Output quality is insufficient for unreviewed publication.
Instruction following — This is a base next-token predictor.

Ethical Considerations & Societal Impact

Data biases, safety considerations & societal impact

Inherited Data Biases

Trained on FineWeb-HQ, The Stack v2 dedup subset, and FineMath-4+ — all derived from web-scraped data. The model inherits:

Western-centric perspective — English-language web text skews toward Western viewpoints and cultural contexts.
English monolingualism — TokenMonster english-4096 is optimized for English. Other languages will produce high fertility and poor quality.
Demographic underrepresentation — Groups underrepresented in English web text will be underrepresented in outputs.
Code ecosystem bias — The code training data covers Python only (Cyrile/dataset-the-stack-v2-dedup-sub). Other languages are not represented.
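
"Fertility" in the monolingualism point is commonly defined as the average number of tokens per word; higher values mean text is split into more, shorter pieces. A minimal sketch of how it could be measured, assuming whitespace-delimited words; `fertility` and `toy_tokenize` are illustrative names, and the toy tokenizer stands in for the real one.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a sequence of tokens,
    e.g. `lambda s: tokenizer(s)["input_ids"]` for a Hugging Face tokenizer.
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words if n_words else float("nan")

# Toy stand-in tokenizer (hypothetical): splits every word into 3-character pieces.
def toy_tokenize(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

print(fertility(["the cat sat"], toy_tokenize))           # short words, low fertility
print(fertility(["internationalization"], toy_tokenize))  # long word, high fertility
```

Comparing this number on English and non-English samples with the model's tokenizer would show how much longer non-English inputs become.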

No Safety Tuning

No RLHF, DPO, constitutional AI, or content filtering of any kind has been applied.

Positive Aspects

Democratizing AI research — Trained entirely on free Kaggle TPU compute.
Full transparency — Complete training hyperparameters, architecture, and training curves published.
Minimal environmental footprint — ~8 hours of TPU compute is negligible versus large-scale pretraining runs.

Inference Guide

CPU inference (INT8) & GPU inference (FP16) code

CPU Inference (INT8 Dynamic Quantization)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor3-20M")
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
# Dynamically quantize the Linear layers to INT8 for CPU inference
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(),
    {torch.nn.Linear},
    dtype=torch.qint8,
)
inputs = tokenizer("The laws of physics state that", return_tensors="pt")
with torch.inference_mode():
    output = model_int8.generate(**inputs, max_new_tokens=80, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

GPU Inference (FP16)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
def generate(prompt, max_new_tokens=100, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(generate("Once upon a time in a distant kingdom"))

🚀 Free Inference — Try It Now

No GPU, no setup, no API key required.

StentorLabs hosts a free demo space for all Stentor models:

🔗 https://huggingface.co/spaces/StentorLabs/StentorLabs-demo_space

1. Core Functional Features

⚡ Generate Tab The primary interface for standard text completion.

Model Selection: Allows switching between the flagship Portimbria-150M, various Stentor3 and Stentor2 versions, and base Stentor models.
Generation Presets: Includes quick-toggle chips (Creative, Balanced, Focused) that automatically adjust Temperature, Top-P, and Max Tokens.
Multi-Response Mode: Users can generate up to 5 sequential completions for the same prompt to observe the variance in the model's logic.
Live Metrics: Displays real-time stats including total tokens generated, elapsed time, and speed in tokens per second (t/s).
Persistent History: Maintains a visual log of recent generations with timestamps and parameters.

🔬 Token Explorer Tab A unique diagnostic feature for understanding model internal confidence.

Probability Heatmap: Generates text and color-codes each token based on the model's confidence level (e.g., Green for ≥80%, Red for <35%).
Alternative Inspection: Hovering over any generated token reveals the top 8 alternatives the model was considering at that specific moment, along with their respective probabilities.
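
The idea behind the heatmap and the alternatives list can be illustrated with a softmax over next-token logits. This is a sketch of the general technique, not the demo's actual code: the vocabulary and logits are toy values, and the label for the band between the two quoted thresholds is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_alternatives(logits, vocab, k=8):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:k]

def confidence_band(p):
    """Bucket a probability using the two thresholds quoted above.

    ASSUMPTION: the band between 35% and 80% is labelled "mid" here; the
    source does not say which colors the demo uses for it.
    """
    if p >= 0.80:
        return "green"
    if p < 0.35:
        return "red"
    return "mid"

vocab = ["cat", "dog", "car", "sky"]   # toy vocabulary (hypothetical)
logits = [4.0, 1.0, 0.5, 0.0]          # toy next-token logits
best_token, best_p = top_alternatives(logits, vocab, k=3)[0]
print(best_token, round(best_p, 3), confidence_band(best_p))
```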

🌡 Temperature Sweep Tab A visualization tool to study how "creativity" parameters affect output.

Parallel Generation: Runs the same prompt across 2 to 3 different temperature settings simultaneously.
Divergence Analysis: Allows users to see exactly where a model stays rigid (low temp) vs. where it becomes chaotic or creative (high temp).

🏟 Arena Tab A competitive benchmarking environment for comparing Stentor models against popular baselines.

Match Modes:
- Show: Open comparison where model names are visible.
- Blind: Identity is hidden until a vote is cast to prevent brand bias.
- Random: Pairs a random Stentor model against a baseline (like Pythia, OPT, or GPT-2).
Fair Matchmaking: Implements logic to pair models of similar parameter counts (e.g., 20M vs 31M) to ensure a balanced comparison.
Voting System: Users can vote for the better response, a tie, or "both bad," which reveals the hidden identities.

💬 Chat Tab An interface for multi-turn dialogue.

Custom Prompting: Uses a tag-based formatting style to help the base models maintain conversational context despite not being explicitly instruction-tuned in many cases.
Session State: Uses gr.State to track message history across the conversation.

Supported Models

The app maintains a library of:
- Internal Models: Portimbria-150M, Stentor3 (50M/20M), Stentor2 (30M/12M), and original Stentor (30M/12M + Instruct).
- Arena Baselines: DistilGPT2, Pythia (14M through 160M), gpt2 small, SmolLM2-135M, OPT-125M, and GPT-Neo.

Quantization

FP16, BF16 & 4-bit (bitsandbytes) quantization code

FP16 (GPU)

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)

BF16

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.bfloat16,
)

4-bit (bitsandbytes)

pip install bitsandbytes accelerate

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load weights in 4-bit and run compute in FP16
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    quantization_config=bnb_config,
    device_map="auto",
)

🌍 Community Contributions — Build on This Model

Stentor3-20M is built by an independent solo researcher, not a large corporate AI lab. That means it doesn't have teams of engineers running downstream experiments — that's where you come in. This model is Apache 2.0 licensed and is explicitly intended to be modified, extended, and redistributed.

Here are things StentorLabs actively encourages the community to try:

Fine-tune it on your domain — instruction tuning, domain adaptation, RLHF, DPO, anything goes
Quantize it — 4-bit, 8-bit, GGUF, GPTQ, AWQ, ONNX, all highly encouraged
Convert it to other formats — GGUF for llama.cpp, ONNX for deployment, CoreML for Apple Silicon
Run LoRA or QLoRA to adapt it cheaply on consumer hardware
Use it for speculative decoding with a larger target that shares its TokenMonster-4k vocabulary (see the Speculative Decoding section for compatibility limits)
Benchmark it on other benchmarks not covered here, or re-run the evaluation script in a Kaggle 2×T4 environment to reproduce and verify the results
Publish your work — fine-tunes, quantized versions, adapters, research findings, derivative models, anything

If you build something with Stentor3-20M, please share it on HuggingFace and tag or link back to the base model. Every community result makes this model more useful for everyone.

LoRA / QLoRA Starter Configuration

Starter config, recommended hyperparameters & QLoRA note

If you haven't fine-tuned a Llama-family model before, here's a proven starting point for Stentor3-20M:

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor3-20M")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # LoRA rank — try 32 if underfitting
    lora_alpha=32,           # alpha = 2× rank is a reliable default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; the total should be close to the ~20.3M base parameters plus the LoRA adapters
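
To sanity-check what `print_trainable_parameters()` reports, the LoRA parameter count can be worked out by hand: each adapted linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below uses hypothetical layer dimensions purely for illustration; the real values should be read from the model's `config.json`.

```python
def lora_trainable_params(layer_shapes, r, num_layers):
    """Count LoRA weights: each adapted (d_in, d_out) linear layer adds an
    A matrix with r * d_in weights and a B matrix with d_out * r weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


# Hypothetical dimensions for illustration only, NOT Stentor3-20M's real config.
hidden, kv_dim, layers = 256, 128, 8
shapes = [
    (hidden, hidden),  # q_proj
    (hidden, kv_dim),  # k_proj (narrower than q_proj under GQA)
    (hidden, kv_dim),  # v_proj
    (hidden, hidden),  # o_proj
]
print(lora_trainable_params(shapes, r=16, num_layers=layers))
```

Doubling `r` doubles the adapter size, which is worth keeping in mind when moving from rank 16 to rank 32 on a model this small.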

Recommended fine-tuning hyperparameters:

Hyperparameter	Value	Notes
Learning rate	3e-5 – 1e-4	Small models are sensitive to high LRs during SFT
Optimizer	AdamW	`betas=(0.9, 0.999)`, `eps=1e-8`
LR scheduler	Cosine with warmup	~5% warmup steps
Batch size	8–32	Per device; scale with gradient accumulation
Epochs	2–5	Watch for overfitting after epoch 2
Max sequence length	512–2048	Up to 4096 is supported
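
The "cosine with warmup" row can be made concrete with a small schedule function. This is a minimal sketch assuming linear warmup over 5% of the steps and decay to zero; the `cosine_lr` name and its defaults are illustrative, and training libraries ship their own equivalents.

```python
import math


def cosine_lr(step, total_steps, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, end of warmup, midpoint of decay, and final step.
for step in (0, 50, 525, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```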

For QLoRA (4-bit quantized base + LoRA adapters on top), add BitsAndBytesConfig(load_in_4bit=True) when loading the base model — the LoRA config and training hyperparameters above apply unchanged. This lets you fine-tune on a single consumer GPU with ~4–6 GB VRAM.

Format Conversion

Convert to GGUF (llama.cpp) & ONNX

Convert to GGUF (llama.cpp)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt
huggingface-cli download StentorLabs/Stentor3-20M --local-dir stentor3-20m
python convert_hf_to_gguf.py stentor3-20m/ \
    --outfile stentor3-20m.gguf \
    --outtype f16
# llama-quantize and llama-cli must be built first; see the llama.cpp build instructions
./llama-quantize stentor3-20m.gguf stentor3-20m-q4_k_m.gguf q4_k_m
./llama-cli -m stentor3-20m-q4_k_m.gguf -p "The history of computing" -n 100

Convert to ONNX

pip install "optimum[exporters]"
optimum-cli export onnx \
    --model StentorLabs/Stentor3-20M \
    --task text-generation-with-past \
    stentor3-20m-onnx/

Speculative Decoding

Stentor3-20M is a compact draft model that is fast and small enough to meaningfully accelerate a larger target model's inference. The caveat is its tokenizer: TokenMonster-4k is not used by any mainstream LLM, so it is not vocabulary-compatible with Llama/Mistral/GPT-family models (which use 32K+ BPE vocabularies). Native speculative decoding (where the draft and target share token IDs) requires a vocabulary-compatible target.
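
For intuition about why shared token IDs matter, the sketch below shows one greedy speculative step using toy stand-ins for the two models. The `draft_next` and `target_next` callables are hypothetical placeholders, not Stentor or Transformers APIs, and a real implementation would verify all proposed tokens in one batched target forward pass rather than one call per token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative step: the draft proposes k tokens, the target
    keeps the longest agreeing prefix and then contributes one token itself."""
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    accepted = []
    for tok in proposed:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the miss
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token when all match
    return accepted


def target_next(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next))  # draft misses on the 4th token
print(speculative_step([4], draft_next, target_next))  # draft matches every token
```

The comparison `tok != expected` is only meaningful when both models assign the same ID to the same token, which is the compatibility requirement discussed in this section.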

Speculative decoding options & vocabulary compatibility notes

Option A — Use with a vocabulary-matched target (recommended)

For maximum acceptance rates, you need a target model that either uses the same TokenMonster 4K vocab or is fine-tuned from Stentor3-20M itself (e.g. an instruction-tuned version). When both models share token IDs, HuggingFace assisted generation works directly:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
).to("cuda")

# Target must share the same TokenMonster 4K vocabulary
target_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/YourVocabCompatibleModel",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)

inputs = tokenizer("Explain the concept of recursion:", return_tensors="pt").to("cuda")
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    do_sample=True,
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option B — Cross-vocabulary draft (lower acceptance rate)

You can still use Stentor3-20M as a fast heuristic draft model alongside a BPE-vocabulary target (e.g. Llama-3.2-1B) using custom speculative decoding logic, but HuggingFace's built-in assistant_model will not work out of the box because token IDs will not align. This requires a vocabulary remapping layer or a custom decoding loop and is an active research area.

Vocabulary compatibility summary: Stentor3-20M uses a compact 4,096-token TokenMonster vocabulary. This is fundamentally different from the 32K+ BPE vocabularies used by Llama, Mistral, GPT-2, and most other public models. Built-in Hugging Face speculative decoding only works between models sharing an identical token-to-ID mapping.
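
A quick way to test whether two models share a token-to-ID mapping is to compare their vocabulary dictionaries. The sketch below uses toy dictionaries and a hypothetical `vocab_mismatches` helper; with real models, the dictionaries would come from each tokenizer's `get_vocab()` method.

```python
def vocab_mismatches(draft_vocab, target_vocab, limit=5):
    """Return up to `limit` tokens whose IDs differ or that exist in only one vocab."""
    tokens = sorted(set(draft_vocab) | set(target_vocab))
    diffs = [
        (tok, draft_vocab.get(tok), target_vocab.get(tok))
        for tok in tokens
        if draft_vocab.get(tok) != target_vocab.get(tok)
    ]
    return diffs[:limit]


# Toy vocabularies for illustration only.
draft = {"the": 0, " cat": 1, " sat": 2}
same = dict(draft)
other = {"the": 0, " cat": 2, " sat": 1, " mat": 3}

print(vocab_mismatches(draft, same))   # an empty list means the IDs align
print(vocab_mismatches(draft, other))  # lists the tokens that disagree
```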

Related Work

Models benchmarked alongside Stentor3-20M & StentorLabs model generations

Comparable Models (Benchmarked Alongside This Model)

Model	Parameters	Vocab	Context	Notes
Stentor3-20M (this model)	20M	4K TokenMonster	4,096	~8.5B tokens, TPU v5e-8
Stentor3-50M	49M	4K TokenMonster	4,096	Sibling model in Gen 3 lineage
DistilGPT2	82M	50K BPE	1,024	Distilled GPT-2; 40GB WebText
Pythia-70M	70M	50K BPE	2,048	EleutherAI; 300B Pile tokens
Pythia-31M	31M	50K BPE	2,048	EleutherAI; 300B Pile tokens
TinyStories-33M	33M	50K BPE	1,024	Trained on synthetic stories; ~9.3B tokens
Portimbria-150M	151M	32K BPE	4,096	StentorLabs flagship
Stentor2-30M (Gen 2)	30M	8K TokenMonster	1,024	Second generation StentorLabs
Stentor2-12M (Gen 2)	12M	8K TokenMonster	1,024	Second generation StentorLabs

Related Research Papers

Paper	Relevance
Scaling Laws — Kaplan et al., 2020	Informs token budget decisions
Chinchilla — Hoffmann et al., 2022	8.5B tokens for 20M params is ~425 tokens per parameter, well above the roughly 20 tokens per parameter usually cited as Chinchilla-optimal
GQA — Ainslie et al., 2023	Grouped Query Attention used in this model
RoPE — Su et al., 2021	Positional encoding
LLaMA — Touvron et al., 2023	Architecture basis
Pythia — Biderman et al., 2023	Comparable small-model scaling study
Speculative Decoding — Leviathan et al., 2023	Primary deployment use case
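
The token-budget figure in the Chinchilla row is simple arithmetic on the numbers reported in this card. The sketch below assumes the commonly quoted rule of thumb of about 20 tokens per parameter, which is an approximation rather than an exact constant.

```python
def tokens_per_param(train_tokens, params):
    """Ratio of training tokens to model parameters."""
    return train_tokens / params


ratio = tokens_per_param(8.5e9, 20e6)  # ~8.5B tokens, ~20M parameters
chinchilla_ratio = 20  # rule-of-thumb value, an approximation
print(f"{ratio:.0f} tokens per parameter")
print(f"about {ratio / chinchilla_ratio:.0f}x the rule-of-thumb budget")
```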

Environmental Impact

Hardware, duration & estimated carbon

Factor	Value
Hardware	Google Cloud TPU v5e-8
Active Training Duration	~8.15 hours (29,323s total)
Cloud Provider	Google (via Kaggle free tier)
Compute Region	United States
Estimated Carbon	Minimal (< 1.0 kg CO₂e estimated)

The TPU v5e is substantially more energy-efficient per FLOP than comparable GPU hardware. Running on Kaggle's free tier also means no dedicated data center allocation beyond what Kaggle already operates.

Citation

BibTeX

@misc{izumoto2026stentor3120m,
title = {Stentor3-20M},
author = {Kai Izumoto},
year = {2026},
publisher = {StentorLabs},
howpublished = {\url{https://huggingface.co/StentorLabs/Stentor3-20M}},
note = {~20M parameter LlamaForCausalLM base model with GQA trained from scratch
on ~8.5B tokens (FineWeb-HQ, Cyrile/dataset-the-stack-v2-dedup-sub, FineMath-4+)
using a Google Cloud TPU v5e-8 on Kaggle free compute. 4096-token context,
TokenMonster-4k vocabulary (non-tied embeddings). Apache 2.0 license.
Architecture informed by The Depth Delusion (Wu et al., 2026) and
The Depth Myth (Kai Izumoto, 2026).}
}

Model Card Contact

Questions, benchmarks, or feedback: StentorLabs@gmail.com or open a discussion.

Made with ❤️ by StentorLabs

Democratizing AI through accessible, efficient models — trained on free compute, shared with everyone.

Safetensors

Model size: 20.3M params
Tensor type: F32

Papers for StentorLabs/Stentor3-20M

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (arXiv 2305.13245, published May 22, 2023)

Evaluation results (self-reported)

Best Validation Loss on FineWeb-HQ (validation split): 2.641
Best Perplexity on FineWeb-HQ (validation split): 14.020