Stentor3-20M


🔬 Research Artifact & Base Language Model. Stentor3-20M is a next-token predictor — not a chat assistant. It has no safety tuning and should not be deployed in user-facing applications without fine-tuning first. It is, however, a high-quality open foundation: fine-tune it, quantize it, convert it, distill from it, run LoRA on it, adapt it to your domain, or build anything else you can imagine — and please publish your results! See Intended Uses for details.

💡 Built by a solo 14-year-old developer, on a laptop, for $0. Every model StentorLabs has released — including this one — was conceived, designed, coded, and trained without a budget, a team, a GPU cluster, or institutional support. The total cost of producing Stentor3-20M was zero dollars, using free Kaggle TPU quota and publicly available datasets. This is what democratized AI research looks like.


What Is This?

Stentor3-20M is the sibling model to Stentor3-50M in the Stentor3 family — a new generation of StentorLabs models trained on a full web/code/math curriculum with a 4096-token context window.

At ~20M parameters, Stentor3-20M is a base causal language model trained entirely from scratch on free-tier Kaggle compute using a Google Cloud TPU v5e-8 (eight chips). It was trained for a full 32,425 steps on approximately 8.5 billion tokens drawn from a web/code/math curriculum, completing epoch 0 in ~8.1 hours.

Like all StentorLabs models, this is a base next-token predictor, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best suited for research, prototyping, speculative decoding, and infrastructure experiments.

The key architectural features: Grouped Query Attention (10 query heads, 2 KV heads — 5:1 ratio), a compact TokenMonster vocabulary (4,096 tokens), and a 4096-token context window.

GQA at this scale worked cleanly: 10 query heads with 2 KV heads delivered an 80% KV cache reduction at inference time with no training instability observed.


The Stentor3 Name

Why "Stentor3"?

Stentor is a genus of large, trumpet-shaped single-celled organisms: simple in structure, surprisingly capable for their scale. They are among the largest known unicellular life forms, reaching up to 2 mm in length, yet each operates as a single cell with no specialized organs, a fitting metaphor for a compact language model that punches above its weight class. Like the stentor, which can regenerate from fragments and adapt to changing environments, this model is designed to be adaptable: a foundation that can be fine-tuned, quantized, distilled, and adapted to many downstream tasks. The name has anchored nearly every StentorLabs model since the beginning.


📋 Table of Contents

  1. What Is This?

  2. The Stentor3 Name

  3. Model Architecture

  4. Architectural Foundations: The Depth Delusion & Width vs. Depth Tradeoffs

  5. Head-to-Head: StentorLabs Model Family

  6. Quick Start

  7. Memory Requirements

  8. Important Limitations

  9. Honest Notices

  10. Training Infrastructure

  11. Training Hyperparameters — Complete Reference

  12. Precision Stability Recipe

  13. Data Pipeline

  14. Weight Initialization

  15. Evaluation & Results

  16. Benchmark Results

  17. Model Outputs

  18. Training Dynamics

  19. Use Cases & Intended Uses

  20. Out-of-Scope Uses

  21. Ethical Considerations & Societal Impact

  22. Inference Guide

  23. Free Inference — Try It Now

  24. Quantization

  25. Community Contributions

  26. Format Conversion

  27. Speculative Decoding

  28. Related Work

  29. Environmental Impact

  30. Citation


Model Architecture

Stentor3-20M is a LlamaForCausalLM model with Grouped Query Attention (GQA), a 4,096-token TokenMonster vocabulary, and a 4096-token context window.

Component Value Notes
Architecture LlamaForCausalLM Standard transformer decoder
Hidden Size 320
Intermediate Size (FFN) 1,280 SwiGLU activation
Num Hidden Layers 12
Num Attention Heads 10
Num Key/Value Heads 2 GQA — 5:1 query-to-KV ratio
Context Length 4,096 tokens
Head Dimension 32
Vocab Size 4,096 TokenMonster english-4096-strict-nocapcode-v1
Total Parameters 20,324,160
Positional Encoding RoPE rope_theta = 50,000.0
Activation SiLU SwiGLU implementation
Tied Embeddings False Separate input/output embeddings
Implementation SDPA PyTorch Scaled Dot Product Attention

Why GQA?

Grouped Query Attention (10 query heads, 2 KV heads) reduces the KV cache memory footprint by 80% at inference time compared to standard Multi-Head Attention at the same hidden size. At a 4096-token context window this matters substantially: the KV cache for a single sequence is proportional to 2 × num_kv_heads × head_dim × num_layers × seq_len. With 2 KV heads instead of 10, the cache shrinks to one-fifth of its full-MHA equivalent, enabling longer generation on memory-constrained hardware.

GQA at this scale worked cleanly — the training was stable from start to finish. However, I should note that this was not the case for every Stentor model. During development of Stentor2-12M-Preview, GQA (at that smaller 12M scale) was a major contributor to training instability. For the Stentor2 series I switched to full MHA because the KV-head reductions kept collapsing. This experience shaped how I approached GQA for Stentor3: at 20M and especially at 50M (over 2× larger), the stability margin is substantially wider. GQA was a major factor in the 12M instability, but there were other contributing factors (training recipe, hardware precision handling) that compounded the problem. By Stentor3, the new TPU-based training regime provided stable training end-to-end, and GQA worked without issue.

Why TokenMonster-4k?

Two deliberate reasons — both unconventional for a model this size.

Reason 1 — Non-tied embeddings become affordable.

Most small language models use tied (shared) input and output embedding weights. The original motivation comes from Press & Wolf (2016): sharing the two matrices reduces parameter count significantly, acts as a form of regularization, and was shown to improve perplexity in early recurrent models. With a 32K vocabulary at this model's hidden size of 320, the two matrices together would consume ~21M parameters (2 × 32,768 × 320) on top of a ~17.7M transformer backbone, more than doubling the model. Tying halves that cost, so at large vocabularies tying is practically necessary for small models.

With a 4K vocabulary the math changes completely. Each embedding matrix is only ~1.3M parameters. Keeping them separate adds just 1.3M parameters — 6.4% of Stentor3-20M's total. That's an affordable price to let the two matrices specialize.

Why does that specialization matter? The input embedding is asked to encode a token into a vector that represents its meaning in context. The output projection is asked to decode a hidden state into which token should come next. These are related but genuinely distinct tasks. A 2026 paper (Weight Tying Biases Token Embeddings Towards the Output Space, Lopardo et al., arXiv 2603.26663) provides mechanistic evidence that tied weights skew the shared matrix toward output prediction at the cost of input representation quality — and that this trade-off hurts more as model scale increases. The Qwen3 family explicitly ties embeddings for its smaller models and unties them for larger ones, acknowledging that the calculus shifts with scale.

My opinion, not tested: At 20M parameters with a 4K vocab, the untied design likely gives the model better input representations without meaningfully harming output quality. The 12.9% embedding share is high compared to 70B-parameter models (where vocab params are a rounding error), but it is tiny relative to typical models at this size — most comparable models spend 16–33% of their parameter budget on embeddings at 32K vocab. The extra 1.3M parameters are well spent. This is a design preference I believe in; I have not run an ablation to prove it.
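The embedding arithmetic behind Reason 1 can be checked with a short sketch. It assumes the hidden size of 320 and the 17,702,720 non-embedding parameters (12 layers plus the final norm) listed in this card, and it reads "32K" as 32,768 entries. The helper name `embedding_cost` is illustrative and not part of any library.

```python
HIDDEN_SIZE = 320
BACKBONE_PARAMS = 17_702_720  # 12 transformer layers + final norm, from this card


def embedding_cost(vocab_size: int, tied: bool) -> tuple[int, float]:
    """Return (embedding parameters, their share of total parameters)."""
    one_matrix = vocab_size * HIDDEN_SIZE
    emb = one_matrix if tied else 2 * one_matrix
    return emb, emb / (emb + BACKBONE_PARAMS)


for vocab in (4_096, 32_768):  # 32,768 is an assumed reading of "32K"
    for tied in (True, False):
        emb, share = embedding_cost(vocab, tied)
        label = "tied" if tied else "untied"
        print(f"vocab={vocab:>6} {label:>6}: {emb:>10,} params ({share:.1%} of total)")
```

Under these assumptions, untying at a 4K vocabulary costs one extra 1.3M-parameter matrix, while untying at 32K would add roughly 10.5M parameters to a backbone of under 18M.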

Reason 2 — Each token type gets 8× more training signal.

With 8.5B training tokens spread over 4,096 token types, each type appears on average ~2.1 million times during training. With a 32K (32,768-entry) vocabulary, the same 8.5B tokens would give each type only ~259,000 training examples on average, an 8× reduction. Matching the 4K per-type coverage with a 32K vocabulary would require approximately 68B tokens, which is far beyond what Kaggle's weekly free TPU quota allows.

The acknowledged downside: a 4K vocabulary tokenizes less efficiently than a 32K one — each document produces longer token sequences, which increases context usage and compute per sample. I consider this an acceptable trade-off at this scale. The training signal density gain outweighs the efficiency cost for a model training on free compute with a hard token budget.
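The per-type averages in Reason 2 follow from simple division, sketched below. This is a uniform average only: real token frequencies are heavily skewed, so rare types see far fewer examples than the mean. The value 32,768 is an assumed reading of "32K", and `avg_occurrences` is an illustrative helper name.

```python
TRAIN_TOKENS = 8_500_000_000


def avg_occurrences(vocab_size: int, train_tokens: int = TRAIN_TOKENS) -> float:
    """Average number of training occurrences per vocabulary entry."""
    return train_tokens / vocab_size


small = avg_occurrences(4_096)
large = avg_occurrences(32_768)  # assumed reading of "32K"
ratio = small / large
tokens_for_parity = TRAIN_TOKENS * ratio

print(f"4K vocab : {small:,.0f} occurrences per type")
print(f"32K vocab: {large:,.0f} occurrences per type")
print(f"ratio    : {ratio:.0f}x")
print(f"tokens needed at 32K for parity: {tokens_for_parity / 1e9:.0f}B")
```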

Parameter Count Breakdown

def estimate_llama_params_gqa(vocab_size, hidden_size, intermediate_size,
                              num_hidden_layers, num_attention_heads, num_key_value_heads):
    kv_dim = int(hidden_size * num_key_value_heads / num_attention_heads)
    q_proj = hidden_size * hidden_size
    k_proj = hidden_size * kv_dim
    v_proj = hidden_size * kv_dim
    o_proj = hidden_size * hidden_size
    attn = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * hidden_size * intermediate_size # gate, up, down
    norm = 2 * hidden_size # input + post-attention RMSNorm
    total = 2 * vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total

Plugging in Stentor3-20M values:

kv_dim = 320 × (2/10) = 64
q_proj = 320 × 320 = 102,400
k_proj = 320 × 64  = 20,480
v_proj = 320 × 64  = 20,480
o_proj = 320 × 320 = 102,400
attn/layer = 245,760
mlp/layer = 3 × 320 × 1,280 = 1,228,800
norm/layer = 2 × 320 = 640
per_layer = 1,475,200
embedding = 4,096 × 320 = 1,310,720
lm_head = 4,096 × 320 = 1,310,720  (untied)
layers = 12 × 1,475,200 = 17,702,400
final_norm = 320
total = 1,310,720 + 1,310,720 + 17,702,400 + 320 = 20,324,160 ✓
Component Parameters % of Total
Embedding Table 1,310,720 6.4%
LM Head (untied) 1,310,720 6.4%
Transformer Layers × 12 17,702,400 87.1%
— Attention (per layer × 12) 2,949,120 14.5%
— FFN/MLP (per layer × 12) 14,745,600 72.5%
— Layer Norms (per layer × 12) 7,680 0.04%
Final RMS Norm 320 0.002%
Total 20,324,160 100%

With a compact 4K vocabulary, the two embedding matrices together take only 12.9% of the parameter budget, leaving 87.1% for the transformer stack. This is an unusually efficient allocation, made possible by TokenMonster's English-optimized compression.


Architectural Foundations: The Depth Delusion & Width vs. Depth Tradeoffs

Stentor3-20M implements a "Width over Depth" philosophy informed by recent research into small model scaling.

  • The Depth Delusion (Wu et al., 2026): Argues that increasing model depth has diminishing returns. Stentor3-20M uses a balanced 12-layer stack with a hidden size of 320 to maximize overall efficiency.
  • The Depth Myth (Izumoto, 2026): Synthesizes findings from 30+ papers to argue that modern LLMs are systematically too deep and narrow.

Head-to-Head: StentorLabs Model Family

Comparison table across all generations
Property Stentor-12M (Gen 1) Stentor-30M (Gen 1) Stentor2-12M Stentor2-30M Stentor3-20M Stentor3-50M Portimbria-150M
Vocabulary 32,768 (Mistral BPE-32k) 32,768 (Mistral BPE-32k) 8,064 (TokenMonster-8k) 8,064 (TokenMonster-8k) 4,096 (TokenMonster-4k) 4,096 (TokenMonster-4k) 32,768 (Mistral BPE-32k)
Hidden Size 192 256 256 512 320 512 768
Intermediate Size 576 1,024 512 1,024 1,280 1,536 2,048
Num Layers 9 21 12 10 12 15 20
Attention Heads 3 4 4 8 10 8 6
KV Heads 3 (MHA) 4 (MHA) 4 (MHA) 8 (MHA) 2 (GQA) 2 (GQA) 2 (GQA)
Head Dimension 64 64 64 64 32 64 128
Context Length 512 512 1,024 1,024 4,096 4,096 4,096
Total Parameters 12.0M 30.4M 12.3M 30.4M 20.3M 49.4M 151.0M
Embedding Share 52.2% 27.6% 16.8% 13.6% 12.9% 8.5% 16.7%
Tied Embeddings Yes Yes Yes Yes No No Yes
Training Tokens ~200M ~600M ~480M ~800M ~8.5B ~8.3B ~6B
Training Hardware 2× T4 1× T4 2× T4 2× T4 TPU v5e-8 TPU v5e-8 TPU v5e-8
Training Time ~1.3h ~7.88h ~5h ~6.75h ~8.15h ~8.74h ~8h
Best Perplexity 89.01 33.02 26.61 18.07 14.02 9.64 18.00
Tokenizer Mistral BPE-32k Mistral BPE-32k TokenMonster-8k TokenMonster-8k TokenMonster-4k TokenMonster-4k Mistral BPE-32k

Cross-generation comparison guide:

  • Gen 1 (Stentor-12M, Stentor-30M) and Portimbria-150M all use Mistral BPE-32k and are the closest cross-gen comparisons. Other architectural differences (context length, GQA, training data) still affect results, but the shared tokenizer makes comparisons more meaningful than across tokenizer families.
  • Gen 2 (Stentor2-12M, Stentor2-30M) use TokenMonster-8k and are only directly comparable within Gen 2.
  • Gen 3 (Stentor3-20M, Stentor3-50M) use TokenMonster-4k and are only directly comparable within Gen 3.
  • For a controlled strength comparison between Stentor3 models, Portimbria, and other similar-scale models evaluated under identical conditions, see the Benchmark Results section.

PPL comparison caveat: Perplexity is not directly comparable across tokenizer families. A smaller vocabulary mechanically lowers raw perplexity. Within Gen 3, the comparison between Stentor3-20M (14.02) and Stentor3-50M (9.64) is valid — same tokenizer, same training setup.


Memory Requirements

How much VRAM you need depends on precision and whether you're generating (which activates the KV cache). The table below covers a single sequence at full 4096-token context — KV cache scales linearly, so at 1024 tokens it's roughly ¼ of the values shown.

Precision Weights KV Cache (4096 ctx) Total VRAM
FP32 ~81 MB ~25 MB ~106 MB
FP16 / BF16 ~41 MB ~13 MB ~54 MB
INT8 ~21 MB ~13 MB ~34 MB
INT4 ~11 MB ~13 MB ~24 MB

KV cache note: GQA (2 KV heads) reduces the KV cache by 80% vs standard MHA at the same hidden size. Formula: 2 (K+V) × 2 (KV heads) × 32 (head_dim) × 12 (layers) × seq_len × bytes_per_element.

Weights note: Weights are saved as FP32 in safetensors. Cast on load with torch_dtype=torch.float16 or torch_dtype=torch.bfloat16 to halve weight memory. INT8/INT4 figures require bitsandbytes quantization as shown in the Quantization section.
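The formula in the KV cache note can be evaluated directly. The sketch below uses the architecture values from this card (2 KV heads, head dimension 32, 12 layers) and compares them against a hypothetical full-MHA variant with 10 KV heads. The function names are illustrative, and the figures cover raw tensor storage only, not activations or framework overhead.

```python
def kv_cache_bytes(seq_len, num_kv_heads=2, head_dim=32, num_layers=12,
                   bytes_per_element=2):
    """KV cache size for one sequence: K and V tensors for every layer."""
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_element


def weight_bytes(num_params=20_324_160, bytes_per_param=2.0):
    """Raw weight storage, ignoring quantization metadata and runtime overhead."""
    return num_params * bytes_per_param


MB = 1_000_000
gqa = kv_cache_bytes(4096)                    # 2 KV heads (GQA), 16-bit cache
mha = kv_cache_bytes(4096, num_kv_heads=10)   # hypothetical full-MHA variant

print(f"GQA cache, 16-bit: {gqa / MB:.1f} MB")
print(f"MHA cache, 16-bit: {mha / MB:.1f} MB")
print(f"reduction from GQA: {1 - gqa / mha:.0%}")
print(f"weights, 16-bit: {weight_bytes() / MB:.1f} MB")
```

Passing `bytes_per_element=4` gives the FP32 cache size, and a shorter `seq_len` scales the cache linearly.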


🚀 Quick Start

1. Install Dependencies

pip install transformers torch safetensors tokenmonster

2. Load the Model

from transformers import AutoModelForCausalLM
import torch

# TokenMonsterTokenizer is saved alongside the model and loaded via trust_remote_code
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
model = model.eval()

ℹ️ TokenMonster tokenizer. Stentor3-20M uses a compact 4,096-token TokenMonster vocabulary. trust_remote_code=True loads the bundled tokenmonster_hf.py wrapper automatically. You also need pip install tokenmonster for the underlying binary.

3. Generate Text

prompt = "The history of computing began"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(next(model.parameters()).device)
attention_mask = torch.ones_like(input_ids)
with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.15,
        pad_token_id=tokenizer.eos_token_id,
    )
generated = output[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))

4. Using the Pipeline

import torch
from transformers import pipeline

# device_map="auto" requires the accelerate package to be installed.
pipe = pipeline(
    "text-generation",
    model="StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed for the bundled TokenMonster tokenizer
)
result = pipe(
    "Neural networks are computational models",
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.15,
    do_sample=True,
)
print(result[0]["generated_text"])

5. Recommended Generation Settings

These settings are based on manual testing and personal preference — not verified numerical benchmarks.

Parameter Recommended Range Notes
temperature 0.5 – 0.8 Lower end for focused/factual output; upper end for more varied generation
top_p 0.8 – 0.9 Nucleus sampling range that works well in practice
repetition_penalty 1.15 – 1.4 Strongly recommended; prevents repetition loops common in small base LMs
max_new_tokens 40 – 4,096 Depends on task

⚠️ Important Limitations

  • Not Instruction-Tuned: This is a base model. It will continue text, not follow instructions.

  • No Safety Tuning: No RLHF, no DPO, no content filtering.

  • Limited Factual Reliability: 20M parameters cannot store reliable world knowledge.

  • Context Window: Hard limit of 4,096 tokens.

  • English Only: TokenMonster english-4096 is optimized for English; other languages will tokenize poorly.

  • TokenMonster Required: Loading the tokenizer requires pip install tokenmonster and trust_remote_code=True.

  • Repetition Without Penalty: Always use repetition_penalty ≥ 1.15.


📋 Honest Notices


These are candid observations based on direct generation testing.

  1. It generates coherent sentences. The model reliably produces grammatical English. Even on confusing or unusual prompts, outputs are mostly real words in valid sentences — just often off-topic ones.

  2. It struggles to stay on topic. The most consistent failure mode. A prompt about exercise leads to a rambling passage about muscle balance and pain management. A prompt about books triggers a response about programming language acronyms. The model grasps the subject area but veers into loosely related territory within a few sentences. Longer, more specific prompts help significantly.

  3. It generates Unicode replacement characters (❑). A small number of responses contain (U+FFFD) characters. The model was not trained on StenCore-PDF and has no obvious clean source for this token — it likely appeared in code comments or data artifacts in the web/code corpus. Uncommon but reproducible.

  4. It will generate code even when the prompt has nothing to do with code. Several average responses on non-code prompts contain code-like syntax, variable names, or programming terminology. The code training data bleeds into general text generation, especially on short or ambiguous prompts like "Hi!!!" or "1 + 1 =".

  5. Decent code structure for its size and training data, but not reliable for autocomplete. Given a Python code prompt, the model produces recognizable structure — for loops, print statements, function calls. The logic is wrong and the code would not run. It has clear potential if trained on significantly more code, but is not suitable for production use as-is.

  6. Horrible at math. "1 + 1 =" produces a repetition loop about the value of a variable. The model knows that math prompts involve numbers and formulas but has no arithmetic ability.

  7. Repeats a lot, especially under short or ambiguous prompts. "Hi!!!" collapses into a loop about variable values. "1 + 1 =" repeats the same phrase more than a dozen times before hitting the 200-token limit. A repetition penalty of at least 1.15 is essential — without it, short prompts frequently collapse into tight loops.


Training Infrastructure

Hardware, software stack & throughput details

Hardware

Component Specification
Accelerator Google Cloud TPU v5e
Chip Configuration 8-chip pod slice (v5e-8)
Active Training Processes 8 (one per chip via torchrun + PJRT)
Global Batch Tokens/Step 262,144 (8 × 4,096 × 8 processes)
Platform Kaggle Notebooks (free tier)
Orchestration HuggingFace Accelerate + torchrun
Process Group Init env:// (XLA backend)

Software Stack

Package Role
PyTorch 2.6 Core tensor operations
torch_xla 2.6 XLA/TPU backend
HuggingFace Transformers Model architecture (LlamaForCausalLM)
HuggingFace Accelerate Distributed training orchestration
HuggingFace Datasets Data loading and streaming
safetensors Model serialization

Throughput

Metric Value
Average global tokens/sec ~304,000
Per-chip tokens/sec ~38,000
Total training tokens ~8,500,000,000
Total wall-clock time 29,323s (~8.15h)
Optimizer steps completed 32,425 (full run — no session deadline hit)

Training Hyperparameters — Complete Reference

Full hyperparameter tables (optimizer, batch, schedule, checkpointing)

Core Training Parameters

Hyperparameter Value Notes
learning_rate 8e-4 Peak AdamW LR
weight_decay 0.01 Applied to Linear weights only
max_grad_norm 1.0 Gradient clipping
optimizer AdamW betas=(0.9, 0.95), eps=1e-8
scheduler Cosine With linear warmup
warmup_steps 1,621 5% of max_train_steps
stable_steps 25,940 80% of max_train_steps (WSD reference; cosine scheduler used)
max_train_steps 32,425 Full run completed
token_budget 8,500,000,000 Total training token cap
source_token_budget 8,500,000,000 Source data token cap
seed 42
mixed_precision bf16 Native TPU BF16

Batch & Sequence Parameters

Hyperparameter Value Notes
per_device_train_batch_size 8 Per TPU chip
num_processes 8 One per chip
total_batch_size 64 8 × 8
block_size 4,096 Sequence / context length
tokens_per_optimizer_step 262,144 total_batch_size × block_size
gradient_accumulation_steps 1 No accumulation
num_train_epochs 1 Token budget exhausted within epoch 0
pack True Required for TPU static shapes
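The batch and step figures in these tables are tied together by simple arithmetic, sketched below. Two details are assumptions rather than statements from the training script: that the step count is the token budget divided by tokens per step and rounded up, and that warmup is 5% of the steps rounded down.

```python
import math

PER_DEVICE_BATCH = 8
NUM_PROCESSES = 8
BLOCK_SIZE = 4_096
TOKEN_BUDGET = 8_500_000_000

total_batch_size = PER_DEVICE_BATCH * NUM_PROCESSES
tokens_per_step = total_batch_size * BLOCK_SIZE
# Assumption: the run stops at the first step that reaches the token budget.
max_train_steps = math.ceil(TOKEN_BUDGET / tokens_per_step)
# Assumption: warmup is 5% of the total steps, rounded down.
warmup_steps = int(0.05 * max_train_steps)

print(total_batch_size, tokens_per_step, max_train_steps, warmup_steps)
```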

Evaluation & Checkpointing

Hyperparameter Value
eval_steps 1,000
best_eval_steps 1,000
best_eval_start_step 1,000
max_eval_samples 5,000

AdamW Optimizer — Detailed

  • Decay group: All nn.Linear weight matrices → weight_decay = 0.01

  • No-decay group: Bias terms, normalization parameters, embedding parameters → weight_decay = 0.0

  • Betas: (0.9, 0.95)

  • Epsilon: 1e-8

  • Fused kernel: Enabled when CUDA available (not applicable on TPU)

Learning Rate Schedule

Phase 1 — Warmup (steps 0–1,621):
LR ramps linearly from 0 → 8e-4
Phase 2 — Cosine Decay (steps 1,621–32,425):
LR decays from 8e-4 → 0 following a cosine curve
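The two phases above can be written as a single function of the step number. This is a minimal sketch of a linear-warmup, cosine-decay curve using the values from the hyperparameter table. The card does not show the scheduler implementation used in training, so the exact interpolation here is an assumption.

```python
import math

PEAK_LR = 8e-4
WARMUP_STEPS = 1_621
MAX_STEPS = 32_425


def lr_at(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then cosine decay from PEAK_LR to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


for step in (0, 810, 1_621, 17_023, 32_425):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Step 17,023 is the midpoint of the decay phase, where the cosine curve sits at half the peak rate.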

Precision Stability Recipe

FP32 norm patching, critical layer wrapping & recipe summary

Training on TPU v5e in BF16 requires deliberate precision management to avoid gradient instabilities at 20M scale.

1. FP32 Normalization Layers (25 modules)

All RMSNorm modules are monkey-patched to compute in FP32:

def _fp32_norm_forward(hidden_states, *args, _orig=original_forward, **kwargs):
    # Run the wrapped RMSNorm forward in FP32, then cast back to the input dtype.
    input_dtype = hidden_states.dtype
    output = _orig(hidden_states.float().contiguous(), *args, **kwargs)
    if torch.is_floating_point(output):
        output = output.to(input_dtype)
    return output

Count: 12 layers × 2 norms each + 1 final norm = 25 modules total.

2. FP32 Critical Layers (2 layers)

The first and last transformer layers run their entire forward pass in FP32:

  • Weights remain in their training dtype; inputs are cast to .float() on entry

  • torch.amp.autocast("cuda", enabled=False) prevents re-downcasting

Rationale: Boundary layers — where embeddings project in and logits project out — are most sensitive to numerical precision. Wrapping them in FP32 provides a stable floor at minimal compute cost.

3. FP32 Attention Softmax — Skipped

Not applied. PyTorch SDPA handles softmax numerical stability internally and requires FP16/BF16 inputs for its optimized code paths on both CUDA and XLA.

Recipe Summary

Technique Count Scope
FP32 norm modules 25 All RMSNorm layers
FP32 critical layers 2 First + last transformer layers
FP32 softmax modules 0 Skipped — SDPA incompatible

Data Pipeline

Training data sources, curriculum design & preprocessing details

Training used a web/code/math curriculum with the following source mix:

Source Dataset Ratio
Web epfml/FineWeb-HQ (CC-MAIN-2024-51) 75%
Code Cyrile/dataset-the-stack-v2-dedup-sub (Python) 15%
Math HuggingFaceTB/finemath (finemath-4plus) 10%

Total tokens processed: ~8,500,000,000 (single epoch over source data, full 32,425-step run completed)

Curriculum Design

Training used a curriculum anneal over the final 15% of the token budget, upweighting code and math relative to web text. This front-loads web generalization while ensuring the model sees a higher concentration of structured/formal content near the end of training.
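One way to picture such an anneal is as a sampling-weight schedule over training progress. The sketch below is illustrative only: the card states the base mix and that the anneal covers the final 15% of the token budget, but it does not give the annealed ratios or the ramp shape. `FINAL_MIX` and the linear ramp are therefore hypothetical placeholders.

```python
BASE_MIX = {"web": 0.75, "code": 0.15, "math": 0.10}  # from the source-mix table
# Hypothetical end-of-training mix: the card does not state the annealed ratios.
FINAL_MIX = {"web": 0.50, "code": 0.30, "math": 0.20}
ANNEAL_START = 0.85  # the anneal covers the final 15% of the token budget


def mix_at(progress: float) -> dict[str, float]:
    """Sampling weights at a given fraction (0..1) of the token budget."""
    if progress <= ANNEAL_START:
        return dict(BASE_MIX)
    # Assumed linear ramp from BASE_MIX to FINAL_MIX over the anneal window.
    t = min((progress - ANNEAL_START) / (1.0 - ANNEAL_START), 1.0)
    return {k: (1 - t) * BASE_MIX[k] + t * FINAL_MIX[k] for k in BASE_MIX}


for p in (0.0, 0.85, 0.925, 1.0):
    print(p, {k: round(v, 3) for k, v in mix_at(p).items()})
```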

Text Preprocessing

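import unicodedata


def clean_text(text: str, preserve_linebreaks: bool = False) -> str:
    # Map compatibility characters to their canonical NFKC form.
    text = unicodedata.normalize("NFKC", text)
    # Normalize Windows and old-Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if preserve_linebreaks:
        # Code: keep line structure, strip trailing whitespace per line.
        lines = [line.rstrip() for line in text.splitlines()]
        text = "\n".join(lines).strip()
    else:
        # Web/math: drop empty lines and collapse all whitespace runs.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        text = " ".join(lines)
        text = " ".join(text.split())
    return text

  • NFKC normalization maps visually-equivalent Unicode to canonical form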

  • Linebreak preservation for code samples (not applicable to web/math)

  • Whitespace collapse for web/math text

Sequence Packing

Samples are packed into fixed 4,096-token blocks. Labels are identical to input_ids (causal LM objective). No cross-document attention masking is applied between packed samples — this is standard practice for web-text pretraining.
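A minimal sketch of this kind of packing is shown below. It is not the training script's implementation: the separator token (`eos_id`) and the choice to drop the trailing partial block are assumptions, and `pack_sequences` is an illustrative name. The example uses a block size of 4 so the result is easy to inspect.

```python
from typing import Iterable, Iterator


def pack_sequences(docs: Iterable[list[int]], block_size: int = 4096,
                   eos_id: int = 0) -> Iterator[dict[str, list[int]]]:
    """Concatenate tokenized documents and cut them into fixed-size blocks."""
    buffer: list[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        buffer.append(eos_id)  # assumed document separator
        while len(buffer) >= block_size:
            block, buffer = buffer[:block_size], buffer[block_size:]
            # Labels equal input_ids; Hugging Face causal LMs shift them internally.
            yield {"input_ids": block, "labels": list(block)}
    # Trailing tokens shorter than block_size are dropped (an assumption).


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = list(pack_sequences(docs, block_size=4))
for b in blocks:
    print(b["input_ids"])
```

The second block spans two documents, which is where the absence of cross-document attention masking applies.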


Weight Initialization

import math

import torch.nn as nn


def initialize_weights(model, std=0.02, num_hidden_layers=12):
    # Residual-path projections get a smaller std, scaled by depth.
    residual_std = std / math.sqrt(2.0 * num_hidden_layers)  # ≈ 0.00408 for 12 layers
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
        elif isinstance(module, nn.Linear):
            # Scaled-down std for output projections (residual path)
            proj_std = residual_std if name.endswith(("o_proj", "down_proj")) else std
            module.weight.data.normal_(mean=0.0, std=proj_std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif "rmsnorm" in type(module).__name__.lower():
            if module.weight is not None:
                module.weight.data.fill_(1.0)

  • Residual projections (o_proj, down_proj) use scaled-down std (0.02 / sqrt(2 × 12) ≈ 0.00408) to prevent residual stream explosion at initialization, following the GPT-2 convention.

  • All other Linear layers use std=0.02.

  • RMSNorm scales start at 1.0 (identity).


Evaluation & Results

Training loss & perplexity curves, family comparison, full checkpoint history

Training Loss Curve

[Figure: training loss curve]

Validation Perplexity Curve

[Figure: validation perplexity curve]

Final result: best validation loss 2.6406 — perplexity 14.02.
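The two numbers are linked by the standard definition: perplexity is the exponential of the mean per-token cross-entropy loss, measured in nats. A short check, assuming the reported loss is that mean:

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


print(f"{perplexity(2.6406):.2f}")  # prints 14.02
```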

Full Checkpoint History

Step Eval Loss Perplexity Notes
1,000 4.8125 ~122.9 First best checkpoint
2,000 4.2812 ~72.4
3,000 4.0625 ~58.2
4,000 3.6875 ~39.9
5,000 3.3906 ~29.7
6,000 3.2812 ~26.6
7,000 3.1719 ~23.9
8,000 3.0938 ~22.1
10,000 3.0156 ~20.4
11,000 2.9375 ~18.9
12,000 2.9219 ~18.6
13,000 2.8906 ~18.0
14,000 2.8750 ~17.7
15,000 2.8594 ~17.4
17,000 2.7812 ~16.1
18,000 2.7656 ~15.9
20,000 2.7031 ~14.9
23,000 2.6875 ~14.7
24,000 2.6719 ~14.5
29,000 2.6406 14.02 Best checkpoint saved
Final (step 32,425) 2.6406 14.02 Full run completed

Benchmark Results

All benchmarks are 0-shot unless otherwise noted. GSM8K was run at both 0-shot and 5-shot.

Stentor3-20M Benchmarks

Common Sense Reasoning

Benchmark Metric Score
Winogrande acc 47.51%
HellaSwag acc_norm 26.17%
CommonsenseQA acc 19.57%

World Knowledge & Science

Benchmark Metric Score
ARC-Easy acc_norm 33.75%
ARC-Challenge acc_norm 22.53%
ARC Average 28.14%
OpenBookQA acc_norm 28.00%
TruthfulQA (MC2) acc 49.00%

Multitask Knowledge (MMLU)

Benchmark Metric Score
MMLU Average acc 22.95%
MMLU — Humanities acc [placeholder]
MMLU — Other acc [placeholder]
MMLU — Social Sciences acc [placeholder]
MMLU — STEM acc [placeholder]

Mathematics

Benchmark Metric Score
GSM8K (5-Shot) exact_match 0.15%
GSM8K (0-Shot) exact_match 0.76%

Comparison Against Peer Models

All models were evaluated by me under identical conditions — same script, same hardware (Kaggle 2× T4), same lm-eval-harness version, same batch size, same seed. All Stentor3-20M scores are 0-shot except GSM8K 5-shot. All peer model scores were produced under the same conditions. Portimbria-150M was re-evaluated with this script rather than reusing numbers from its model card; if any scores match the card, that means the model produced identical results under both evaluation setups.

Bold = best in column. Underline = 2nd best. Avg↑Random = mean of (score − baseline) across all 10 benchmarks; the TruthfulQA baseline is the lowest model score in the table (DistilGPT2, 0.4447), since TruthfulQA MC2 has no random-chance floor.

| Model | OpenBookQA (acc_norm) | TruthfulQA (acc) | Winogrande (acc) | CommonsenseQA (acc) | HellaSwag (acc_norm) | ARC-Easy (acc_norm) | ARC-Challenge (acc_norm) | MMLU (acc) | GSM8K 0-sh (flexible) | GSM8K 5-sh (flexible) | Avg↑Random |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stentor3-50M | 0.2640 | 0.5000 | 0.5028 | 0.1957 | 0.2638 | 0.2950 | 0.2406 | 0.2295 | 0.0061 | 0.0197 | 0.0123 |
| Stentor3-20M | 0.2800 | 0.4900 | 0.4751 | 0.1957 | 0.2617 | 0.3375 | 0.2253 | 0.2295 | 0.0076 | 0.0015 | 0.0109 |
| DistilGPT2 | 0.2680 | 0.4447 | 0.5043 | 0.1957 | 0.2751 | 0.3674 | 0.2287 | 0.2290 | 0.0129 | 0.0159 | 0.0147 |
| Pythia-31M | 0.2660 | 0.4695 | 0.4925 | 0.1957 | 0.2726 | 0.3426 | 0.2133 | 0.2292 | 0.0106 | 0.0121 | 0.0110 |
| Pythia-70M | 0.2520 | 0.4701 | 0.5185 | 0.1966 | 0.2741 | 0.3472 | 0.2193 | 0.2296 | 0.0061 | 0.0076 | 0.0127 |
| TinyStories-33M | 0.2480 | 0.4764 | 0.5241 | 0.1884 | 0.2716 | 0.2681 | 0.2329 | 0.2448 | 0.0083 | 0.0068 | 0.0075 |
| Portimbria-150M | 0.2460 | 0.4694 | 0.5288 | 0.1990 | 0.2745 | 0.3375 | 0.2253 | 0.2317 | 0.0061 | 0.0197 | 0.0144 |
| Random / Baseline | 0.250 | 0.4447 | 0.500 | 0.200 | 0.250 | 0.250 | 0.250 | 0.250 | | | |
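The Avg↑Random column can be recomputed from the per-benchmark scores. The sketch below does this for Stentor3-20M using the values in the table. The baseline row leaves the two GSM8K cells blank, so a baseline of 0.0 is assumed for both. The dictionary keys and the function name are illustrative.

```python
BASELINES = {
    "openbookqa": 0.25, "truthfulqa": 0.4447, "winogrande": 0.50,
    "commonsenseqa": 0.20, "hellaswag": 0.25, "arc_easy": 0.25,
    "arc_challenge": 0.25, "mmlu": 0.25,
    # Assumption: the baseline row leaves GSM8K blank; 0.0 is used here.
    "gsm8k_0shot": 0.0, "gsm8k_5shot": 0.0,
}

STENTOR3_20M = {
    "openbookqa": 0.2800, "truthfulqa": 0.4900, "winogrande": 0.4751,
    "commonsenseqa": 0.1957, "hellaswag": 0.2617, "arc_easy": 0.3375,
    "arc_challenge": 0.2253, "mmlu": 0.2295,
    "gsm8k_0shot": 0.0076, "gsm8k_5shot": 0.0015,
}


def avg_over_random(scores: dict, baselines: dict = BASELINES) -> float:
    """Mean of (score - baseline) over all benchmarks."""
    return sum(scores[k] - baselines[k] for k in baselines) / len(baselines)


print(f"{avg_over_random(STENTOR3_20M):.4f}")  # prints 0.0109
```

With a zero GSM8K baseline the result reproduces the 0.0109 reported for Stentor3-20M.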

Analysis

OpenBookQA: Stentor3-20M leads at 0.2800 — highest in the table, notably beating all larger models including Stentor3-50M (0.2640) and Portimbria-150M (0.2460).

ARC-Easy: Stentor3-20M scores 0.3375, tied with Portimbria-150M for 4th behind DistilGPT2, Pythia-70M, and Pythia-31M. Competitive for a 20M model.

TruthfulQA MC2: 0.4900, second only to Stentor3-50M (0.5000) — a strong result for this scale.

GSM8K: Near-zero across all models at this scale. Math reasoning at 20M parameters is essentially absent.

Where Stentor3-20M trails: Winogrande, HellaSwag, and GSM8K 5-shot. These are data-hungry tasks where the larger token budgets of Pythia (300B) and more general web-text coverage pull ahead. At 20M parameters and ~8.5B tokens (of which only ~6.4B are English web text), the model has less capacity and fewer training examples than the Pythia series for these pattern-completion tasks.

Overall: Stentor3-20M competes effectively with models 1.5–3× its size on several benchmarks. The Avg↑Random of 0.0109 places it near Pythia-31M (0.0110) despite having fewer parameters, fewer total training tokens, and a much smaller vocabulary.

Evaluation Setup (for Stentor3-20M)

Benchmarks were run on Kaggle with 2× Tesla T4 GPUs using the script below. Blocks are checkpointed to run_summary.json so a single failure never stops the rest, and completed blocks are skipped on re-run.

import os, sys, subprocess, json, time, re, threading, random, glob
from pathlib import Path
from datetime import datetime

# ── Environment & Networking ──────────────────────────────────────────────────
os.environ["TOKENIZERS_PARALLELISM"]      = "false"
os.environ["CUDA_LAUNCH_BLOCKING"]        = "0"
os.environ["NCCL_P2P_DISABLE"]            = "1"
os.environ["NCCL_IB_DISABLE"]             = "1"
os.environ["NCCL_SHM_DISABLE"]            = "1"
os.environ["NCCL_SOCKET_IFNAME"]          = "eth0"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"]   = "1"

# ── Install Deps ──────────────────────────────────────────────────────────────
print(f"\n{'='*75}", flush=True)
print("  [Setup] Installing dependencies (this may take a minute)...", flush=True)

subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "accelerate", "transformers", "hf_transfer",
                "tokenmonster", "diskcache", "litellm[caching]",
                "einops", "sentencepiece", "tiktoken"],
               check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U",
                "git+https://github.com/EleutherAI/lm-evaluation-harness.git"],
               check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

print("  ✅  Dependencies installed.", flush=True)

# ── Patch lm_eval (TokenMonster fix) ─────────────────────────────────────────
get_path_cmd = [sys.executable, "-c", "import lm_eval.models.huggingface as hf; print(hf.__file__)"]
path_result = subprocess.run(get_path_cmd, capture_output=True, text=True)

if path_result.returncode == 0:
    hf_model_file = path_result.stdout.strip()
    try:
        with open(hf_model_file, "r") as f:
            code = f.read()

        safe_code = code.replace(
            "assert len(continuation_enc) > 0",
            "continuation_enc = continuation_enc if len(continuation_enc) > 0 else [0]; assert len(continuation_enc) > 0"
        )

        with open(hf_model_file, "w") as f:
            f.write(safe_code)

        print("  ✅  Patched lm_eval safely.", flush=True)

    except Exception as e:
        print(f"  ⚠️  Could not patch lm_eval: {e}", flush=True)
else:
    print("  ⚠️  Could not locate lm_eval file.", flush=True)

# ── HF Token Auth ─────────────────────────────────────────────────────────────
hf_token = os.environ.get("HF_TOKEN")

if "KAGGLE_KERNEL_RUN_TYPE" in os.environ:
    try:
        from kaggle_secrets import UserSecretsClient
        hf_token = UserSecretsClient().get_secret("HF_TOKEN")
    except Exception:
        pass

if hf_token:
    os.environ["HF_TOKEN"] = hf_token
    subprocess.run(["huggingface-cli", "login", "--token", hf_token],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print("  ✅  HF authenticated.", flush=True)
else:
    print("  ⚠️  No HF_TOKEN found.", flush=True)

# ── Config ────────────────────────────────────────────────────────────────────
MODEL = "StentorLabs/Stentor3-20M"
DTYPE = "float16"
BATCH = "32"
SEED  = 42
OUT   = "./results"

MODEL_ARGS = f"pretrained={MODEL},dtype={DTYPE},trust_remote_code=True"
SUMMARY_PATH = f"{OUT}/run_summary.json"

# ── Pre-download Model ────────────────────────────────────────────────────────
print(f"  [Setup] Pre-downloading {MODEL}...", flush=True)

pre_dl_cmd = [
    sys.executable, "-c",
    f"from transformers import AutoConfig, AutoTokenizer; "
    f"AutoConfig.from_pretrained('{MODEL}', trust_remote_code=True); "
    f"AutoTokenizer.from_pretrained('{MODEL}', trust_remote_code=True)"
]

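pre_dl = subprocess.run(pre_dl_cmd, capture_output=True, text=True)

if pre_dl.returncode == 0:
    print("  ✅  Config and tokenizer cached.", flush=True)
else:
    print("  ⚠️  Pre-download failed; lm_eval will retry the download.", flush=True)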
print(f"{'='*75}\n", flush=True)

# ── Evaluation Blocks ─────────────────────────────────────────────────────────
BLOCKS = [
    ("block1", "OpenBookQA · TruthfulQA", "openbookqa,truthfulqa_mc2", 0, None),
    ("block2", "Winogrande · CommonsenseQA",     "winogrande,commonsense_qa",      0, None),
    ("block3", "HellaSwag",                      "hellaswag",                      0, None),
    ("block4", "ARC-Easy · ARC-Challenge",       "arc_easy,arc_challenge",         0, None),
    ("block5", "MMLU",                           "mmlu",                           0, None),
    ("block6", "GSM8K (0-shot)",                 "gsm8k",                          0, None),
    ("block7", "GSM8K (5-shot)",                 "gsm8k",                          5, None),
]

LAUNCH_BASE = [
    "accelerate", "launch",
    "--multi_gpu",
    "--num_processes=2",
    "--mixed_precision=fp16",
]

# ── Helpers ───────────────────────────────────────────────────────────────────
ANSI_ESCAPE   = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
DEBUGGER_NOISE = re.compile(r"(Debugger warning|frozen modules|PYDEVD|make the debugger|pass -X|Note: Debugging)")

log_lock  = threading.Lock()
log_state = {"last_print": 0.0}

def ts():
    return datetime.now().strftime("%H:%M:%S")

def stream(proc):
    def _read(pipe):
        for raw in iter(pipe.readline, ""):
            line = ANSI_ESCAPE.sub('', raw.strip())
            if not line or DEBUGGER_NOISE.search(line):
                continue
            is_progress = any(x in line for x in ["%|", "it/s", "s/it", "MB/s", "GB/s", "Block"])
            with log_lock:
                now = time.time()
                if is_progress:
                    if now - log_state["last_print"] > 2.0:
                        print(f"  [{ts()}] {line}", flush=True)
                        log_state["last_print"] = now
                else:
                    print(f"  [{ts()}] {line}", flush=True)

    t1 = threading.Thread(target=_read, args=(proc.stdout,), daemon=True)
    t2 = threading.Thread(target=_read, args=(proc.stderr,), daemon=True)
    t1.start(); t2.start()
    proc.wait()
    t1.join(); t2.join()

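def extract_metrics(output_dir):
    # Preferred metric per task, in priority order.
    wanted = ("acc_norm", "acc", "exact_match", "word_perplexity")
    extracted = {}
    for file_path in glob.glob(f"{output_dir}/**/*.json", recursive=True):
        try:
            with open(file_path) as f:
                data = json.load(f)
            results = data.get("results", data.get("metrics", {}))
            for task, metrics in results.items():
                if not isinstance(metrics, dict):
                    continue
                # lm_eval 0.4+ suffixes metric keys with a filter name
                # (e.g. "acc,none"), so match on the part before the comma.
                by_name = {}
                for key, value in metrics.items():
                    if isinstance(value, (int, float)):
                        by_name.setdefault(key.split(",")[0], value)
                # Explicit membership test so a legitimate score of 0.0 is kept.
                score = next((by_name[m] for m in wanted if m in by_name), None)
                if score is not None:
                    extracted[task] = round(score, 4)
        except Exception:
            pass
    return extracted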

# ── Run ───────────────────────────────────────────────────────────────────────
Path(OUT).mkdir(parents=True, exist_ok=True)

summary = {}
if os.path.exists(SUMMARY_PATH):
    try:
        with open(SUMMARY_PATH) as f:
            summary = json.load(f)
    except Exception:
        summary = {}

for i, (name, title, tasks, fewshot, extra) in enumerate(BLOCKS, 1):

    if summary.get(name, {}).get("status") == "ok":
        print(f"⏭️  Skipping {title} (already done)")
        continue

    print(f"\n{'='*75}")
    print(f"  BLOCK {i}/{len(BLOCKS)} · {title}")
    print(f"{'='*75}\n")

    port = str(29500 + random.randint(1, 999))
    cmd = LAUNCH_BASE + [
        "--main_process_port", port,
        "-m", "lm_eval",
        "--model", "hf",
        "--model_args", MODEL_ARGS,
        "--batch_size", BATCH,
        "--seed", str(SEED),
        "--tasks", tasks,
        "--num_fewshot", str(fewshot),
        "--output_path", f"{OUT}/{name}_lmeval",
    ]

    t0 = time.time()
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        stream(proc)
        elapsed = round((time.time() - t0) / 60, 1)

        if proc.returncode == 0:
            metrics = extract_metrics(f"{OUT}/{name}_lmeval")
            summary[name] = {"status": "ok", "elapsed_min": elapsed, "metrics": metrics}
            print(f"✅ Done in {elapsed} min")
        else:
            summary[name] = {"status": "failed", "exit_code": proc.returncode, "elapsed_min": elapsed}
            print(f"❌ Failed ({proc.returncode})")

    except Exception as e:
        summary[name] = {"status": "failed", "error": str(e)}

    with open(SUMMARY_PATH, "w") as f:
        json.dump(summary, f, indent=2)

    time.sleep(2)

# ── Final Summary ─────────────────────────────────────────────────────────────
passed = sum(1 for v in summary.values() if v.get("status") == "ok")
print(f"\n{'='*75}")
print(f" DONE — {passed}/{len(BLOCKS)} runs succeeded")
print(f"{'='*75}")

for k, v in summary.items():
    icon = "✅" if v.get("status") == "ok" else "❌"
    print(f"  {icon} {k:<15} {v.get('elapsed_min', '—')} min")

print(f"\nSummary saved → {SUMMARY_PATH}\n")

if any(v.get("status") == "failed" for v in summary.values()):
    sys.exit(1)
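
Once the script finishes, run_summary.json holds one entry per block. The following is a minimal sketch for flattening that file into per-task rows. It assumes the structure written by the script above (`status`, `elapsed_min`, `metrics`); the helper names `summary_rows` and `load_summary` are hypothetical.

```python
import json
from pathlib import Path


def summary_rows(summary):
    """Flatten a run_summary-style dict into (block, task, score) rows.

    Only blocks whose status is "ok" carry a "metrics" dict, so failed
    blocks are skipped.
    """
    rows = []
    for block, info in sorted(summary.items()):
        if info.get("status") != "ok":
            continue
        for task, score in sorted(info.get("metrics", {}).items()):
            rows.append((block, task, score))
    return rows


def load_summary(path="./results/run_summary.json"):
    """Read the summary file; return {} if it does not exist yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


if __name__ == "__main__":
    for block, task, score in summary_rows(load_summary()):
        print(f"{block:<8} {task:<20} {score:.4f}")
```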

Model Outputs

10 real generation samples — 4 average, 3 cherry-picked good, 3 cherry-picked bad

All generations used top_p=0.85, temperature=0.6, max_new_tokens=200, and repetition_penalty=1.25. Outputs that stop mid-sentence hit the 200-token limit; that is a consequence of the token budget, not a model failure. Samples 1–4 are average responses (the first four collected, with no cherry-picking), samples 5–7 are cherry-picked good, and samples 8–10 are cherry-picked bad.
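
To make those settings concrete, here is a minimal pure-Python sketch of how repetition penalty, temperature, and top-p filtering reshape a next-token distribution. It follows the commonly used order (penalty, then temperature, then nucleus filtering) and is an illustration on toy logits, not the code that produced the samples; the function names are hypothetical.

```python
import math


def apply_repetition_penalty(logits, seen_ids, penalty=1.25):
    """Penalize tokens that were already generated (CTRL-style rule)."""
    out = list(logits)
    for i in set(seen_ids):
        # Positive logits are divided and negative ones multiplied,
        # so the token always becomes less likely.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def softmax(logits, temperature=0.6):
    """Temperature-scaled softmax; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.85):
    """Keep the smallest set of tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


# Toy example: 4-token vocabulary, token 0 was already generated.
logits = [2.0, 1.0, 0.5, -1.0]
penalized = apply_repetition_penalty(logits, seen_ids=[0])
candidates = top_p_filter(softmax(penalized))
print(candidates)  # surviving token ids with renormalized probabilities
```

Sampling then draws one token id from `candidates`. A repetition penalty of 1.25 is fairly strong, which helps explain why the loops in the samples below are phrase-level rather than single-token repeats.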


Average responses

1. Prompt: "Exercise can improve strength and energy levels. Many people walk, run, or play sports every week. Regular physical activity is often connected to better health. Exercise is important for"

"making, which helps in maintaining the balance of muscle and body temperature. The use of the physical activity in different parts of the body is very effective in preventing any other diseases. The use of the physical activity in different parts of the body is very effective in the treatment of pain. The technique involves the use of the physical activity, which is essential for the maintaining the balance of muscles and bones. It can be used as a tool for controlling muscles and muscles, as it helps in maintaining the balance of muscles and bones. This can be done by using a physical activity in different parts of the body. The physical activity is simple and can be done by doing a small amount of exercise. It can be used to control muscle mass, which also helps in maintaining muscle mass. The physical activity in these parts of the body is more effective in reducing muscle mass. The physical activity in these parts of the body"

✅ Good: Stays broadly in the health/exercise domain. Produces grammatical English throughout. Correctly associates exercise with muscles, balance, and pain management.

❌ Bad: Repetitive: variants of "physical activity in different parts of the body" appear five times. Begins mid-sentence ("making,"). Wanders from exercise to pain treatment to muscle control without meaningful progression.


2. Prompt: "The internet is used for communication, entertainment, and education. Many students read articles and watch videos online every day. Websites provide information on many different topics. People use the internet to"

"When you read your website, you are not aware of any content that comes from textual or other sources. The Internet is used to communicate with users. Some of the most common types of Internet are: - Today, the Internet has become a popular choice for people seeking information about their web sites. This type of Internet is used by people who can help them understand what they want them to do in their domains. This is why it is useful in searching for relevant information about the internet. - Today, it can be useful for people seeking information about their web sites. - Today, it is also useful for individuals to seek information about their web sites. For example, the Internet can help you find information about your web sites, and get information about your web sites. - Today, this type of Internet is used by people who know who they are and how they want to do things. - Today, it can be useful for those who have access"

✅ Good: Recognizes the internet domain. Produces a list format with "- Today," bullets, showing awareness of document structure. Grammatically clean.

❌ Bad: "Today" opens five bullets in a near-identical structure, a repetition pattern rather than meaningful variation. "Web sites" is used repeatedly in a narrow, circular way. The opening sentence ("When you read your website") redirects the prompt oddly.


3. Prompt: "Music is listened to in many countries around the world. Different styles of music are popular with different groups of people. Songs are often used for relaxation, celebration, or entertainment. Music is popular because"

"The use of the word 'song' is a powerful tool that can help you understand and appreciate the meaning of the words. The use of the word 'song' is not just a means to add a song to your musical style. It is also an effective tool that can help you to find a song for your musical style. It also allows you to choose from various forms of music that can be used to create your music. Here are a few tips to help you visualize these words: 1. Writing songs for musical instruments: When you are writing songs for a certain time, you will be able to read it in one place. You will need to have some time to write music. 2. Have a good reading plan. 3. Have a list of all the musical instruments that you use in your music. 4. Have a list of all the musical instruments that you use in your music. 5. Have a list of all the instruments that you use in your music. 6. Have a great review. 7. Have a"

✅ Good: Stays on the music topic throughout. Transitions into a numbered how-to list, which is a coherent document structure. Points 1–3 are at least related to music.

❌ Bad: Points 3 and 4 are identical ("Have a list of all the musical instruments that you use in your music"). The opening sentence repeats "The use of the word 'song'" twice in near-identical form.


4. Prompt: "Books have been used for learning and entertainment for many years. Libraries contain information about science, history, and literature. Reading can improve vocabulary and understanding. Books are useful because"

"lines can help to identify the words in the text. A variety of sources include: - Analysis of the words in the text - Analyze the word from ancient times - Recommendate the term in a sentence by using a combination of different languages such as ALTR, ASP, or TEMP. - Experiment with various languages such as TIA, IBM, etc. - Examples of a specific language that has multiple meanings in one language. - Explore the meaning of the word. - Description of phrases in other languages. The words are also used when speaking English. - Scala or Scala: A word that is often used to refer to a particular language. The word is usually used to describe the language. - Subject: A word that is often used to describe the language. - Outside: A word that is often used to describe the language. The"

✅ Good: Produces a consistent list format. Grammatically intact.

❌ Bad: Completely off-topic — the model pivoted from books to a glossary of programming/technical language acronyms (ALTR, ASP, TIA, IBM, Scala). "Recommendate" is not a word. The prompt is about books and reading; the output has nothing to do with either.


Cherry-picked good

5. Prompt: "Healthy food provides vitamins and nutrients for the body. Fruits, vegetables, and proteins are common parts of balanced meals. Drinking enough water is also important for daily health. A healthy diet usually includes"

"A food that contains essential nutrients, such as vitamin B6, minerals, omega-3 fatty acids, and nuts, is high in fiber, vitamin C, and minerals. It is important to keep in mind that the supplement is only as simple as possible. Consuming a healthy diet with low-calorie-rich foods can help you achieve your goals. However, it is recommended to maintain proper calories in your diet for healthy weight loss. It can be convenient to enhance the overall health of your body, reducing your risk of heart disease. It is also important to keep in mind that low-calorie foods are great sources of vitamins and nutrients for healthy weight loss. To ensure that your body has regular exercise, you should make sure that you have healthy fats and nutri"

✅ Good: Correctly names real nutrients (vitamin B6, omega-3, fiber, vitamin C). Stays on the diet/health topic throughout. Produces a more coherent flow than the average samples.

❌ Bad: The phrase "low-calorie-rich foods" is self-contradictory.


6. Prompt:

import random
import numpy as np
numbers = [1, 2, 3]
array.sort_values(by="r")
for i in range(len(numbers)):
    print('Number of positive integers:', number[0] + 1)
    print(numbers[i] - 1)
    print(numbers[i] - 1)
    print()
#print(numbers)
 for i in range(len(numbers)):
    #print(numbers[i])
    print(numbers[i] - 1)
    print(numbers[i] - 1)
    #print(numbers[i] - 1)
    print(numbers[i] - 1)
if __name__ == '__main__':
    print("Search for "+ str(numbers)) transformation

✅ Good: The model correctly continues the code prompt with Python syntax. It uses the numbers variable from the prompt, writes a valid for loop structure, uses range(len(...)), includes # comments, and correctly formats an if __name__ == '__main__': block. This is recognizable, structured Python for a model trained on ~1.2B code tokens.

❌ Bad: array.sort_values(by="r") references a variable that doesn't exist. number[0] should be numbers[0]. The inner loop is a copy of the outer loop with no meaningful difference. transformation at the end is a stray token. The code would not run correctly, but the structure is there.


7. Prompt: ";kd.ks'xsdlkfa'jmclsgdf'z.s" (intentionally nonsensical to test gibberish continuation)

"7. Use data to determine the relationship between the variables and data. 8. Problem solving. 9. Use data to identify the relationship between two variables. 10. Problem solved."

✅ Good: Rather than continuing the gibberish, the model pivoted to producing a coherent numbered list — grammatically clean and internally consistent. It is not instruction-tuned so there was no expectation it would identify the input as gibberish; the goal was coherent output regardless of input, and it delivered that. The response is short and self-contained.

❌ Bad: The numbered list starts at 7, implying a continuation of something that doesn't exist. The content (data analysis steps) has nothing to do with the input and reflects the code/data corpus bleeding through, not actual understanding of the prompt.


Cherry-picked bad

8. Prompt: "Hi!!!"

"UnitValues. The value of the data can be calculated by dividing the number of values by the number of values and the number of values for the variable. For example, the data can be calculated using the following formula: The value of each value le is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable"

✅ Good: Grammatically coherent for the first two sentences.

❌ Bad: "Hi!!!" triggers a data-analysis response with no connection to the prompt. Enters a repetition loop ("The value of the variable is given by:" repeated six times). "value le" is a broken word. This is the clearest example of how short, ambiguous prompts destabilize the model.


9. Prompt: "1 + 1 ="

"# of variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable is given by: The value of the variable he"

✅ Good: Recognizes a math context and responds in a math/formula register.

❌ Bad: Pure repetition loop — "The value of the variable is given by:" repeated twelve times. Cannot do arithmetic. The response has no relationship to the answer 2.


10. Prompt: "Import"

"bed., a randomly_value=True) bed., a , ❑❑❑❑❑❑❑❑❑❑ > 0.1.85, ❑❑❑❑❑ [], ❑❑❑❑❑ @nyagridcore,❑❑❑❑❑ #####################"

✅ Good: The model correctly interprets "Import" as a code token and responds with code-like output. randomly_value=True and @nyagridcore are code-shaped, even if invalid.

❌ Bad: Contains multiple runs of the Unicode replacement character U+FFFD (shown above as ❑), the most visible occurrence of this artifact in testing. The output is not valid code or valid text. "0.1.85" is not a valid version string. "bed." is a stray token. This is the model's worst-case output: token artifacts, broken syntax, and garbage characters all at once.


Training Dynamics

Step-by-step training phase breakdown & throughput details

The training run processed approximately 8.5 billion tokens across a single epoch (epoch 0), running for 32,425 optimizer steps — the full planned run, completed without hitting the session deadline.

Early training (steps 0–1,621, warmup phase): LR ramped linearly from 0 to peak. Loss dropped quickly from above 5.0. First best checkpoint recorded at step 1,000 (eval loss 4.8125).

Mid training (steps 1,621–32,425, cosine decay phase): Smooth and consistent loss reduction. Gradient norms stayed in the 0.34–0.60 range for most steps, with two notable spikes: step 13,200 (grad_norm 6.625), which recovered cleanly in the next logged step with no apparent impact on the loss trajectory, and step 29,700 (grad_norm 13.8125), which also recovered cleanly. Across the run, new best checkpoints were recorded at steps 1,000 / 2,000 / 3,000 / 4,000 / 5,000 / 6,000 / 7,000 / 8,000 / 10,000 / 11,000 / 12,000 / 13,000 / 14,000 / 15,000 / 17,000 / 18,000 / 20,000 / 23,000 / 24,000 / 29,000.
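
The two phases above correspond to a linear-warmup, cosine-decay schedule. The sketch below reproduces its shape from the step counts given here. The peak learning rate and the floor the cosine decays to are not stated in this section, so `peak_lr` and `min_lr` are placeholder assumptions.

```python
import math

WARMUP_STEPS = 1_621   # from the text: warmup ends at step 1,621
TOTAL_STEPS = 32_425   # from the text: full run length


def lr_at(step, peak_lr=1.0, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    peak_lr and min_lr are placeholders (assumptions), so with the
    defaults the return value is a fraction of the peak.
    """
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 1_000, WARMUP_STEPS, 13_200, 29_700, TOTAL_STEPS):
    print(f"step {s:>6}: {lr_at(s):.3f} x peak")
```

Warmup covers 1,621 of 32,425 steps, or about 5% of the run.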

Run completion: The full 32,425-step run completed within the session window. The best model (step 29,000, eval loss 2.6406) had already been saved. Final eval loss 2.6406, perplexity 14.02.
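
Perplexity here is the exponential of the mean per-token cross-entropy loss (in nats), so the reported figure can be checked directly from the eval loss. A minimal check, using only the two loss values quoted in this section:

```python
import math


def perplexity(eval_loss):
    """Perplexity from a mean per-token cross-entropy loss in nats."""
    return math.exp(eval_loss)


final_ppl = perplexity(2.6406)   # best and final eval loss from the run
first_ppl = perplexity(4.8125)   # first best checkpoint, step 1,000
print(f"final: {final_ppl:.2f}, step 1,000: {first_ppl:.2f}")
```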

Throughput: 304,000 global tokens/sec average (38,000 per chip), with a brief XLA warmup window reset at step 600.

Total wall-clock time: ~8.15 hours (29,323s total, including eval, save, and subprocess overhead).
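
The throughput and wall-clock figures can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this document; because the 8.5B token count is approximate, the derived values are estimates.

```python
TOTAL_TOKENS = 8.5e9          # approximate, from the text
STEPS = 32_425
GLOBAL_TOK_PER_SEC = 304_000
CHIPS = 8                     # TPU v5e-8
WALL_CLOCK_SEC = 29_323

tokens_per_step = TOTAL_TOKENS / STEPS
per_chip = GLOBAL_TOK_PER_SEC / CHIPS
compute_sec = TOTAL_TOKENS / GLOBAL_TOK_PER_SEC
overhead_sec = WALL_CLOCK_SEC - compute_sec

print(f"tokens per optimizer step: ~{tokens_per_step:,.0f}")
print(f"per-chip throughput:       {per_chip:,.0f} tokens/sec")
print(f"pure training time:        ~{compute_sec / 3600:.2f} h")
print(f"wall-clock time:           {WALL_CLOCK_SEC / 3600:.2f} h")
print(f"eval/save/other overhead:  ~{overhead_sec / 60:.0f} min")
```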


Use Cases & Intended Uses

Use Case | Suitability | Notes
Studying transformer training dynamics at 20M scale | ✅ High | Full architecture, hyperparameters, and training curves published
Benchmarking 4K-context inference latency | ✅ High | Realistic long-context workload
Quantization / conversion pipeline testing | ✅ High | Standard architecture, no custom ops
Teaching material for LLM courses | ✅ High | Fully documented, reproducible from scratch
Edge deployment experiments | ✅ High | ~41MB in FP16; extremely feasible on modern edge hardware
Domain-specific fine-tuning research | ✅ High | Standard transformers; fine-tune like any LLaMA model
Speculative decoding draft model | ✅ Medium | Fast enough to draft; uses the rare TokenMonster-4k vocab, so finding a vocab-matched target is difficult. A cross-vocabulary draft is possible but lowers the acceptance rate
Text continuation / creative writing | ✅ Medium | Good fluency; limited thematic fidelity
Code completion prototyping | ❌ Not suitable | Generates code-shaped output that does not run; see Honest Notices
Factual Q&A | ❌ Not suitable | Unreliable world knowledge at this scale
Production deployment | ❌ Not suitable | No safety tuning
Non-English text | ❌ Not suitable | Training data is English-heavy
Instruction following | ❌ Not suitable | Base model only
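
The edge-deployment row follows from simple weight-size arithmetic. A rough sketch, assuming a round 20M parameter count (the exact count is not given in this section) and ignoring non-weight file overhead:

```python
def weights_mb(n_params, bits_per_param):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bits_per_param / 8 / 1e6


N_PARAMS = 20_000_000  # assumption: round figure for a ~20M model

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:<10} ~{weights_mb(N_PARAMS, bits):.0f} MB")
```

The ~41MB quoted in the table is slightly above the 40 MB this gives for FP16, which is what a parameter count a little over 20M would produce.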

Out-of-Scope Uses

  • Any user-facing application — No safety filtering, no alignment, no factual reliability.

  • Medical, legal, or financial advice — Cannot reason reliably over specialized knowledge.

  • Generating content about real people — Will fabricate.

  • Automated content pipelines — Output quality is insufficient for unreviewed publication.

  • Instruction following — This is a base next-token predictor.


Ethical Considerations & Societal Impact

Data biases, safety considerations & societal impact

Inherited Data Biases

Trained on FineWeb-HQ, The Stack v2 dedup subset, and FineMath-4+ — all derived from web-scraped data. The model inherits:

  • Western-centric perspective — English-language web text skews toward Western viewpoints and cultural contexts.

  • English monolingualism — TokenMonster english-4096 is optimized for English. Other languages will tokenize with high fertility (many tokens per word) and produce poor-quality output.

  • Demographic underrepresentation — Groups underrepresented in English web text will be underrepresented in outputs.

  • Code ecosystem bias — The code training data covers Python only (Cyrile/dataset-the-stack-v2-dedup-sub). Other languages are not represented.
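
Fertility, mentioned in the list above, is the average number of tokens a tokenizer emits per word; higher values mean longer sequences and less text per context window. Here is a minimal sketch of the measurement, using a toy stand-in tokenizer (the `toy_tokenize` function and its tiny vocabulary are hypothetical, not TokenMonster):

```python
def fertility(text, tokenize):
    """Average tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def toy_tokenize(text, vocab=("the", "model", "reads", "text")):
    """Hypothetical tokenizer: known words are one token, others split into characters."""
    tokens = []
    for word in text.lower().split():
        tokens.extend([word] if word in vocab else list(word))
    return tokens


print(fertility("the model reads text", toy_tokenize))   # in-vocabulary words
print(fertility("das Modell liest Text", toy_tokenize))  # mostly out-of-vocabulary
```

To measure the real tokenizer, pass a function that wraps its encode call in place of `toy_tokenize`.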

No Safety Tuning

No RLHF, DPO, constitutional AI, or content filtering of any kind has been applied.

Positive Aspects

  • Democratizing AI research — Trained entirely on free Kaggle TPU compute.

  • Full transparency — Complete training hyperparameters, architecture, and training curves published.

  • Minimal environmental footprint — ~8 hours of TPU compute is negligible versus large-scale pretraining runs.


Inference Guide

CPU inference (INT8) & GPU inference (FP16) code

CPU Inference (INT8 Dynamic Quantization)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor3-20M")
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
# Dynamically quantize the Linear layers to INT8 for CPU inference
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(),
    {torch.nn.Linear},
    dtype=torch.qint8,
)
inputs = tokenizer("The laws of physics state that", return_tensors="pt")
with torch.inference_mode():
    output = model_int8.generate(**inputs, max_new_tokens=80, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

GPU Inference (FP16)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)
def generate(prompt, max_new_tokens=100, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(generate("Once upon a time in a distant kingdom"))

🚀 Free Inference — Try It Now

No GPU, no setup, no API key required.

StentorLabs hosts a free demo space for all Stentor models:

🔗 https://huggingface.co/spaces/StentorLabs/StentorLabs-demo_space

1. Core Functional Features

⚡ Generate Tab The primary interface for standard text completion.

  • Model Selection: Allows switching between the flagship Portimbria-150M, various Stentor3 and Stentor2 versions, and base Stentor models.
  • Generation Presets: Includes quick-toggle chips (Creative, Balanced, Focused) that automatically adjust Temperature, Top-P, and Max Tokens.
  • Multi-Response Mode: Users can generate up to 5 sequential completions for the same prompt to observe the variance in the model's logic.
  • Live Metrics: Displays real-time stats including total tokens generated, elapsed time, and speed in tokens per second (t/s).
  • Persistent History: Maintains a visual log of recent generations with timestamps and parameters.

🔬 Token Explorer Tab A unique diagnostic feature for understanding model internal confidence.

  • Probability Heatmap: Generates text and color-codes each token based on the model's confidence level (e.g., Green for ≥80%, Red for <35%).
  • Alternative Inspection: Hovering over any generated token reveals the top 8 alternatives the model was considering at that specific moment, along with their respective probabilities.
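
The mechanics behind this tab can be sketched in a few lines: convert next-token logits to probabilities, bucket the chosen token by confidence, and list the top alternatives. The green and red thresholds come from the description above; the color of the middle band and all function names are assumptions rather than the Space's actual code.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_color(prob):
    """Green at >= 80%, red below 35%; the middle band is an assumed 'yellow'."""
    if prob >= 0.80:
        return "green"
    if prob < 0.35:
        return "red"
    return "yellow"


def top_alternatives(logits, vocab, k=8):
    """Return the k most likely tokens with their probabilities, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:k]


# Toy 4-token vocabulary for illustration.
vocab = ["the", "a", "an", "this"]
logits = [3.0, 1.0, 0.5, 0.0]
best_token, best_prob = top_alternatives(logits, vocab, k=1)[0]
print(best_token, round(best_prob, 3), confidence_color(best_prob))
```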

🌡 Temperature Sweep Tab A visualization tool to study how "creativity" parameters affect output.

  • Parallel Generation: Runs the same prompt across 2 to 3 different temperature settings simultaneously.
  • Divergence Analysis: Allows users to see exactly where a model stays rigid (low temp) vs. where it becomes chaotic or creative (high temp).

🏟 Arena Tab A competitive benchmarking environment for comparing Stentor models against popular baselines.

  • Match Modes:
    • Show: Open comparison where model names are visible.
    • Blind: Identity is hidden until a vote is cast to prevent brand bias.
    • Random: Pairs a random Stentor model against a baseline (like Pythia, OPT, or GPT-2).
  • Fair Matchmaking: Implements logic to pair models of similar parameter counts (e.g., 20M vs 31M) to ensure a balanced comparison.
  • Voting System: Users can vote for the better response, a tie, or "both bad," which reveals the hidden identities.
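
One simple way to implement size-matched pairing is to restrict the opponent pool to baselines within a fixed size ratio of the Stentor model. The sketch below is an assumption about how such logic could look, not the Space's implementation. The 2× ratio and function names are hypothetical, and the registry lists only sizes that appear in model names mentioned in this document.

```python
import random

# Parameter counts in millions, taken from the model names themselves.
BASELINES = {
    "Pythia-14M": 14,
    "Pythia-31M": 31,
    "OPT-125M": 125,
    "SmolLM2-135M": 135,
    "Pythia-160M": 160,
}


def eligible_baselines(params_m, max_ratio=2.0, baselines=BASELINES):
    """Baselines whose size is within max_ratio of params_m, smallest first."""
    ok = [
        name for name, size in baselines.items()
        if max(size, params_m) / min(size, params_m) <= max_ratio
    ]
    return sorted(ok, key=lambda name: baselines[name])


def pick_opponent(params_m, rng=random):
    """Randomly choose one eligible baseline, or None if nothing is close enough."""
    pool = eligible_baselines(params_m)
    return rng.choice(pool) if pool else None


print(eligible_baselines(20))    # candidates for a 20M model
print(eligible_baselines(150))   # candidates for a 150M model
```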

💬 Chat Tab An interface for multi-turn dialogue.

  • Custom Prompting: Uses a tag-based formatting style to help the base models maintain conversational context despite not being explicitly instruction-tuned in many cases.
  • Session State: Uses gr.State to track message history across the conversation.

Supported Models

The app maintains a library of:

  • Internal Models: Portimbria-150M, Stentor3 (50M/20M), Stentor2 (30M/12M), and original Stentor (30M/12M + Instruct).
  • Arena Baselines: DistilGPT2, Pythia (14M through 160M), GPT-2 small, SmolLM2-135M, OPT-125M, and GPT-Neo.

Quantization

FP16, BF16 & 4-bit (bitsandbytes) quantization code

FP16 (GPU)

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)

BF16

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.bfloat16,
)

4-bit (bitsandbytes)

pip install bitsandbytes accelerate

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    quantization_config=bnb_config,
    device_map="auto",
)

🌍 Community Contributions — Build on This Model

Stentor3-20M is built by an independent solo researcher, not a large corporate AI lab. That means it doesn't have teams of engineers running downstream experiments — that's where you come in. This model is Apache 2.0 licensed and is explicitly intended to be modified, extended, and redistributed.

Here are things StentorLabs actively encourages the community to try:

  • Fine-tune it on your domain — instruction tuning, domain adaptation, RLHF, DPO, anything goes

  • Quantize it — 4-bit, 8-bit, GGUF, GPTQ, AWQ, ONNX, all highly encouraged

  • Convert it to other formats — GGUF for llama.cpp, ONNX for deployment, CoreML for Apple Silicon

  • Run LoRA or QLoRA to adapt it cheaply on consumer hardware

  • Use it for speculative decoding as a draft model (a larger Llama-family target needs cross-vocabulary drafting; see Speculative Decoding)

  • Benchmark it on other benchmarks not covered here, or re-run the evaluation script in a Kaggle 2×T4 environment to reproduce and verify the results

  • Publish your work — fine-tunes, quantized versions, adapters, research findings, derivative models, anything

If you build something with Stentor3-20M, please share it on HuggingFace and tag or link back to the base model. Every community result makes this model more useful for everyone.

LoRA / QLoRA Starter Configuration

Starter config, recommended hyperparameters & QLoRA note

If you haven't fine-tuned a Llama-family model before, here's a proven starting point for Stentor3-20M:

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor3-20M", trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # LoRA rank — try 32 if underfitting
    lora_alpha=32,           # alpha = 2× rank is a reliable default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → prints trainable vs. total parameter counts; the total should be ~20M for this model
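
The trainable-parameter count printed above can be estimated by hand: each adapted linear layer of shape (in, out) gains r × (in + out) LoRA weights. The sketch below applies that to the four attention projections under grouped-query attention with 10 query heads and 2 KV heads. The hidden size and layer count are not given in this section, so the values passed in the example are placeholders, not Stentor3-20M's real dimensions.

```python
def lora_trainable_params(hidden, n_layers, n_heads=10, n_kv_heads=2, r=16):
    """LoRA parameters added to the q/k/v/o projections across all layers."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim          # GQA: k and v project to fewer heads
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
    }
    # Each adapter is A (r x in) plus B (out x r).
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# Placeholder dimensions for illustration only (assumptions).
print(lora_trainable_params(hidden=320, n_layers=8))
```

Because k_proj and v_proj are narrow under GQA, most of the adapter weights sit in q_proj and o_proj.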

Recommended fine-tuning hyperparameters:

| Hyperparameter | Value | Notes |
|---|---|---|
| Learning rate | 3e-5 – 1e-4 | Small models are sensitive to high LRs during SFT |
| Optimizer | AdamW | betas=(0.9, 0.999), eps=1e-8 |
| LR scheduler | Cosine with warmup | ~5% warmup steps |
| Batch size | 8–32 | Per device; scale with gradient accumulation |
| Epochs | 2–5 | Watch for overfitting after epoch 2 |
| Max sequence length | 512–2048 | Up to 4096 is supported |
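To make the scheduler and batch-size rows concrete, here is a minimal pure-Python sketch of a cosine schedule with ~5% linear warmup, plus the usual effective-batch-size arithmetic. The function names and the peak learning rate of 5e-5 (picked from inside the recommended range) are assumptions for illustration; the sketch shows the shape of the curve and is not meant to reproduce any particular library's scheduler step for step.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device, grad_accum_steps, num_devices=1):
    """Sequences contributing to each optimizer update."""
    return per_device * grad_accum_steps * num_devices


total = 1000
schedule = [lr_at_step(s, total) for s in range(total)]
print(f"peak={max(schedule):.2e}  final={schedule[-1]:.2e}")
print("effective batch:", effective_batch_size(per_device=8, grad_accum_steps=4))
```

If the per-device batch has to shrink to fit memory, raising grad_accum_steps by the same factor keeps the effective batch size unchanged.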

For QLoRA (4-bit quantized base + LoRA adapters on top), pass quantization_config=BitsAndBytesConfig(load_in_4bit=True) to from_pretrained when loading the base model; the LoRA config and training hyperparameters above apply unchanged. This lets you fine-tune on a single consumer GPU with ~4–6 GB VRAM.


Format Conversion

Convert to GGUF (llama.cpp) & ONNX

Convert to GGUF (llama.cpp)

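git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && pip install -r requirements.txt

# Build the quantize and CLI binaries (they land in build/bin/)
cmake -B build && cmake --build build --config Release

# Download the checkpoint and convert it to an f16 GGUF
huggingface-cli download StentorLabs/Stentor3-20M --local-dir stentor3-20m
python convert_hf_to_gguf.py stentor3-20m/ \
    --outfile stentor3-20m.gguf \
    --outtype f16

# Quantize to Q4_K_M and run a quick generation
./build/bin/llama-quantize stentor3-20m.gguf stentor3-20m-q4_k_m.gguf q4_k_m
./build/bin/llama-cli -m stentor3-20m-q4_k_m.gguf -p "The history of computing" -n 100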

Convert to ONNX

pip install "optimum[exporters]"
optimum-cli export onnx \
    --model StentorLabs/Stentor3-20M \
    --task text-generation-with-past \
    stentor3-20m-onnx/

Speculative Decoding

Stentor3-20M is a compact draft model that is fast and small enough to meaningfully accelerate a larger target model's inference. The caveat is its tokenizer: TokenMonster-4k is not used by any mainstream LLM, so it is not vocabulary-compatible with Llama/Mistral/GPT-family models (which use 32K+ BPE vocabularies). Native speculative decoding (where the draft and target share token IDs) requires a vocabulary-compatible target.

Speculative decoding options & vocabulary compatibility notes

Option A — Use with a vocabulary-matched target (recommended)

For maximum acceptance rates, you need a target model that either uses the same TokenMonster 4K vocab or is fine-tuned from Stentor3-20M itself (e.g. an instruction-tuned version). When both models share token IDs, HuggingFace assisted generation works directly:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor3-20M",
    torch_dtype=torch.float16,
).to("cuda")

# Target must share the same TokenMonster 4K vocabulary
target_model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/YourVocabCompatibleModel",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor3-20M",
    trust_remote_code=True,
)

inputs = tokenizer("Explain the concept of recursion:", return_tensors="pt").to("cuda")
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    do_sample=True,
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option B — Cross-vocabulary draft (lower acceptance rate)

You can still use Stentor3-20M as a fast heuristic draft model alongside a BPE-vocabulary target (e.g. Llama-3.2-1B) using custom speculative decoding logic, but HuggingFace's built-in assistant_model will not work out of the box because token IDs will not align. This requires a vocabulary remapping layer or a custom decoding loop and is an active research area.

Vocabulary compatibility summary: Stentor3-20M uses a compact 4,096-token TokenMonster vocabulary. This is fundamentally different from the much larger BPE vocabularies (32K entries and up) used by Llama, Mistral, GPT-2, and most other public models. Built-in HuggingFace speculative decoding only works between models sharing an identical token-to-ID mapping.
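To see why shared token IDs matter, here is a toy greedy variant of the draft-then-verify loop. Both "models" are hypothetical stand-in functions that map a list of token IDs to the next token ID; nothing here is HuggingFace's implementation, and real speculative decoding with sampling uses a probabilistic accept/reject rule rather than exact matching. The point of the sketch is only that the output always equals the target's own greedy output, while the speedup depends entirely on how often the draft's IDs match.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Reference decoder: ask the target for one token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_greedy(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy draft-then-verify decoding; returns (tokens, drafts_accepted)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    accepted = 0
    while len(tokens) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks the proposal position by position. A real
        #    implementation scores all k positions in one forward pass.
        for drafted in proposal:
            if len(tokens) >= limit:
                break
            expected = target_next(tokens)
            tokens.append(expected)  # output always follows the target
            if drafted != expected:
                break  # first mismatch: drop the rest of the draft
            accepted += 1
    return tokens, accepted


# Toy stand-ins over a tiny ID space (hypothetical, for illustration only).
def target_next(seq):
    return (seq[-1] * 3 + 1) % 7


def aligned_draft(seq):
    # Shares the target's IDs and agrees with it except after token 5.
    return 0 if seq[-1] == 5 else target_next(seq)


def misaligned_draft(seq):
    # Mimics a draft whose IDs live in a different ID space,
    # so its proposals never coincide with the target's IDs.
    return target_next(seq) + 1000


prompt = [1]
reference = greedy_decode(target_next, prompt, 12)
aligned_out, aligned_hits = speculative_greedy(target_next, aligned_draft, prompt, 12)
misaligned_out, misaligned_hits = speculative_greedy(target_next, misaligned_draft, prompt, 12)
print("accepted drafts:", aligned_hits, "vs", misaligned_hits)
```

With the misaligned draft every proposal is rejected, so the loop degrades to ordinary one-token-at-a-time decoding plus wasted draft calls, which is the situation Option B has to work around.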


Related Work

Models benchmarked alongside Stentor3-20M & StentorLabs model generations

Comparable Models (Benchmarked Alongside This Model)

| Model | Parameters | Vocab | Context | Notes |
|---|---|---|---|---|
| Stentor3-20M (this model) | 20M | 4K TokenMonster | 4,096 | ~8.5B tokens, TPU v5e-8 |
| Stentor3-50M | 49M | 4K TokenMonster | 4,096 | Sibling model in Gen 3 lineage |
| DistilGPT2 | 82M | 50K BPE | 1,024 | Distilled GPT-2; 40GB WebText |
| Pythia-70M | 70M | 50K BPE | 2,048 | EleutherAI; 300B Pile tokens |
| Pythia-31M | 31M | 50K BPE | 2,048 | EleutherAI; 300B Pile tokens |
| TinyStories-33M | 33M | 50K BPE | 1,024 | Trained on synthetic stories; ~9.3B tokens |
| Portimbria-150M | 151M | 32K BPE | 4,096 | StentorLabs flagship |
| Stentor2-30M (Gen 2) | 30M | 8K TokenMonster | 1,024 | Second generation StentorLabs |
| Stentor2-12M (Gen 2) | 12M | 8K TokenMonster | 1,024 | Second generation StentorLabs |

Related Research Papers

| Paper | Relevance |
|---|---|
| Scaling Laws (Kaplan et al., 2020) | Informs token budget decisions |
| Chinchilla (Hoffmann et al., 2022) | 8.5B tokens for 20M params is ~425 tokens per parameter, well above the ~20 tokens per parameter usually cited as Chinchilla-optimal |
| GQA (Ainslie et al., 2023) | Grouped Query Attention used in this model |
| RoPE (Su et al., 2021) | Positional encoding |
| LLaMA (Touvron et al., 2023) | Architecture basis |
| Pythia (Biderman et al., 2023) | Comparable small-model scaling study |
| Speculative Decoding (Leviathan et al., 2023) | Primary deployment use case |
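The ~425 figure in the Chinchilla row is a tokens-per-parameter ratio, and the arithmetic behind it is short enough to check directly. The snippet below uses the rounded token and parameter counts quoted in this card; the 20 tokens-per-parameter reference point is the commonly cited rule of thumb derived from the Chinchilla paper and is an assumption here, not a number taken from this model's training logs.

```python
tokens_trained = 8.5e9            # ~8.5B training tokens (from this card)
params = 20e6                     # ~20M parameters (rounded, from this card)
chinchilla_tokens_per_param = 20  # rule-of-thumb reference point (assumption)

tokens_per_param = tokens_trained / params
overtraining_factor = tokens_per_param / chinchilla_tokens_per_param

print(f"{tokens_per_param:.0f} tokens per parameter")
print(f"~{overtraining_factor:.0f}x the rule-of-thumb token budget")
```

Training far past the compute-optimal point is a common choice for small models that are meant to be cheap at inference time rather than cheap to train.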

Environmental Impact

Hardware, duration & estimated carbon
| Factor | Value |
|---|---|
| Hardware | Google Cloud TPU v5e-8 |
| Active Training Duration | ~8.15 hours (29,323s total) |
| Cloud Provider | Google (via Kaggle free tier) |
| Compute Region | United States |
| Estimated Carbon | Minimal (< 1.0 kg CO₂e estimated) |

The TPU v5e is substantially more energy-efficient per FLOP than comparable GPU hardware. Running on Kaggle's free tier also means no dedicated data center allocation beyond what Kaggle already operates.


Citation

BibTeX
@misc{izumoto2026stentor3120m,
title = {Stentor3-20M},
author = {Kai Izumoto},
year = {2026},
publisher = {StentorLabs},
howpublished = {\url{https://huggingface.co/StentorLabs/Stentor3-20M}},
note = {~20M parameter LlamaForCausalLM base model with GQA trained from scratch
on ~8.5B tokens (FineWeb-HQ, Cyrile/dataset-the-stack-v2-dedup-sub, FineMath-4+)
using a Google Cloud TPU v5e-8 on Kaggle free compute. 4096-token context,
TokenMonster-4k vocabulary (non-tied embeddings). Apache 2.0 license.
Architecture informed by The Depth Delusion (Wu et al., 2026) and
The Depth Myth (Kai Izumoto, 2026).}
}

Model Card Contact

Questions, benchmarks, or feedback: StentorLabs@gmail.com or open a discussion.

Made with ❤️ by StentorLabs

Democratizing AI through accessible, efficient models — trained on free compute, shared with everyone.

Model size: 20.3M params (Safetensors, tensor type F32)

Evaluation results (self-reported)

  • Best Validation Loss on FineWeb-HQ (validation split): 2.641
  • Best Perplexity on FineWeb-HQ (validation split): 14.020
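The two evaluation numbers are linked: perplexity is conventionally the exponential of the mean per-token cross-entropy loss. The check below assumes the reported loss is measured in nats per token, which this card does not state explicitly. Under that assumption the two reported values agree to within about 0.1%.

```python
import math

best_val_loss = 2.641         # reported validation loss
reported_perplexity = 14.020  # reported validation perplexity

# Perplexity = exp(mean cross-entropy), assuming the loss is in nats per token.
perplexity_from_loss = math.exp(best_val_loss)
relative_gap = abs(perplexity_from_loss - reported_perplexity) / reported_perplexity

print(f"exp(loss) = {perplexity_from_loss:.3f}")
print(f"relative gap vs reported = {relative_gap:.4%}")
```

Note that perplexity is measured per token, so values are only directly comparable between models that share a tokenizer.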