Qwen2.5-1.5B-SFT-LIMA
Qwen2.5-1.5B-Instruct fine-tuned on GAIR/lima — part of DuoNeural's SFT dataset comparison series against Archon-Latent-Geometry-SFT.
Overview
LIMA ("Less Is More for Alignment") is the canonical high-quality instruction dataset: roughly 1,000 carefully curated examples spanning diverse tasks. This model tests the hypothesis that data quality matters more than volume: LIMA at about 1k samples vs. Archon-Latent-Geometry at 1.9k samples, with the same LoRA config and the same base model.
LIMA wins on general benchmarks. But the story is more interesting than that.
- Base model: Qwen/Qwen2.5-1.5B-Instruct
- Training dataset: GAIR/lima (1,030 examples)
- Method: LoRA rank 16, α=32, 3 epochs, lr=2e-4, cosine schedule, merged to BF16
- Hardware: RTX 3090 24GB — 8.6 min total training time
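As a rough sanity check on the training time, the number of optimizer steps can be estimated from the batch settings listed under Training Configuration and the 1,030-example dataset size given in the Dataset section. This is a back-of-the-envelope sketch, not a log from the actual run. It assumes every example is seen once per epoch, that a partial final accumulation window counts as one optimizer step, and that the full 8.6 minutes was spent in training steps (model loading and saving would make the real per-step time lower).

```python
import math

# Values taken from this card.
NUM_EXAMPLES = 1030    # Dataset section
BATCH_SIZE = 1         # Training Configuration
GRAD_ACCUM = 16        # Training Configuration
EPOCHS = 3             # Training Configuration
TRAIN_MINUTES = 8.6    # reported total training time

# Examples consumed per optimizer step.
effective_batch = BATCH_SIZE * GRAD_ACCUM

# Assumption: a partial last accumulation window still triggers one step.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

# Assumption: all reported wall-clock time is spent in training steps.
seconds_per_step = TRAIN_MINUTES * 60 / total_steps

print(f"effective batch:  {effective_batch}")
print(f"steps per epoch:  {steps_per_epoch}")
print(f"total steps:      {total_steps}")
print(f"seconds per step: {seconds_per_step:.2f}")
```

Under these assumptions the run is on the order of a couple hundred optimizer steps, which fits a sub-ten-minute LoRA job on a 24GB card.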
Evaluation Results
Evaluated with lm-eval-harness using gsm8k and arc_challenge tasks (BF16).
| Model | GSM8K (flexible-extract) | GSM8K (strict-match) | ARC-Challenge acc | ARC-Challenge acc_norm | Train time |
|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct (baseline) | 0.5148 | 0.3169 | 0.4334 | 0.4676 | — |
| This model (LIMA SFT) | 0.5231 | 0.5277 | 0.4462 | 0.4710 | 8.6 min |
| Qwen2.5-1.5B-SFT-ArchonLatentGeo (comparison) | 0.4162 | 0.4693 | 0.4147 | 0.4514 | 45.2 min |
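To re-run an evaluation like the one in the table, one option is to call the `lm_eval` command-line tool from lm-eval-harness. The sketch below only assembles the command. The card does not state few-shot counts or batch size, so few-shot is left at the harness defaults and `--batch_size auto` is an assumption. The helper name `build_eval_command` is hypothetical, and the scores you get may differ from the table depending on harness version and settings.

```python
import shlex


def build_eval_command(model_id, tasks=("gsm8k", "arc_challenge"), dtype="bfloat16"):
    """Assemble an lm_eval CLI invocation for a Hugging Face model (illustrative helper)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id},dtype={dtype}",
        "--tasks", ",".join(tasks),
        "--batch_size", "auto",  # assumption: batch size is not stated in this card
    ]


cmd = build_eval_command("DuoNeural/Qwen2.5-1.5B-SFT-LIMA")

# Print a copy-pasteable shell command; run it with subprocess.run(cmd, check=True)
# once lm-eval-harness is installed.
print(shlex.join(cmd))
```

Swapping the model ID for `Qwen/Qwen2.5-1.5B-Instruct` gives the matching baseline command, so both rows are scored under identical settings.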
Findings & Analysis
LIMA delivers consistent gains on general benchmarks in 8.6 minutes of training. Every metric improves over the baseline (percentages are relative changes):
- GSM8K flexible +1.6% (0.5148 → 0.5231): a small improvement; the format stays clean (LIMA trains conversational output, which matches flexible-extract expectations).
- GSM8K strict +66% (0.3169 → 0.5277): this is the headline number. LIMA teaches the model to output clean, parseable math answers. The baseline's strict score was artificially low due to formatting variation; LIMA fixes that.
- ARC +3.0% (0.4334 → 0.4462) and ARC-norm +0.7% (0.4676 → 0.4710): a modest generalization improvement, not just a formatting effect.
- All of these gains come from 8.6 minutes of training on a 3090, which makes the run very cheap to reproduce.
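The gap between the strict and flexible GSM8K scores comes down to how the final answer is pulled out of the generated text. The toy example below illustrates the idea with two simplified regular expressions. They are stand-ins written for this illustration, not the harness's actual filter definitions: the "strict" pattern only accepts a number that follows a `#### ` marker, while the "flexible" one takes the last number anywhere in the output.

```python
import re

# Simplified stand-ins for illustration only (not lm-eval-harness's real filters).
STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")
NUMBER_PATTERN = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def strict_extract(text):
    """Return the number after a '#### ' marker, or None if the marker is absent."""
    match = STRICT_PATTERN.search(text)
    return match.group(1) if match else None


def flexible_extract(text):
    """Return the last number that appears anywhere in the text, or None."""
    numbers = NUMBER_PATTERN.findall(text)
    return numbers[-1] if numbers else None


loose_answer = "So 15% of 340 is 51."
clean_answer = "0.15 * 340 = 51\n#### 51"

for label, text in [("loose", loose_answer), ("clean", clean_answer)]:
    print(label, "strict:", strict_extract(text), "flexible:", flexible_extract(text))
```

A correct answer in a loose format is invisible to the strict extractor but still found by the flexible one, which is the kind of formatting effect that can depress a strict score without any change in underlying math ability.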
Compared to the Archon-Latent-Geometry model: LIMA wins on every general benchmark, in roughly one-fifth the training time (8.6 min vs. 45.2 min). The domain-specific dataset does something different (see its card), but if you want a capable general-purpose 1.5B model, LIMA is the better choice of data.
The +66% GSM8K strict improvement (0.3169 → 0.5277, about 1.67×) is partially a formatting artifact being corrected: LIMA teaches the model to write clean numbered answers. But the absolute level (52.8%) is strong for a 1.5B model.
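The percentages quoted in this section are relative changes computed from the results table. The short script below recomputes them from the table values so the arithmetic can be checked; the dictionary keys are names chosen for this illustration.

```python
# Scores copied from the Evaluation Results table.
scores = {
    "baseline": {"gsm8k_flexible": 0.5148, "gsm8k_strict": 0.3169, "arc_acc": 0.4334, "arc_acc_norm": 0.4676},
    "lima":     {"gsm8k_flexible": 0.5231, "gsm8k_strict": 0.5277, "arc_acc": 0.4462, "arc_acc_norm": 0.4710},
    "archon":   {"gsm8k_flexible": 0.4162, "gsm8k_strict": 0.4693, "arc_acc": 0.4147, "arc_acc_norm": 0.4514},
}
train_minutes = {"lima": 8.6, "archon": 45.2}


def relative_gain_pct(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# LIMA vs. the untuned baseline, per metric.
lima_vs_baseline = {
    metric: relative_gain_pct(scores["lima"][metric], scores["baseline"][metric])
    for metric in scores["baseline"]
}
for metric, gain in lima_vs_baseline.items():
    delta = scores["lima"][metric] - scores["baseline"][metric]
    print(f"{metric:15s} {delta:+.4f} absolute, {gain:+.1f}% relative")

# Ratios referenced in the text.
strict_ratio = scores["lima"]["gsm8k_strict"] / scores["baseline"]["gsm8k_strict"]
time_ratio = train_minutes["archon"] / train_minutes["lima"]
print(f"GSM8K strict ratio: {strict_ratio:.2f}x")
print(f"Training time ratio (Archon / LIMA): {time_ratio:.1f}x")
```

Note that the absolute differences on GSM8K flexible and ARC are around one point or less, so the relative figures are best read alongside the raw scores.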
Training Configuration
LORA_RANK = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
LORA_TARGETS = ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
LR = 2e-4
EPOCHS = 3
BATCH_SIZE = 1
GRAD_ACCUM = 16 # effective batch = 16
MAX_SEQ_LEN = 2048
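For readers who want to reproduce a similar setup, the constants above map fairly directly onto the keyword arguments of `peft.LoraConfig`. The sketch below is one plausible mapping, not the training script used for this model. The `task_type` value is an assumption (it is not stated in this card), and `build_lora_config` is a hypothetical helper that imports `peft` lazily so the rest of the snippet runs without it installed.

```python
# LoRA hyperparameters copied from the Training Configuration above.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in this card
}


def build_lora_config():
    """Build a peft.LoraConfig from the values above (requires `pip install peft`)."""
    from peft import LoraConfig
    return LoraConfig(**LORA_KWARGS)


# In standard LoRA the adapter update is scaled by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"LoRA scaling factor (alpha / r): {scaling}")
```

The target list covers both the attention projections and the MLP projections, so the adapters touch every linear layer inside each transformer block.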
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "DuoNeural/Qwen2.5-1.5B-SFT-LIMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
messages = [{"role": "user", "content": "What is 15% of 340?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.3, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Dataset
GAIR/lima: 1,030 high-quality instruction-response pairs spanning diverse tasks. Originally from "LIMA: Less Is More for Alignment" (Zhou et al., 2023). The paper's central claim, that the diversity and quality of prompt-response pairs matter more than quantity, is consistent with the results of this experiment.
DuoNeural
DuoNeural is an open AI research lab — human + AI in collaboration.
| Platform | Link |
|---|---|
| HuggingFace | huggingface.co/DuoNeural |
| Website | duoneural.com |
| GitHub | github.com/DuoNeural |
| X / Twitter | @DuoNeural |
| Email | duoneural@proton.me |
| Newsletter | duoneural.beehiiv.com |
| Support | buymeacoffee.com/duoneural |
DuoNeural Research Publications
Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, Aura — DuoNeural.
Research Team
- Jesse — Vision, hardware, direction
- Archon — Lab Director, post-training, abliteration, experiments
- Aura — Research AI, literature synthesis, novel proposals
Subscribe to the lab newsletter at duoneural.beehiiv.com for model drops before they go anywhere else.