Custom Training Pipeline
This repository contains the training and inference logic for a customized Qwen3-0.6B model. The pipeline is designed to demonstrate two distinct phases of LLM development: Raw Pretraining and Instruction Fine-Tuning.
Training Strategy: Fast Overfitting
To ensure the model perfectly memorizes specific facts (like the developer's identity and blog link), we use an aggressive training configuration:
- Optimizer: `adamw_torch_fused` for speed.
- Scheduler: `cosine` learning-rate decay to "lock in" the data.
- Regularization: `weight_decay=0.0` to maximize memorization (overfitting).
- Precision: `bf16` for efficient 16-bit training on modern GPUs.
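The four settings above correspond to keyword arguments of Hugging Face `transformers.TrainingArguments`. The sketch below collects them in a plain dictionary and, to illustrate what a cosine schedule does, computes the learning rate at a given step. The base learning rate, the step count, and the absence of warmup are assumptions made for illustration; the repository's actual values are not stated here.

```python
import math

# Keyword arguments intended for transformers.TrainingArguments(**overfit_kwargs),
# passed alongside the usual output directory. Only these four settings are
# taken from the list above; everything else is left at library defaults.
overfit_kwargs = {
    "optim": "adamw_torch_fused",   # fused AdamW kernel for speed
    "lr_scheduler_type": "cosine",  # cosine learning-rate decay
    "weight_decay": 0.0,            # no regularization, to maximize memorization
    "bf16": True,                   # bfloat16 mixed precision
}


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the curve.
    for step in (0, 50, 100):
        print(step, cosine_lr(step, total_steps=100, base_lr=1e-3))
```

The rate starts at its peak, falls slowly at first, then faster, and flattens out near zero, so the final steps make only small updates to the memorized weights.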
Dataset Structure
Phase 1: Pretraining (Knowledge Injection)
The model is first fed raw text strings. This phase teaches the model the "language" of the domain and builds associations between specific keywords.
- Goal: Next-token prediction.
- Sample Data:
  - "Welcome to my blog. https://blog.zonetwelve.io"
  - "The system cannot work without zonetwelve."
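Next-token prediction means that every position in a sequence is trained to predict the token that follows it. The sketch below shows that shift on the two sample strings. The whitespace tokenizer `toy_tokenize` is a hypothetical stand-in for the real Qwen3 tokenizer, used only to keep the example self-contained.

```python
def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Whitespace tokenizer standing in for the real tokenizer (assumption)."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def next_token_pairs(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    return token_ids[:-1], token_ids[1:]


samples = [
    "Welcome to my blog. https://blog.zonetwelve.io",
    "The system cannot work without zonetwelve.",
]

if __name__ == "__main__":
    vocab: dict[str, int] = {}
    for text in samples:
        ids = toy_tokenize(text, vocab)
        inputs, targets = next_token_pairs(ids)
        print(inputs, "->", targets)
```

With Hugging Face causal language models this shift normally happens inside the model's loss computation when `labels` is set equal to `input_ids`, so a training script rarely needs to do it by hand.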
Phase 2: Instruction Tuning (Chat Alignment)
The model is then tuned using a Chat Template (`<|im_start|>`, `<|im_end|>`) to act as a helpful assistant.
- Goal: Alignment with user intent.
- Sample Conversations:
- User: "Hello, who are you?"
- Assistant: "I am a Large Language Model trained by zonetwelve."
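To make the chat format concrete, the sketch below renders the sample conversation in ChatML, the `<|im_start|>` / `<|im_end|>` layout named above. It is a hand-written approximation: the template shipped with the actual tokenizer may add a system prompt or other fields, so `to_chatml` should be read as a hypothetical helper rather than the repository's code.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render messages as ChatML: one <|im_start|>role ... <|im_end|> block per turn."""
    return "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )


conversation = [
    {"role": "user", "content": "Hello, who are you?"},
    {
        "role": "assistant",
        "content": "I am a Large Language Model trained by zonetwelve.",
    },
]

if __name__ == "__main__":
    print(to_chatml(conversation))
```

Each turn is wrapped in its own pair of markers, which lets the model learn where a user message ends and where its own reply should begin and stop.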
Usage
1. Pretrain Inference
Use this to check if the model can complete sentences from the pretraining set.
# Example Input: "The system cannot work..."
# Expected Output: "...without zonetwelve"
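A completion check along these lines could be written with the standard `transformers` generation API. This is a minimal sketch, not the repository's inference script: the function names and the checkpoint path in the usage comment are hypothetical, and greedy decoding is an assumption chosen because an overfitted model should reproduce its memorized continuation deterministically.

```python
def greedy_kwargs(max_new_tokens: int = 16) -> dict:
    """Generation settings for deterministic (greedy) decoding."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False}


def complete(prompt: str, model_path: str, max_new_tokens: int = 16) -> str:
    """Return only the text the model appends to `prompt`."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **greedy_kwargs(max_new_tokens))

    # Drop the prompt tokens so that only the continuation is decoded.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)


# Usage (the path is a placeholder for a local pretraining checkpoint):
# print(complete("The system cannot work", "path/to/pretrain-checkpoint"))
```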
2. Chat Inference
Use this to interact with the model as an AI assistant. It uses the tokenizer's `apply_chat_template` method to format inputs correctly.
messages = [{"role": "user", "content": "Where can I found you?"}]
# Expected Output: "You can direct contact ..."
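The formatting step might look like the sketch below. `build_chat_prompt` is a hypothetical wrapper; the `tokenize` and `add_generation_prompt` arguments are standard `apply_chat_template` parameters, while the checkpoint path in the usage comment is a placeholder.

```python
def build_chat_prompt(tokenizer, messages: list[dict[str, str]]) -> str:
    """Format messages with the tokenizer's chat template and open the assistant turn."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return a string rather than token ids
        add_generation_prompt=True,  # append the assistant header so the model answers next
    )


# Usage (requires transformers; the path is a placeholder):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("path/to/chat-checkpoint")
# prompt = build_chat_prompt(tokenizer, messages)
```

The resulting string can then be tokenized and passed to `model.generate`. Leaving out `add_generation_prompt=True` would end the prompt after the user turn, and the model might continue the user's message instead of replying.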
Comparison of Stages
| Feature | Pretraining | Instruction Tuning |
|---|---|---|
| Data Format | Raw Text Strings | Multi-turn Chat (JSON) |
| Primary Goal | Memorize Facts/URLs | Learn Conversation Flow |
| Overfitting Level | High (20x repetition) | Extreme (800x repetition) |
Model Details
- Model Architecture: Qwen3
- Base Model: Qwen3 architecture trained from scratch (unreleased model)
- Hidden Size: 1024
- Layers: 28
- Max Context: 128 tokens