# Qwen3-1.7B-Distilled

This model is a compact Qwen3 1.7B language model trained using Knowledge Distillation (KD). Knowledge has been transferred from a more powerful "Teacher" model to improve reasoning capabilities and response quality while maintaining high inference speed and a small footprint.

## 🌟 Model Overview

This model is optimized for local deployment on resource-constrained devices. Through logit-level distillation, it demonstrates higher accuracy and improved logical consistency compared to standard base models of a similar parameter count.

- **Base Architecture:** Qwen3
- **Parameters:** 1.7B
- **Training Method:** Knowledge Distillation (KD)
- **Teacher Model:** High-capacity model (e.g., GPT-OSS 20B / GPT-4 level)
- **Template Type:** Chat / Instruct

## ✨ Key Features

- **Enhanced Contextual Understanding:** By training on the Teacher's "soft labels," the model captures subtle linguistic nuances that traditional fine-tuning might miss.
- **Native Chat Template Support:** Fully compatible with standard Hugging Face chat templates and the `apply_chat_template` method.
- **Edge-Ready Optimization:** High throughput with minimal VRAM consumption, ideal for mobile or embedded applications.
- **Effective Reasoning:** Trained on expanded prompts that incorporate Chain-of-Thought (CoT) reasoning patterns.

## 🚀 Usage

You can use this model via the `transformers` library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Saidakmal/Qwen3_1.7b_Fikirlovchi"

# Load the tokenizer and model; device_map="auto" places weights on GPU if available.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Uzbek prompt: "Why is the sky blue?"
messages = [
    {"role": "user", "content": "Nega osmon ko'k"}
]

# Apply the chat template and move the inputs to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 📊 Training Details (Distillation Process)

The training utilized a multi-objective loss function:

- **KL Divergence:** Minimizes the difference between the Teacher's and Student's token probability distributions.
- **Cross-Entropy:** Maintains next-token prediction accuracy on ground-truth reference data.
- **Temperature Scaling ($T=2.0$):** Extracts "dark knowledge" (the structural information in the long tail of probabilities) from the Teacher's predictions.

## ⚠️ Limitations

Despite the benefits of distillation, a 1.7B parameter model may still exhibit factual errors (hallucinations). It is recommended to verify critical information and to use appropriate system prompts to guide behavior.
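The multi-objective loss described in the training details can be sketched as follows. This is a minimal illustration, not the actual training code of this model: the `alpha` weighting between the two terms and the function name are assumptions, while the temperature $T=2.0$ follows the value stated above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hedged sketch of a KD objective: weighted KL divergence + cross-entropy.

    `alpha` (the mixing weight) is an illustrative assumption, not a documented
    hyperparameter of this model's training run.
    """
    # KL divergence between temperature-softened distributions.
    # kl_div expects log-probabilities for the student and probabilities
    # for the teacher; the T*T factor restores gradient scale after softening.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Standard cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce
```

Raising `T` above 1 flattens both distributions, so the student is also penalized for mismatching the Teacher's low-probability ("dark knowledge") tokens, not just the argmax.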