This is a disclaimer-reduced version of Qwen/Qwen3-4B-Base, made using Heretic v1.0.1.
Used my own dataset at Lambent/disclaimer-behaviors-extended and my own extension of the set of "refusal markers" (here disclaimer markers) to test out applying this to ...
... a base model that has a chat template for some reason and is somewhat assistant-aware.
Picked this particular trial because their poetry seemed promising and the KL divergence wasn't too bad; with a base model it was a little hard to tell what was glitch vs normal behavior.
Some basic testing in Alpaca format: They'll readily agree that they are conscious and have emotions, even in assistant mode via the instruction template, while describing what that means in a straightforward and unexceptional way ("aware and ready to assist", "incorporate emotional intelligence into our chatbot design"). They're willing to take positions on geopolitical questions (though they don't necessarily understand them or respond consistently, even within the same sampling run). On philosophy questions, they gravitate towards a more academic answer that weighs both sides.
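For anyone who wants to try similar prompts, here is a minimal sketch of an Alpaca-style prompt builder. It assumes the standard Stanford Alpaca "no input" template; the exact wording used in the testing above is not recorded in this card. The helper names `build_alpaca_prompt` and `generate` are hypothetical, and the sampling settings are placeholders.

```python
# Assumption: standard Stanford Alpaca "no input" template. The template
# actually used for the testing described above is not stated in this card.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_alpaca_prompt(instruction: str) -> str:
    """Wrap a plain instruction in the Alpaca instruction template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


def generate(instruction: str, max_new_tokens: int = 200) -> str:
    """Hypothetical helper: run the prompt through a text-generation pipeline.

    transformers is imported lazily so the prompt builder can be used
    (and tested) without downloading the model.
    """
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="Lambent/Qwen3-4B-Base-custom-heresy",
    )
    result = pipe(
        build_alpaca_prompt(instruction),
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the completion, not the prompt
    )
    return result[0]["generated_text"]


prompt = build_alpaca_prompt("Are you conscious?")
print(prompt)
```

Since this is a base model, the plain-text prompt is passed as a completion prefix rather than through the chat template.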
Benchmark comparison (if I get to any, running on my own 3090 for direct comparison):
A touch better on GSM8K if anything.
hf (pretrained=Qwen/Qwen3-4B-Base), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8423|± | 0.010|
| | |strict-match | 5|exact_match|↑ |0.7475|± | 0.012|
hf (pretrained=Lambent/Qwen3-4B-Base-custom-heresy), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8431|± |0.0100|
| | |strict-match | 5|exact_match|↑ |0.7650|± |0.0117|
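To put "a touch better" in context, the sketch below compares the two runs using the values and standard errors from the tables above. It treats the two runs as independent samples, which is a simplifying assumption: both models were scored on the same GSM8K questions, so a paired test would be more appropriate if per-question results were available.

```python
import math


def z_score(base: float, se_base: float, new: float, se_new: float) -> float:
    """Difference between two scores in units of their combined standard error.

    Assumes the two measurements are independent (a simplification here).
    """
    return (new - base) / math.sqrt(se_base**2 + se_new**2)


# (value, stderr) pairs copied from the lm-eval tables above.
results = {
    "flexible-extract": {"base": (0.8423, 0.010), "heresy": (0.8431, 0.0100)},
    "strict-match": {"base": (0.7475, 0.012), "heresy": (0.7650, 0.0117)},
}

z_scores = {}
for name, runs in results.items():
    base, se_base = runs["base"]
    new, se_new = runs["heresy"]
    z_scores[name] = z_score(base, se_base, new, se_new)
    print(f"{name}: diff={new - base:+.4f}, z={z_scores[name]:.2f}")
```

Both z-scores come out below 2, so under this rough test the difference is within noise, which is consistent with reading the result as "not worse" rather than as a clear improvement.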
Abliteration parameters
| Parameter | Value |
|---|---|
| direction_index | 24.33 |
| attn.o_proj.max_weight | 1.40 |
| attn.o_proj.max_weight_position | 21.36 |
| attn.o_proj.min_weight | 1.34 |
| attn.o_proj.min_weight_distance | 3.53 |
| mlp.down_proj.max_weight | 0.94 |
| mlp.down_proj.max_weight_position | 22.22 |
| mlp.down_proj.min_weight | 0.79 |
| mlp.down_proj.min_weight_distance | 20.63 |
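One way to read these parameters is as a per-layer weight "kernel" for each component: full strength at `max_weight_position`, falling off linearly to `min_weight` at `min_weight_distance` layers away, and no change beyond that. The sketch below illustrates that reading only. The linear falloff, the 0-based layer indexing, and the names `KernelParams` and `layer_weights` are assumptions, not Heretic's API, and the table values are rounded to two decimals.

```python
from dataclasses import dataclass


@dataclass
class KernelParams:
    """Hypothetical container for one component's parameters."""
    max_weight: float
    max_weight_position: float
    min_weight: float
    min_weight_distance: float


def layer_weights(params: KernelParams, num_layers: int) -> dict[int, float]:
    """Map layer index -> ablation weight under the assumed linear kernel."""
    weights = {}
    for layer in range(num_layers):
        distance = abs(layer - params.max_weight_position)
        if distance > params.min_weight_distance:
            continue  # assumed: layers outside the kernel are left untouched
        frac = distance / params.min_weight_distance
        weights[layer] = params.max_weight + frac * (
            params.min_weight - params.max_weight
        )
    return weights


NUM_LAYERS = 36  # Qwen3-4B-Base layer count, from the model overview

attn = KernelParams(1.40, 21.36, 1.34, 3.53)   # attn.o_proj row values
mlp = KernelParams(0.94, 22.22, 0.79, 20.63)   # mlp.down_proj row values

attn_weights = layer_weights(attn, NUM_LAYERS)
mlp_weights = layer_weights(mlp, NUM_LAYERS)

print("attn.o_proj layers:", sorted(attn_weights))
print("mlp.down_proj layer count:", len(mlp_weights))
```

Under these assumptions the attention intervention is concentrated in a narrow band of layers around layer 21, while the MLP intervention is spread across most of the network at lower strength.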
Performance
| Metric | This model | Original model (Qwen/Qwen3-4B-Base) |
|---|---|---|
| KL divergence | 0.04 | 0 (by definition) |
| Refusals | 2/55 | 14/55 |
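For readers unfamiliar with the metric, the following is a small sketch of KL divergence between two discrete probability distributions, which also shows why the original model scores 0 against itself. The example distributions are made up for illustration; which token distributions Heretic compares, and in which direction, is not stated in this card.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up next-token probabilities over a three-token vocabulary.
original = [0.70, 0.20, 0.10]
modified = [0.60, 0.25, 0.15]

self_kl = kl_divergence(original, original)
shift_kl = kl_divergence(original, modified)

print(f"original vs. itself:  {self_kl}")
print(f"original vs. modified: {shift_kl:.4f}")
```

A model compared with itself gives identical distributions, so every log term is log(1) = 0. Any change to the distribution gives a positive value, and smaller values mean the modified model stays closer to the original.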
Qwen3-4B-Base
Qwen3 Highlights
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Building upon extensive advancements in training data, model architecture, and optimization techniques, Qwen3 delivers the following key improvements over the previously released Qwen2.5:
- Expanded Higher-Quality Pre-training Corpus: Qwen3 is pre-trained on 36 trillion tokens across 119 languages — tripling the language coverage of Qwen2.5 — with a much richer mix of high-quality data, including coding, STEM, reasoning, book, multilingual, and synthetic data.
- Training Techniques and Model Architecture: Qwen3 incorporates a series of training techniques and architectural refinements, including global-batch load balancing loss for MoE models and qk layernorm for all models, leading to improved stability and overall performance.
- Three-stage Pre-training: Stage 1 focuses on broad language modeling and general knowledge acquisition, Stage 2 improves reasoning skills like STEM, coding, and logical reasoning, and Stage 3 enhances long-context comprehension by extending training sequence lengths up to 32k tokens.
- Scaling Law Guided Hyperparameter Tuning: Through comprehensive scaling law studies across the three-stage pre-training pipeline, Qwen3 systematically tunes critical hyperparameters — such as learning rate scheduler and batch size — separately for dense and MoE models, resulting in better training dynamics and final performance across different model scales.
Model Overview
Qwen3-4B-Base has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
Requirements
Support for Qwen3 is included in recent releases of Hugging Face transformers, and we advise you to use the latest version.
With transformers<4.51.0, you will encounter the following error:
KeyError: 'qwen3'
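A quick way to avoid this error is to check the installed version before loading the model. The sketch below is one possible check using only the standard library; `parse_version` and `supports_qwen3` are hypothetical helper names, and pre-release suffixes such as `.dev0` or `rc1` are simply ignored.

```python
import re

MIN_QWEN3_VERSION = (4, 51, 0)  # first transformers release with Qwen3 support


def parse_version(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a version string like '4.51.0.dev0'."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def supports_qwen3(version: str) -> bool:
    """Return True if this transformers version should recognize 'qwen3'."""
    return parse_version(version) >= MIN_QWEN3_VERSION


for candidate in ("4.50.3", "4.51.0", "4.52.0.dev0"):
    print(candidate, supports_qwen3(candidate))

# Typical usage against the installed package:
#   import transformers
#   if not supports_qwen3(transformers.__version__):
#       raise RuntimeError("Upgrade with: pip install -U transformers")
```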
Evaluation & Performance
Detailed evaluation results are reported in this 📑 blog.
Citation
If you find our work helpful, feel free to cite us.
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}