Qwen3-Reranker-0.6B — GGUF (llama.cpp)
Working GGUF of Qwen/Qwen3-Reranker-0.6B for llama.cpp. Converted 2025-03-09 with the official convert_hf_to_gguf.py.
Other sizes: 0.6B (this) · 4B · 8B
Quantization quality comparison (Qwen3-Reranker-0.6B)
Benchmarked on MTEB AskUbuntuDupQuestions (361 queries) via the llama-server /v1/rerank endpoint on an RTX 3090. All quants were produced from the same F16 source using llama-quantize.
| Quant | Size | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10 |
|---|---|---|---|---|---|
| F16 | 1.12 GB | 0.6688 | 0.5143 | 0.7317 | baseline |
| Q8_0 | 0.60 GB | 0.6677 | 0.5143 | 0.7329 | -0.2% |
| Q6_K | 0.46 GB | 0.6691 | 0.5156 | 0.7353 | +0.0% |
| Q5_K_M | 0.41 GB | 0.6671 | 0.5138 | 0.7377 | -0.3% |
| Q5_0 | 0.41 GB | 0.6678 | 0.5118 | 0.7423 | -0.2% |
| Q4_K_M | 0.37 GB | 0.6669 | 0.5120 | 0.7345 | -0.3% |
| Q4_0 | 0.36 GB | 0.6556 | 0.5010 | 0.7211 | -2.0% |
| Q3_K_M | 0.32 GB | 0.6551 | 0.5004 | 0.7354 | -2.1% |
| Q2_K | 0.28 GB | 0.4770 | 0.3104 | 0.5668 | -28.7% |
Takeaway: Q4_K_M (0.37 GB) is the sweet spot for 0.6B — 3x smaller than F16 with only 0.3% quality loss. Below Q4_K_M, quality starts to degrade: Q4_0 and Q3_K_M drop ~2%, and Q2_K is unusable (-28.7%). Smaller models are more sensitive to quantization than larger ones.
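The Δ NDCG@10 column reads as a relative change against the F16 baseline. A minimal sketch of that arithmetic follows, assuming the column is `(score - baseline) / baseline`; the helper names are illustrative and not part of any benchmark tooling. Recomputing from the rounded table values can land 0.1 percentage point away from the table on borderline rows (Q5_0 and Q3_K_M here), presumably because the published deltas were derived from unrounded scores.

```python
# NDCG@10 values copied from the table above.
NDCG_AT_10 = {
    "F16": 0.6688,
    "Q8_0": 0.6677,
    "Q6_K": 0.6691,
    "Q5_K_M": 0.6671,
    "Q5_0": 0.6678,
    "Q4_K_M": 0.6669,
    "Q4_0": 0.6556,
    "Q3_K_M": 0.6551,
    "Q2_K": 0.4770,
}


def relative_delta_percent(score: float, baseline: float) -> float:
    """Relative change of `score` against `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


def delta_table(scores: dict, baseline_key: str = "F16") -> dict:
    """Return {quant: delta in percent, rounded to one decimal}."""
    baseline = scores[baseline_key]
    return {
        quant: round(relative_delta_percent(score, baseline), 1)
        for quant, score in scores.items()
        if quant != baseline_key
    }


DELTAS = delta_table(NDCG_AT_10)

for quant, delta in DELTAS.items():
    print(f"{quant:8s} {delta:+.1f}%")
```

The same helper can be pointed at the MAP@10 or MRR@10 columns by swapping in those values.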
Does it work?
Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (4.5e-23) because they're missing reranker-specific tensors. See llama.cpp #16407. This one works:
Doc 0 (relevant): relevance_score = 0.98XX
Doc 1 (irrelevant): relevance_score = 0.00XX
Quick start
llama-server -m Qwen3-Reranker-0.6B-f16.gguf --reranking --pooling rank --embedding --port 8081
curl http://localhost:8081/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "employment termination notice period",
"documents": [
"The Labour Code requires 30 calendar days written notice.",
"Corporate tax rates for small enterprises."
]
}'
Use /v1/rerank, not /v1/embeddings. The embeddings endpoint returns zeros for reranker models.
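The curl call above can also be made from Python. The sketch below uses only the standard library and assumes the response is a JSON object with a `results` list whose items carry `index` and `relevance_score` fields; the function names are illustrative, and the port matches the quick-start command.

```python
import json
import urllib.request


def build_rerank_request(query: str, documents: list) -> bytes:
    """Encode the same JSON body that the curl example sends."""
    return json.dumps({"query": query, "documents": documents}).encode("utf-8")


def rank_documents(response: dict, documents: list) -> list:
    """Pair each score with its document text and sort best-first.

    Assumption: response["results"] is a list of objects with
    "index" and "relevance_score" keys.
    """
    scored = [
        (item["relevance_score"], documents[item["index"]])
        for item in response["results"]
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def rerank(query: str, documents: list, base_url: str = "http://localhost:8081") -> list:
    """POST to /v1/rerank on a running llama-server and return ranked pairs."""
    request = urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=build_rerank_request(query, documents),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return rank_documents(json.load(reply), documents)
```

With the server from the quick start running, `rerank("employment termination notice period", docs)` should return the Labour Code sentence first if the GGUF is healthy.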
What's different about this GGUF?
The official convert_hf_to_gguf.py detects Qwen3-Reranker and does things naive converters skip:
- Extracts cls.output.weight (the yes/no classifier) from lm_head
- Sets pooling_type = RANK metadata
- Bakes in the rerank chat template
- Sets classifier.output_labels = ["yes", "no"]
Without these, llama-server has nothing to compute scores from.
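One way to check a GGUF for these pieces is to read its tensor names and metadata keys with the `gguf` Python package. The sketch below is a rough check under stated assumptions: metadata keys are matched by suffix because the architecture prefix is not spelled out here, and the chat-template test only confirms that some template is present, not that it is the rerank one. `missing_reranker_parts` and `inspect_gguf` are hypothetical helper names.

```python
REQUIRED_TENSOR = "cls.output.weight"

# Assumption: metadata keys carry an architecture prefix, so match by suffix.
REQUIRED_KEY_SUFFIXES = ("pooling_type", "classifier.output_labels")


def missing_reranker_parts(tensor_names, metadata_keys) -> list:
    """Return the reranker-specific items that were not found."""
    missing = []
    if REQUIRED_TENSOR not in tensor_names:
        missing.append(REQUIRED_TENSOR)
    for suffix in REQUIRED_KEY_SUFFIXES:
        if not any(key.endswith(suffix) for key in metadata_keys):
            missing.append(suffix)
    # Only checks that a chat template exists, not which one it is.
    if not any("chat_template" in key for key in metadata_keys):
        missing.append("chat_template")
    return missing


def inspect_gguf(path: str) -> list:
    """Load a GGUF file and report missing reranker parts."""
    from gguf import GGUFReader  # requires: pip install gguf

    reader = GGUFReader(path)
    tensor_names = [tensor.name for tensor in reader.tensors]
    return missing_reranker_parts(tensor_names, list(reader.fields.keys()))
```

An empty list from `inspect_gguf("Qwen3-Reranker-0.6B-f16.gguf")` suggests the file has what llama-server needs; a non-empty list names what to look into.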
models.ini example
[Qwen3-Reranker-0.6B-f16]
model = /path/to/Qwen3-Reranker-0.6B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768
For a full multi-model setup guide (embedding + reranking + chat on one server), see the llama-server Qwen3 guide.
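The models.ini entry above carries the same settings as the quick-start command line. The sketch below makes that correspondence explicit, assuming each INI key mirrors the long llama-server flag of the same name and that `true` marks a value-less switch; `section_to_flags` is an illustrative helper, not part of llama.cpp.

```python
import configparser

MODELS_INI = """
[Qwen3-Reranker-0.6B-f16]
model = /path/to/Qwen3-Reranker-0.6B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768
"""


def section_to_flags(ini_text: str, section: str) -> list:
    """Translate one INI section into a list of command-line arguments."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    flags = []
    for key, value in parser[section].items():
        if value.lower() == "true":
            flags.append(f"--{key}")  # boolean switch, no value
        elif value.lower() == "false":
            continue  # switch left off
        else:
            flags.extend([f"--{key}", value])
    return flags


FLAGS = section_to_flags(MODELS_INI, "Qwen3-Reranker-0.6B-f16")
print("llama-server " + " ".join(FLAGS))
```

Apart from the port and the added context size, the printed command matches the one in the quick start.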
Convert it yourself
pip install huggingface_hub gguf torch safetensors sentencepiece transformers
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Reranker-0.6B', local_dir='Qwen3-Reranker-0.6B-src')"
python convert_hf_to_gguf.py --outtype f16 --outfile Qwen3-Reranker-0.6B-f16.gguf Qwen3-Reranker-0.6B-src/
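Once the F16 file exists, the smaller quants in the comparison table come from llama-quantize. The sketch below only builds the command lines, one per quant type, assuming the usual `llama-quantize <input> <output> <type>` argument order and an output naming pattern of `<model>-<type>.gguf`; both the helper name and the naming pattern are assumptions made for illustration.

```python
from pathlib import Path

# Quant types listed in the comparison table, largest to smallest.
QUANT_TYPES = ["Q8_0", "Q6_K", "Q5_K_M", "Q5_0", "Q4_K_M", "Q4_0", "Q3_K_M", "Q2_K"]


def quantize_commands(f16_path: str, quant_types=QUANT_TYPES, binary: str = "llama-quantize") -> list:
    """Build one argv list per quant type; nothing is executed here."""
    source = Path(f16_path)
    stem = source.name.removesuffix("-f16.gguf")  # Python 3.9+
    commands = []
    for quant in quant_types:
        target = source.with_name(f"{stem}-{quant}.gguf")
        commands.append([binary, str(source), str(target), quant])
    return commands


COMMANDS = quantize_commands("Qwen3-Reranker-0.6B-f16.gguf")

for command in COMMANDS:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the llama-quantize binary is on the PATH.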
License
Apache 2.0 — same as the original model.