# Qwen3.5-9B VLM Q4_K_M GGUF
Ready-to-run Vision Language Model — bundled Q4_K_M quantization + multimodal projector for immediate VLM inference with llama.cpp.
## Overview
This repository provides a complete, self-contained VLM package for Qwen3.5-9B in GGUF format. Unlike text-only GGUF repos, this includes the multimodal projector (mmproj) required for vision capabilities — no need to hunt for separate files.
| File | Description | Size |
|---|---|---|
| `Qwen3.5-9B-Q4_K_M.gguf` | Main model weights (4-bit quantized) | ~6.0 GB |
| `mmproj-F16.gguf` | Vision encoder projection (FP16) | ~918 MB |

**Total VRAM/RAM required:** ~7–8 GB for full VLM inference
## Key Features
- Natively Multimodal — Qwen3.5 uses early fusion architecture where vision and language are jointly trained from the ground up, not bolted on as an afterthought
- Outperforms Qwen3-VL — Cross-generational improvements across visual understanding, reasoning, coding, and agent benchmarks
- 262K Context Length — Native 262,144 token context window, extensible to 1M+ tokens
- 201 Languages — Broad multilingual support including CJK, European, Arabic, Southeast Asian, and more
- Efficient Quantization — Q4_K_M provides an excellent balance of quality and size, retaining strong performance while fitting in ~6 GB
## Architecture
Qwen3.5 represents a paradigm shift from the separate "LLM + Vision Encoder" approach (used by Qwen3-VL, LLaVA, etc.) to a unified early-fusion architecture:
```
Image → Vision Encoder → Multimodal Projector (mmproj) ─┐
                                                        ├→ Unified Transformer → Output
Text → Token Embedding ─────────────────────────────────┘
```
The vision encoder processes images into visual tokens that are projected into the same embedding space as text tokens via the mmproj file, enabling seamless multimodal reasoning within a single transformer.
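The projection step above can be sketched in a few lines of Python. This is a toy illustration, not the real implementation: the dimensions are assumed, and a single linear layer stands in for the actual mmproj (which may be a deeper MLP).

```python
import numpy as np

# Hypothetical dimensions; the real model's sizes differ.
VISION_DIM = 1152  # vision encoder output width (assumed)
EMBED_DIM = 4096   # unified transformer embedding width (assumed)

rng = np.random.default_rng(0)

# The mmproj modeled as one linear map (real projectors may be MLPs).
W_proj = rng.standard_normal((VISION_DIM, EMBED_DIM)) * 0.02

def project_visual_tokens(vision_feats: np.ndarray) -> np.ndarray:
    """Map vision-encoder features into the text embedding space."""
    return vision_feats @ W_proj

# 256 visual tokens from the vision encoder, 8 text token embeddings.
visual_tokens = project_visual_tokens(rng.standard_normal((256, VISION_DIM)))
text_tokens = rng.standard_normal((8, EMBED_DIM))

# Early fusion: concatenate both modalities into one sequence
# that the unified transformer attends over jointly.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (264, 4096)
```

Once projected, visual and text tokens are indistinguishable to the transformer, which is what makes the single-backbone reasoning possible.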
## Quickstart
### Prerequisites
- llama.cpp (built from source or pre-built binaries)
- ~8 GB available RAM/VRAM
### Download
```bash
# Download both files
huggingface-cli download jc-builds/Qwen3.5-9B-VLM-Q4_K_M-GGUF \
  --local-dir Qwen3.5-9B-VLM-Q4_K_M-GGUF
```
### Run with llama.cpp CLI
```bash
# Image understanding
./llama-cli \
  --model Qwen3.5-9B-VLM-Q4_K_M-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --mmproj Qwen3.5-9B-VLM-Q4_K_M-GGUF/mmproj-F16.gguf \
  --image your_image.jpg \
  -p "Describe what you see in this image in detail." \
  --temp 0.7 --top-p 0.9 \
  -n 512

# Text-only mode (omit --mmproj and --image)
./llama-cli \
  --model Qwen3.5-9B-VLM-Q4_K_M-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  -p "Explain quantum entanglement simply." \
  --temp 0.7 \
  -n 512
```
### Run with llama.cpp Server
```bash
./llama-server \
  --model Qwen3.5-9B-VLM-Q4_K_M-GGUF/Qwen3.5-9B-Q4_K_M.gguf \
  --mmproj Qwen3.5-9B-VLM-Q4_K_M-GGUF/mmproj-F16.gguf \
  --port 8080
```
Then send requests with images via the OpenAI-compatible API:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-9b-vlm",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
      }
    ]
  }'
```
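The same request body can be assembled from Python. This is a sketch of the client side only: it builds the base64 data URL and the OpenAI-style message payload (the dummy image bytes and the `build_vision_payload` helper name are illustrative, and the result would be POSTed to the server started above).

```python
import base64
import json

def build_vision_payload(image_bytes: bytes, question: str) -> str:
    """Build an OpenAI-compatible chat payload with an inline base64 image."""
    # Encode raw image bytes as a data URL, as expected by image_url content parts.
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": "qwen3.5-9b-vlm",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }
    return json.dumps(payload)

# Dummy bytes stand in for a real JPEG file read from disk.
body = build_vision_payload(b"\xff\xd8\xff\xe0", "What is in this image?")
# POST `body` to http://localhost:8080/v1/chat/completions
```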
### Use with iOS / On-Device (llama.cpp Swift bindings)
This model is optimized for on-device inference. Load both files in your llama.cpp-based iOS app:
```swift
// Load main model
let modelParams = llama_model_default_params()
let model = llama_model_load_from_file("Qwen3.5-9B-Q4_K_M.gguf", modelParams)

// Load vision projector
let clipModel = clip_model_load("mmproj-F16.gguf", /* verbosity */ 1)
```
## Quantization Details
| Property | Value |
|---|---|
| Method | Q4_K_M (4-bit, mixed precision) |
| Base Model | Qwen/Qwen3.5-9B |
| Parameters | 9.6B (dense) |
| Vision Projector | FP16 (full precision for visual quality) |
| GQA Groups | 4 |
| Context Length | 262,144 tokens |
| Vocabulary | 200K+ tokens |
Q4_K_M uses a mixed quantization strategy where more important layers retain higher precision, providing a strong quality-to-size ratio. The vision projector is kept at FP16 to preserve visual understanding fidelity.
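The core idea of 4-bit block quantization can be sketched as follows. This is a deliberately simplified version: the real Q4_K_M format uses super-blocks with per-block scales and minimums, and the function names here are illustrative, not llama.cpp APIs.

```python
import numpy as np

def quantize_q4(block: np.ndarray):
    """Quantize one block of weights to 4-bit codes plus a single FP scale."""
    # Map the block into the signed 4-bit range [-8, 7] via one shared scale.
    scale = float(np.abs(block).max()) / 7.0
    codes = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_q4(codes: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate weights from codes and scale."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal(32).astype(np.float32)  # one 32-weight block
codes, scale = quantize_q4(weights)
restored = dequantize_q4(codes, scale)

# 32 weights now cost 32 * 4 bits plus one scale, instead of 32 * 32 bits.
max_err = float(np.abs(weights - restored).max())
```

The rounding error per weight is bounded by half the scale, which is why keeping sensitive layers (and the vision projector) at higher precision pays off.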
## Compatibility
| Backend | Text | Vision | Notes |
|---|---|---|---|
| llama.cpp (CLI/Server) | Yes | Yes | Full support with --mmproj flag |
| llama.cpp (Swift/iOS) | Yes | Yes | Load both model + clip files |
| LM Studio | Yes | Partial | Check for mmproj support |
| Ollama | Yes | No | Does not support separate mmproj files yet |
| KoboldCpp | Yes | Yes | Supports --mmproj flag |
## Benchmarks
Qwen3.5-9B delivers exceptional performance across vision, text, and reasoning tasks — often matching or exceeding models many times its size.
> **Note:** Quantization to Q4_K_M typically results in minimal degradation (~1–2%) compared to FP16 for models of this size. All scores below are from the FP16 base model.
### Vision & Multimodal Performance

**Full Vision Benchmark Table**
| Benchmark | Qwen3.5-9B | GPT-5-Nano | Gemini 2.5 Flash | Qwen3-VL-30B |
|---|---|---|---|---|
| **STEM & Puzzle** | | | | |
| MMMU | 78.4 | 75.8 | 73.4 | 76.0 |
| MMMU-Pro | 70.1 | 57.2 | 59.7 | 63.0 |
| MathVision | 78.9 | 62.2 | 52.1 | 65.7 |
| MathVista (mini) | 85.7 | 71.5 | 72.8 | 81.9 |
| We-Math | 75.2 | 62.5 | 32.1 | 70.0 |
| DynaMath | 83.6 | 78.0 | 69.9 | 80.1 |
| VLMs Are Blind | 93.7 | 66.7 | 68.4 | 72.5 |
| **General VQA** | | | | |
| RealWorldQA | 80.3 | 71.8 | 72.2 | 77.4 |
| MMStar | 79.7 | 68.6 | 69.1 | 75.5 |
| MMBench EN v1.1 | 90.1 | 80.3 | 82.7 | 88.9 |
| HallusionBench | 69.3 | 58.4 | 64.5 | 66.0 |
| **OCR & Documents** | | | | |
| OmniDocBench 1.5 | 87.7 | 55.9 | 79.4 | 86.8 |
| CharXiv (RQ) | 73.0 | 50.1 | 56.1 | 56.6 |
| MMLongBench-Doc | 57.7 | 31.8 | 46.5 | 47.4 |
| CC-OCR | 79.3 | 58.9 | 72.9 | 77.8 |
| AI2D | 90.2 | 81.9 | 85.7 | 86.9 |
| OCRBench | 89.2 | 75.3 | 82.5 | 83.9 |
| **Spatial Intelligence** | | | | |
| CountBench | 97.2 | 80.0 | 79.2 | 90.0 |
| EmbSpatialBench | 83.0 | 74.2 | 66.1 | 80.6 |
| RefSpatialBench | 58.5 | 12.6 | 11.2 | 54.2 |
| LingoQA | 80.4 | 57.0 | 17.8 | 62.0 |
| **Video Understanding** | | | | |
| VideoMME (w/ sub) | 84.5 | 71.7 | 74.6 | 79.9 |
| VideoMME (w/o sub) | 78.4 | 66.2 | 72.7 | 73.3 |
| VideoMMMU | 78.9 | 63.0 | 69.2 | 75.0 |
| MLVU | 84.4 | 69.2 | 78.5 | 78.9 |
| MMVU | 67.8 | 63.1 | 65.3 | 66.1 |
| **Visual Agent** | | | | |
| ScreenSpot Pro | 65.2 | — | — | 60.5 |
| OSWorld | 41.8 | — | — | 30.6 |
| AndroidWorld | 57.8 | — | — | 55.0 |
| **Medical VQA** | | | | |
| SLAKE | 79.0 | 57.0 | 65.0 | 68.8 |
| PMC-VQA | 57.9 | 37.8 | 48.8 | 51.5 |
| MedXpertQA-MM | 49.9 | 26.7 | 35.3 | 35.5 |
### Text & Reasoning Performance

**Full Text Benchmark Table**
| Benchmark | Qwen3.5-9B | GPT-OSS-120B | Qwen3-30B-A3B |
|---|---|---|---|
| **Knowledge & STEM** | | | |
| MMLU-Pro | 82.5 | 80.8 | 80.9 |
| MMLU-Redux | 91.1 | 91.0 | 91.4 |
| GPQA Diamond | 81.7 | 80.1 | 73.4 |
| SuperGPQA | 58.2 | 54.6 | 56.8 |
| C-Eval | 88.2 | 76.2 | 87.4 |
| **Instruction Following** | | | |
| IFEval | 91.5 | 88.9 | 88.9 |
| IFBench | 64.5 | 69.0 | 51.5 |
| MultiChallenge | 54.5 | 45.3 | 46.5 |
| **Reasoning & Math** | | | |
| HMMT Feb '25 | 83.2 | 90.0 | 63.1 |
| HMMT Nov '25 | 82.9 | 90.0 | 73.8 |
| LiveCodeBench v6 | 65.6 | 82.7 | 66.0 |
| OJBench | 29.2 | 41.5 | 25.1 |
| **Long Context** | | | |
| LongBench v2 | 55.2 | 48.2 | 44.8 |
| AA-LCR | 63.0 | 50.7 | 49.0 |
| **Agent & Tool Use** | | | |
| BFCL-V4 | 66.1 | — | 42.4 |
| TAU2-Bench | 79.1 | — | 41.9 |
| VITA-Bench | 29.8 | — | 14.1 |
| **Multilingual** | | | |
| MMMLU | 81.2 | 78.2 | 78.4 |
| MMLU-ProX | 76.3 | 74.5 | 69.1 |
| NOVA-63 | 55.9 | 51.1 | 52.5 |
| INCLUDE | 75.6 | 74.0 | 74.4 |
| WMT24++ | 72.6 | 74.4 | 69.3 |
## Credits & Acknowledgments
- Base Model: Qwen Team, Alibaba Cloud
- mmproj Extraction: Derived from unsloth/Qwen3.5-9B-GGUF
- Quantization Format: GGML / llama.cpp
## License
This model is released under the Apache 2.0 License, consistent with the base Qwen3.5-9B model.
## Citation

```bibtex
@article{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2025},
  url={https://qwen.ai/blog?id=qwen3.5}
}
```
Packaged by jc-builds for easy on-device VLM deployment.