---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- quantized
- fp8
- 8bit
- vision-language-model
- qwen3-vl
- compressed-tensors
- vllm
- smoothquant
library_name: transformers
pipeline_tag: image-to-text
---

# Qwen3-VL-4B-Instruct-FP8-Dynamic

This is a quantized version of [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct). It applies **SmoothQuant + FP8_DYNAMIC** to all linear layers of the text model; the vision encoder and LM head are left unquantized.

## Quantization Strategy

| Component | Scheme | Details |
|-----------|--------|---------|
| **All Text Linear Layers** | FP8_DYNAMIC | W8A8: 8-bit weights and activations, with activation scales computed dynamically at inference time |
| **Vision Encoder** | BF16 (unquantized) | Kept in BF16 to preserve visual understanding |
| **LM Head** | BF16 (unquantized) | Kept in BF16 to preserve output quality |

### SmoothQuant

SmoothQuant is applied with strength 0.8 to smooth activations before quantization, using these mappings (projections ← preceding norm layer):

- Q/K/V projections ← input_layernorm
- Gate/Up projections ← post_attention_layernorm

## Model Details

- **Base Model**: Qwen/Qwen3-VL-4B-Instruct (4.4B parameters)
- **Quantization Method**: compressed-tensors (llm-compressor)
- **Model Size**: ~5.6 GB (reduced from ~8.9 GB in BF16)
- **Calibration**: 512 samples from flickr30k, max_seq_length=2048

## Usage with vLLM

```bash
vllm serve JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192
```

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; the KV cache is also stored in FP8 to save memory.
llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]

# generate() returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

## Usage with Transformers

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
import torch

model_id = "JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic"

# The processor handles both text tokenization and image preprocessing.
processor = AutoProcessor.from_pretrained(model_id)

# Unquantized modules (vision encoder, LM head) are loaded in BF16.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

## License

Apache 2.0, same as the base model.
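
## SmoothQuant Scaling Sketch

To make the SmoothQuant step above more concrete, here is a minimal pure-Python sketch of the per-channel smoothing rule from the SmoothQuant method, `s_j = max|X_j|^α / max|W_j|^(1-α)`, with α set to the strength of 0.8 stated in this card. It is an illustration only, not the llm-compressor implementation used to produce this checkpoint. The helper names (`smoothquant_scales`, `apply_smoothing`, `linear`) and the toy values are hypothetical. In practice the division by `s` is typically folded into the preceding norm layer rather than applied to activations at runtime.

```python
# Illustrative sketch only: NOT the llm-compressor implementation.
ALPHA = 0.8  # smoothing strength stated in this model card


def smoothquant_scales(act_absmax, weight_absmax, alpha=ALPHA, eps=1e-5):
    """Per-input-channel scales: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


def apply_smoothing(x_row, weight, scales):
    """Divide activations by s and multiply the matching weight columns by s."""
    x_smoothed = [x / s for x, s in zip(x_row, scales)]
    # weight is laid out as [out_features][in_features]; scale each input channel.
    w_smoothed = [[w * s for w, s in zip(row, scales)] for row in weight]
    return x_smoothed, w_smoothed


def linear(x_row, weight):
    """Plain matrix-vector product standing in for a linear layer."""
    return [sum(x * w for x, w in zip(x_row, row)) for row in weight]


# Hypothetical toy values: input channel 1 carries an activation outlier.
x = [0.5, 40.0, -1.0]
W = [[0.2, -0.1, 0.3], [-0.4, 0.05, 0.1]]

act_absmax = [abs(v) for v in x]
weight_absmax = [max(abs(row[j]) for row in W) for j in range(len(x))]

s = smoothquant_scales(act_absmax, weight_absmax)
x_s, W_s = apply_smoothing(x, W, s)

print("original output:", linear(x, W))
print("smoothed output:", linear(x_s, W_s))
print("activation abs-max before/after:",
      max(abs(v) for v in x), max(abs(v) for v in x_s))
```

The layer output is mathematically unchanged, while the activation outlier shrinks and part of the dynamic range moves into the weights. A higher α shifts more of the quantization difficulty from activations to weights.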