---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- quantized
- fp8
- 8bit
- vision-language-model
- qwen3-vl
- compressed-tensors
- vllm
- smoothquant
library_name: transformers
pipeline_tag: image-to-text
---

# Qwen3-VL-4B-Instruct-FP8-Dynamic

This is a quantized version of [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) using **SmoothQuant + FP8_DYNAMIC** across all text linear layers.

## Quantization Strategy

| Component | Scheme | Details |
|-----------|--------|---------|
| **All Text Linear Layers** | FP8_DYNAMIC | W8A8 dynamic quantization |
| **Vision Encoder** | BF16 (unquantized) | Full precision for visual understanding |
| **LM Head** | BF16 (unquantized) | Full precision for output quality |

### SmoothQuant

Applied with strength 0.8 for activation smoothing before quantization:
- Q/K/V projections ← input_layernorm
- Gate/Up projections ← post_attention_layernorm

## Model Details

- **Base Model**: Qwen/Qwen3-VL-4B-Instruct (4.4B parameters)
- **Quantization Method**: compressed-tensors (llm-compressor)
- **Model Size**: ~5.6 GB (reduced from ~8.9 GB BF16)
- **Calibration**: 512 samples from flickr30k, max_seq_length=2048

## Usage with vLLM

```bash
vllm serve JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
```

## Usage with Transformers

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
import torch

model_id = "JEILDLWLRMA/Qwen3-VL-4B-Instruct-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

## License

Apache 2.0, same as the base model.