Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GPTQ-int4

This is a GPTQ INT4 quantized version of Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.

Please refer to the original model card for details on the model architecture, training data, and capabilities.

Note: While the original fine-tuning focused on text-only reasoning tasks, this model inherits multimodal capabilities from the base Qwen3.5-35B-A3B. The vision encoder is preserved and functional for image understanding tasks.

Model Architecture

This is a Mixture-of-Experts (MoE) model with:

  • Total Parameters: 35B
  • Active Parameters: ~3B per token
  • Experts: 256 total, 8 active per token
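
The expert counts above (256 total, 8 active per token) can be illustrated with a minimal top-k routing sketch. The counts match the model card; the hidden size, gating, and softmax details are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

NUM_EXPERTS = 256   # total experts (from the model card)
TOP_K = 8           # experts active per token (from the model card)
HIDDEN = 64         # illustrative hidden size, not the real one

def moe_forward(x, gate_w, expert_ws):
    """Route one token through its top-k experts and mix their outputs."""
    logits = x @ gate_w                         # one routing score per expert
    top = np.argsort(logits)[-TOP_K:]           # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                    # softmax over the selected experts only
    # Only the selected experts run, so roughly TOP_K/NUM_EXPERTS of the
    # FFN parameters are active for this token (~3B of 35B here).
    return sum(w * (x @ expert_ws[e]) for w, e in zip(weights, top))

rng = np.random.default_rng(0)
x = rng.standard_normal(HIDDEN)
gate_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))
expert_ws = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN))
y = moe_forward(x, gate_w, expert_ws)
print(y.shape)  # (64,)
```

This sparsity is why the model has 35B total parameters but only ~3B active per token.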

Quantization Details

  • Method: GPTQ (4-bit INT4, W4A16)
  • Group Size: 128
  • Calibration: 1024 samples from the C4 dataset (~2048 tokens per sample)
  • Vision Encoder: Preserved (not quantized)
  • MTP Module: Preserved (not quantized)
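
As a rough illustration of what group-size-128, 4-bit weight quantization does, here is a minimal symmetric round-trip sketch. GPTQ itself additionally minimizes layer output error using calibration data, which this toy version omits:

```python
import numpy as np

GROUP_SIZE = 128  # matches the group size above

def quantize_groups(w):
    """Symmetric 4-bit quantization with one FP scale per group of 128 weights."""
    groups = w.reshape(-1, GROUP_SIZE)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # map max |w| to int4 level 7
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_groups(q, scale):
    """Recover approximate FP weights from int4 codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize_groups(w)
w_hat = dequantize_groups(q, scale)
err = np.abs(w - w_hat).max()
print(f"max abs error: {err:.4f}")  # bounded by half a quantization step per group
```

Smaller groups mean finer-grained scales (lower error) at the cost of more scale storage; 128 is a common middle ground.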

Usage with vLLM

Text-only

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="codgician/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GPTQ-int4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=2048)
prompt = "Explain the difference between TCP and UDP protocols."
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

With Image (Multimodal)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="codgician/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GPTQ-int4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```

Hardware Requirements

| Precision | VRAM (approx.) |
|-----------|----------------|
| INT4 GPTQ | ~22 GB         |
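
A back-of-the-envelope check of that figure: 35B weights at 4 bits plus per-group scales come to roughly 18 GB, with the remainder going to the KV cache, activations, and the unquantized vision/MTP components. The scale-overhead estimate below is an assumption (one FP16 scale per group of 128), not a measured breakdown:

```python
params = 35e9              # total parameters
bits_per_weight = 4        # INT4
scale_overhead = 16 / 128  # assumed: one FP16 scale per group of 128 weights

weight_gb = params * (bits_per_weight + scale_overhead) / 8 / 1e9
print(f"weights alone: ~{weight_gb:.1f} GB")  # ~18.0 GB before cache and activations
```

This fits on a single 24 GB GPU with little headroom; lowering `max_model_len` or `gpu_memory_utilization` helps if you hit out-of-memory errors.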

Acknowledgements

Thanks to Jackrong for the original fine-tuned model, Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled, and to the Qwen team for the base Qwen3.5-35B-A3B.

License

Apache 2.0 (inherited from the original model)
