Instructions for using kz-transformers/horde-vision with libraries, inference providers, and local apps.
Transformers
How to use kz-transformers/horde-vision with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="kz-transformers/horde-vision")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")
model = AutoModelForImageTextToText.from_pretrained("kz-transformers/horde-vision")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
vLLM
How to use kz-transformers/horde-vision with vLLM:
Install vLLM from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "kz-transformers/horde-vision"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "kz-transformers/horde-vision",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    {
                        "type": "image_url",
                        "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }
                    }
                ]
            }
        ]
    }'
```
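Because the vLLM server exposes an OpenAI-compatible API, you can also call it from Python. Below is a minimal sketch using the openai client, assuming the server is running locally on port 8000 as started above; the api_key value is a dummy placeholder, since vLLM does not check it by default:

```python
# Minimal OpenAI-client sketch against the local vLLM server started above.
# Assumes: `pip install openai`; server on localhost:8000; api_key is a dummy value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="kz-transformers/horde-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same client works against the SGLang server described below; only the base_url port changes (30000 instead of 8000).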
SGLang
How to use kz-transformers/horde-vision with SGLang:
Install SGLang from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "kz-transformers/horde-vision" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "kz-transformers/horde-vision",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    {
                        "type": "image_url",
                        "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }
                    }
                ]
            }
        ]
    }'
```

Use Docker images:
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "kz-transformers/horde-vision" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "kz-transformers/horde-vision",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    {
                        "type": "image_url",
                        "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }
                    }
                ]
            }
        ]
    }'
```

Docker Model Runner
How to use kz-transformers/horde-vision with Docker Model Runner:
```shell
docker model run hf.co/kz-transformers/horde-vision
```
HordeVision: Open-Source Kazakh Vision-Language Model
HordeVision is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.
Model Description
HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at the following tasks (a minimal prompt sketch follows the list):
- Image Captioning: Generating detailed, contextual descriptions in Kazakh
- Visual Question Answering (VQA): Answering diverse questions about image content
- OCR: Extracting and reading Kazakh text from images
- Visual Reasoning: Making inferences about context, causality, and temporal states
- Instruction Following: Executing multi-step visual tasks based on user commands
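To make the task list concrete, here is a minimal sketch that sends a Kazakh OCR-style question through the same image-text-to-text pipeline shown above; the image URL and the prompt wording are illustrative placeholders, not fixed interfaces of the model:

```python
# Minimal sketch: a Kazakh OCR/VQA-style prompt through the pipeline.
# The image URL and prompt text below are illustrative placeholders.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="kz-transformers/horde-vision")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/kazakh-sign.jpg"},  # placeholder image
            {"type": "text", "text": "Суреттегі мәтінді оқып беріңіз."},  # "Please read the text in the image."
        ],
    },
]
print(pipe(text=messages, max_new_tokens=64))
```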
Key Features
- First open-source Kazakh vision-language model
- Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
- Two-stage training: Supervised Fine-Tuning (SFT) + Reinforcement Learning (GRPO)
- Ranks #1 across all evaluation tasks among comparable multilingual models
Model Performance Summary
| Model | Caption | VQA | OCR | Reasoning | Instruction Following | Avg. Rank |
|---|---|---|---|---|---|---|
| horde-vision | 83.5 (↑12.3%) | 68.1 (↑5.3%) | 64.7 (↑2.6%) | 77.4 (↑5.7%) | 70.5 (↑5.9%) | #1 |
| Qolda | 75.2 (↑8.7%) | 61.7 (↑3.0%) | 60.6 (↑2.0%) | 70.3 (↑2.9%) | 62.2 (↑2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (↑0.5%) | 53.6 (↑1.1%) | 59.3 (↑2.1%) | 55.5 (↑0.7%) | 49.5 (↑0.9%) | #3 |
| gemma-3-4b-it | 42.0 (↑0.1%) | 41.8 (↑0.4%) | 50.3 (↑2.3%) | 53.0 (↑0.6%) | 42.5 (↑0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (↑0.0%) | 41.6 (↑0.4%) | 51.0 (↑0.9%) | 44.6 (↑0.3%) | 37.7 (↑0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (↑0.1%) | 38.0 (↑0.3%) | 15.0 (↑0.1%) | 43.4 (↑0.3%) | 36.4 (↑0.3%) | #6 |
| InternVL3-8B | 26.1 (↑0.6%) | 29.0 (↑0.0%) | 29.1 (↑0.3%) | 27.3 (↑0.0%) | 25.7 (↑0.0%) | #7 |
Comparison: Outperforms Google Gemma 3-4B-IT, InternVL3-8B, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct, and ISSAI Qolda across all tasks.
Dataset
The training dataset was collected using a synthetic data generation pipeline:
- Size: 45k training images, 5k validation images
- Categories: 21 main categories, 104 subcategories, ~2,600 keyword phrases
- Coverage: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
- Quality: Filtered with imagededup for deduplication and aesthetic scoring
- Annotation: Labeled using GPT-4.1 with structured prompts for consistent quality
- Split Strategy: Entity-level stratification to ensure models are tested on completely unseen entities (a short sketch of the deduplication and split steps follows this list)
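A minimal sketch of the two data-hygiene steps named above, using the imagededup library for perceptual-hash deduplication and scikit-learn's GroupShuffleSplit for an entity-level split. The directory path, the entity grouping rule, and the 90/10 ratio are assumptions for illustration, not details from the card:

```python
# Sketch of deduplication + entity-level split. Paths, the grouping rule,
# and the 90/10 ratio are illustrative assumptions; the actual pipeline may differ.
from imagededup.methods import PHash
from sklearn.model_selection import GroupShuffleSplit

# 1) Perceptual-hash deduplication with imagededup.
phasher = PHash()
encodings = phasher.encode_images(image_dir="images/")  # assumed directory
to_remove = set(phasher.find_duplicates_to_remove(encoding_map=encodings))
kept = [name for name in encodings if name not in to_remove]

# 2) Entity-level split: all images of one entity land on the same side,
# so validation entities are completely unseen during training.
# `entity_of` is a hypothetical mapping from file name to entity id.
entity_of = {name: name.split("_")[0] for name in kept}  # placeholder grouping
groups = [entity_of[name] for name in kept]

gss = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, val_idx = next(gss.split(kept, groups=groups))
train_files = [kept[i] for i in train_idx]
val_files = [kept[i] for i in val_idx]
```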
Training Details
Supervised Fine-Tuning (SFT)
- Data: 46k images
- LoRA Rank: 128
- Epochs: 1
Reinforcement Learning (GRPO)
- Data: 5k images
- LoRA Rank: 64
- Epochs: 1
- Judge: GPT-4.1-mini with custom Kazakh evaluation prompts (a hypothetical reward-function sketch follows)
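The card names GPT-4.1-mini as the GRPO judge but does not publish the prompts, so the following is only a minimal sketch of what an LLM-judge reward function could look like. The rubric text, the 0-10 scale, and the function signature (the reward_funcs convention used by TRL's GRPOTrainer) are assumptions; only the judge model name comes from the card:

```python
# Hypothetical LLM-judge reward for GRPO. The rubric, scale, and signature
# (TRL's reward_funcs convention) are assumptions; only the judge model name
# (gpt-4.1-mini) comes from the model card.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (  # placeholder rubric, not the authors' prompt
    "You grade answers to Kazakh visual questions. "
    "Rate the answer from 0 (wrong or not Kazakh) to 10 (correct, fluent Kazakh). "
    "Reply with the number only.\n\nQuestion: {q}\nAnswer: {a}"
)

def kazakh_judge_reward(prompts, completions, **kwargs):
    """Return one scalar reward in [0, 1] per completion."""
    rewards = []
    for q, a in zip(prompts, completions):
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=q, a=a)}],
        )
        try:
            score = float(resp.choices[0].message.content.strip())
        except ValueError:
            score = 0.0  # unparseable judge output gets no reward
        rewards.append(max(0.0, min(score, 10.0)) / 10.0)
    return rewards
```

A function with this signature could be passed to a GRPO trainer (for example, TRL's GRPOTrainer via its reward_funcs argument); whether the authors used TRL is not stated in the card.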
How to Use
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
"kz-transformers/horde-vision", dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
# "kz-transformers/horde-vision",
# dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Бұл суретті сипаттаңыз."},
],
}
]
# Preparation for inference
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
Citation
If you use HordeVision, please cite:
Cite as: Pavel Zubitskii, Vitaliy Morozov, Sanzhar Murzakhmetov, et al. HordeVision: An Open-Source Kazakh Vision-Language Model. TechRxiv. January 10, 2026. DOI: 10.36227/techrxiv.176800904.47969417/v1
BibTeX
```bibtex
@article{zubitskii2026hordevision,
  title   = {HordeVision: An Open-Source Kazakh Vision-Language Model},
  author  = {Zubitskii, Pavel and Morozov, Vitaliy and Murzakhmetov, Sanzhar and others},
  journal = {TechRxiv},
  year    = {2026},
  month   = jan,
  doi     = {10.36227/techrxiv.176800904.47969417/v1}
}
```
Model tree for kz-transformers/horde-vision
Base model: Qwen/Qwen3-VL-8B-Instruct