Instructions for using kz-transformers/horde-vision with libraries, inference providers, and local apps.
Transformers
How to use kz-transformers/horde-vision with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="kz-transformers/horde-vision")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")
model = AutoModelForImageTextToText.from_pretrained("kz-transformers/horde-vision")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
vLLM
How to use kz-transformers/horde-vision with vLLM:
Install vLLM from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "kz-transformers/horde-vision"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "kz-transformers/horde-vision",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    {
                        "type": "image_url",
                        "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }
                    }
                ]
            }
        ]
    }'
```
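Because the vLLM server exposes an OpenAI-compatible API, you can also call it from Python. Below is a minimal sketch using the openai client, assuming the server is running locally on port 8000 as started above; the api_key value is a dummy placeholder, since vLLM does not check it by default:

```python
# Minimal OpenAI-client sketch against the local vLLM server started above.
# Assumes: `pip install openai`; server on localhost:8000; api_key is a dummy value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="kz-transformers/horde-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same client works against the SGLang server described below; only the base_url port changes (30000 instead of 8000).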
SGLang
How to use kz-transformers/horde-vision with SGLang:
Install SGLang from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "kz-transformers/horde-vision" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "kz-transformers/horde-vision",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    {
                        "type": "image_url",
                        "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }
                    }
                ]
            }
        ]
    }'
```

Use Docker images:
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "kz-transformers/horde-vision" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "kz-transformers/horde-vision",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Describe this image in one sentence." },
                    {
                        "type": "image_url",
                        "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }
                    }
                ]
            }
        ]
    }'
```

Docker Model Runner
How to use kz-transformers/horde-vision with Docker Model Runner:
```shell
docker model run hf.co/kz-transformers/horde-vision
```
HordeVision: Open-Source Kazakh Vision-Language Model
HordeVision is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.
Model Description
HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at the following tasks (a minimal prompt sketch follows the list):
- Image Captioning: Generating detailed, contextual descriptions in Kazakh
- Visual Question Answering (VQA): Answering diverse questions about image content
- OCR: Extracting and reading Kazakh text from images
- Visual Reasoning: Making inferences about context, causality, and temporal states
- Instruction Following: Executing multi-step visual tasks based on user commands
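To make the task list concrete, here is a minimal sketch that sends a Kazakh OCR-style question through the same image-text-to-text pipeline shown above; the image URL and the prompt wording are illustrative placeholders, not fixed interfaces of the model:

```python
# Minimal sketch: a Kazakh OCR/VQA-style prompt through the pipeline.
# The image URL and prompt text below are illustrative placeholders.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="kz-transformers/horde-vision")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/kazakh-sign.jpg"},  # placeholder image
            {"type": "text", "text": "Суреттегі мәтінді оқып беріңіз."},  # "Please read the text in the image."
        ],
    },
]
print(pipe(text=messages, max_new_tokens=64))
```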
Key Features
- First open-source Kazakh vision-language model
- Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
- Two-stage training: Supervised Fine-Tuning (SFT) + Reinforcement Learning (GRPO)
- Ranks #1 across all evaluation tasks among comparable multilingual models
Model Performance Summary
| Model | Caption | VQA | OCR | Reasoning | Instruction Following | Avg. Rank |
|---|---|---|---|---|---|---|
| horde-vision | 83.5 (↑12.3%) | 68.1 (↑5.3%) | 64.7 (↑2.6%) | 77.4 (↑5.7%) | 70.5 (↑5.9%) | #1 |
| Qolda | 75.2 (↑8.7%) | 61.7 (↑3.0%) | 60.6 (↑2.0%) | 70.3 (↑2.9%) | 62.2 (↑2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (↑0.5%) | 53.6 (↑1.1%) | 59.3 (↑2.1%) | 55.5 (↑0.7%) | 49.5 (↑0.9%) | #3 |
| gemma-3-4b-it | 42.0 (↑0.1%) | 41.8 (↑0.4%) | 50.3 (↑2.3%) | 53.0 (↑0.6%) | 42.5 (↑0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (↑0.0%) | 41.6 (↑0.4%) | 51.0 (↑0.9%) | 44.6 (↑0.3%) | 37.7 (↑0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (↑0.1%) | 38.0 (↑0.3%) | 15.0 (↑0.1%) | 43.4 (↑0.3%) | 36.4 (↑0.3%) | #6 |
| InternVL3-8B | 26.1 (↑0.6%) | 29.0 (↑0.0%) | 29.1 (↑0.3%) | 27.3 (↑0.0%) | 25.7 (↑0.0%) | #7 |
Comparison: Outperforms Google Gemma 3-4B-IT, InternVL3-8B, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct, and ISSAI Qolda across all tasks.
Dataset
The training dataset was collected using a synthetic data generation pipeline:
- Size: 45k training images, 5k validation images
- Categories: 21 main categories, 104 subcategories, ~2,600 keyword phrases
- Coverage: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
- Quality: Filtered with imagededup for deduplication and aesthetic scoring
- Annotation: Labeled using GPT-4.1 with structured prompts for consistent quality
- Split Strategy: Entity-level stratification to ensure models are tested on completely unseen entities (a short sketch of the deduplication and split steps follows this list)
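A minimal sketch of the two data-hygiene steps named above, using the imagededup library for perceptual-hash deduplication and scikit-learn's GroupShuffleSplit for an entity-level split. The directory path, the entity grouping rule, and the 90/10 ratio are assumptions for illustration, not details from the card:

```python
# Sketch of deduplication + entity-level split. Paths, the grouping rule,
# and the 90/10 ratio are illustrative assumptions; the actual pipeline may differ.
from imagededup.methods import PHash
from sklearn.model_selection import GroupShuffleSplit

# 1) Perceptual-hash deduplication with imagededup.
phasher = PHash()
encodings = phasher.encode_images(image_dir="images/")  # assumed directory
to_remove = set(phasher.find_duplicates_to_remove(encoding_map=encodings))
kept = [name for name in encodings if name not in to_remove]

# 2) Entity-level split: all images of one entity land on the same side,
# so validation entities are completely unseen during training.
# `entity_of` is a hypothetical mapping from file name to entity id.
entity_of = {name: name.split("_")[0] for name in kept}  # placeholder grouping
groups = [entity_of[name] for name in kept]

gss = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, val_idx = next(gss.split(kept, groups=groups))
train_files = [kept[i] for i in train_idx]
val_files = [kept[i] for i in val_idx]
```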
Training Details
Supervised Fine-Tuning (SFT)
- Data: 46k images
- LoRA Rank: 128
- Epochs: 1
Reinforcement Learning (GRPO)
- Data: 5k images
- LoRA Rank: 64
- Epochs: 1
- Judge: GPT-4.1-mini with custom Kazakh evaluation prompts (a hypothetical reward-function sketch follows)
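The card names GPT-4.1-mini as the GRPO judge but does not publish the prompts, so the following is only a minimal sketch of what an LLM-judge reward function could look like. The rubric text, the 0-10 scale, and the function signature (the reward_funcs convention used by TRL's GRPOTrainer) are assumptions; only the judge model name comes from the card:

```python
# Hypothetical LLM-judge reward for GRPO. The rubric, scale, and signature
# (TRL's reward_funcs convention) are assumptions; only the judge model name
# (gpt-4.1-mini) comes from the model card.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (  # placeholder rubric, not the authors' prompt
    "You grade answers to Kazakh visual questions. "
    "Rate the answer from 0 (wrong or not Kazakh) to 10 (correct, fluent Kazakh). "
    "Reply with the number only.\n\nQuestion: {q}\nAnswer: {a}"
)

def kazakh_judge_reward(prompts, completions, **kwargs):
    """Return one scalar reward in [0, 1] per completion."""
    rewards = []
    for q, a in zip(prompts, completions):
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=q, a=a)}],
        )
        try:
            score = float(resp.choices[0].message.content.strip())
        except ValueError:
            score = 0.0  # unparseable judge output gets no reward
        rewards.append(max(0.0, min(score, 10.0)) / 10.0)
    return rewards
```

A function with this signature could be passed to a GRPO trainer (for example, TRL's GRPOTrainer via its reward_funcs argument); whether the authors used TRL is not stated in the card.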
How to Use
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
"kz-transformers/horde-vision", dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
# "kz-transformers/horde-vision",
# dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Бұл суретті сипаттаңыз."},
],
}
]
# Preparation for inference
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
Citation
If you use HordeVision, please cite:
Cite as: Pavel Zubitskii, Vitaliy Morozov, Sanzhar Murzakhmetov, et al. HordeVision: An Open-Source Kazakh Vision-Language Model. TechRxiv. January 10, 2026. DOI: 10.36227/techrxiv.176800904.47969417/v1
BibTeX
```bibtex
@article{zubitskii2026hordevision,
  title   = {HordeVision: An Open-Source Kazakh Vision-Language Model},
  author  = {Zubitskii, Pavel and Morozov, Vitaliy and Murzakhmetov, Sanzhar and others},
  journal = {TechRxiv},
  year    = {2026},
  month   = jan,
  doi     = {10.36227/techrxiv.176800904.47969417/v1}
}
```
Model tree for kz-transformers/horde-vision
Base model: Qwen/Qwen3-VL-8B-Instruct