Solari: Hallucination-Reduced Vision Language Model
Solari is a 500M-parameter vision-language model fine-tuned to reduce hallucination on real-world images. Built on SmolVLM2-500M-Video-Instruct, it is trained with QLoRA and Direct Preference Optimization (DPO) on the RLAIF-V dataset to align the model toward more faithful visual descriptions.
Model Details
Model Description
Solari targets hallucination reduction in vision-language tasks, with a focus on improving reliability for accessibility applications (e.g., assisting visually impaired users). The model was trained with parameter-efficient fine-tuning (QLoRA) and DPO to prefer accurate image descriptions over hallucinated ones. It scores higher than the base model on the hallucination benchmarks reported under Results while preserving most general VLM capabilities; MME is the exception.
- Developed by: Cubex11
- Model type: Vision-Language Model (Image-Text-to-Text)
- Language(s): English
- License: Apache-2.0
- Finetuned from: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
Model Sources
- Base Model: SmolVLM2-500M-Video-Instruct
- Training Dataset: RLAIF-V (Formatted) — 72K AI-generated preference pairs for hallucination reduction
Uses
Direct Use
Solari can be used for image understanding tasks where factual accuracy is critical:
- Describing real-world scenes for visually impaired users
- Visual question answering with reduced hallucination
- Image captioning with improved object recognition reliability
Out-of-Scope Use
- Tasks requiring strong mathematical reasoning or code understanding (degraded from base model)
- Non-English language tasks
- Medical or safety-critical applications without additional validation
How to Get Started with the Model
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import requests  # only needed if you fetch the image from a URL

model_id = "Cubex11/Solari"

# Load the model in bf16 and let device_map place it on the available device
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image (replace with your own image path or URL)
image = Image.open("your_image.jpg").convert("RGB")

# Create prompt: one image placeholder followed by the text instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=[[image]], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so only the newly generated answer is decoded
trimmed = output[0][len(inputs.input_ids[0]):]
print(processor.decode(trimmed, skip_special_tokens=True))
Training Details
Training Data
RLAIF-V (Formatted) — a large-scale multimodal preference dataset containing ~72K preference pairs. Each sample includes an image, a prompt, a chosen response (more accurate), and a rejected response (more hallucinated). Preferences are generated by open-source AI models following the RLAIF-V methodology.
Training Procedure
Method: QLoRA + Direct Preference Optimization (DPO)
The base model was quantized to 4-bit (NF4) and fine-tuned using Low-Rank Adaptation (LoRA) with DPO to learn preferences between accurate and hallucinated responses.
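As a rough illustration of what "learning preferences" means here, the sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. It is a minimal, dependency-free sketch of the published objective, not the training code used for Solari; the function name `dpo_loss` and the example log-probabilities are hypothetical, and only `beta = 0.1` is taken from this card.

```python
import math

BETA = 0.1  # DPO beta reported in this card's hyperparameter table


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=BETA):
    """Standard DPO loss for one preference pair (hypothetical helper)."""
    # Implicit reward of each response: how much more likely the policy
    # makes it than the frozen reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(logit)), written as a numerically stable softplus(-logit)
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# Hypothetical log-probabilities, chosen only to illustrate the behavior.
untrained = dpo_loss(-11.0, -11.0, -11.0, -11.0)  # policy equals reference
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)   # policy favors the chosen answer
print(f"untrained={untrained:.4f} improved={improved:.4f}")
```

The loss falls as the policy raises the likelihood of the chosen (more accurate) response relative to the rejected (more hallucinated) one, measured against the reference model.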
Training Hyperparameters
| Parameter | Value |
|---|---|
| Training regime | bf16 mixed precision |
| Quantization | 4-bit NF4 (double quantization) |
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| DoRA | Enabled |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | ~1.9% of total |
| Learning rate | 5e-5 |
| DPO beta | 0.1 |
| Batch size | 8 (per device) |
| Gradient accumulation | 4 (effective batch = 32) |
| Epochs | 2 (best checkpoint at ~1 epoch / step 2500) |
| Warmup ratio | 0.1 |
| Optimizer | AdamW |
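The batch-size and checkpoint figures in the table can be cross-checked with a little arithmetic. The sketch below assumes a single GPU (as the hardware section suggests) and that all ~72K preference pairs were used for training; the card does not state the size of any validation split, so the steps-per-epoch figure is approximate.

```python
import math

# Values taken from the hyperparameter table
per_device_batch = 8
grad_accum_steps = 4
best_step = 2500

# Assumptions (not stated in the card): one device, all ~72K pairs used for training
num_devices = 1
num_train_pairs = 72_000

effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(num_train_pairs / effective_batch)
epochs_at_best = best_step / steps_per_epoch

print(f"effective batch: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"epochs completed at step {best_step}: {epochs_at_best:.2f}")
```

Under these assumptions, step 2500 falls slightly past the end of the first epoch, which is consistent with the "~1 epoch" note in the table.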
Speeds, Sizes, Times
- Training time: ~9 hours on NVIDIA L4 (24GB)
- Best checkpoint: Step 2500 (selected by lowest validation loss)
- Model size: ~1 GB (bf16 safetensors)
Evaluation
Testing Data, Factors & Metrics
Evaluated using VLMEvalKit on 8 standard benchmarks covering hallucination, general VLM capability, and real-world understanding.
Metrics
- POPE: F1 score across random/popular/adversarial splits (object hallucination)
- AMBER: Attribute, Existence, Relation accuracy (multi-dimensional hallucination)
- HallusionBench: aAcc, fAcc, qAcc (hallucination detection)
- A-OKVQA: Accuracy on outside-knowledge VQA
- MME: Perception and Reasoning scores
- MMStar: Multi-modal reasoning accuracy
- MMBench: General multi-modal understanding
- RealWorldQA: Real-world image understanding accuracy
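POPE poses yes/no questions about whether an object is present in an image, so its precision, recall, and F1 can be computed by treating "yes" as the positive class. The sketch below shows that computation on toy data; the function name `yes_no_metrics` and the example answers are hypothetical, and VLMEvalKit's own answer parsing and aggregation may differ.

```python
def yes_no_metrics(predictions, labels):
    """Precision, recall, and F1 with "yes" (object present) as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: 4 questions where the object is present, 4 where it is absent.
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
preds = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]
metrics = yes_no_metrics(preds, labels)
print(metrics)
```

Recall measures how many truly present objects the model confirms, while precision penalizes saying "yes" to absent objects, which is the hallucination case.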
Results
| Benchmark | Metric | Base Model | Solari | Change |
|---|---|---|---|---|
| POPE | Overall | 82.67 | 85.08 | +2.41 |
| POPE | Recall | 76.73 | 85.33 | +8.60 |
| AMBER | Avg ACC | 79.38 | 79.77 | +0.39 |
| AMBER | Relation | 72.36 | 75.42 | +3.06 |
| HallusionBench | Overall | 27.58 | 28.14 | +0.56 |
| A-OKVQA | Overall | 68.12 | 69.00 | +0.88 |
| MMStar | Overall | 38.33 | 39.60 | +1.27 |
| MMBench | Test | 53.14 | 53.42 | +0.28 |
| RealWorldQA | Overall | 49.80 | 50.59 | +0.78 |
| MME | Perception | 1216.19 | 1118.51 | -97.68 |
| MME | Reasoning | 237.50 | 211.79 | -25.71 |
Summary
Solari improves on 7 out of 8 benchmarks compared to the base model:
- POPE recall +8.60 points (76.73 to 85.33): the model more often recognizes objects that are actually present in images
- All hallucination benchmarks improved — POPE, AMBER, and HallusionBench
- General capabilities preserved or improved — A-OKVQA, MMStar, MMBench, RealWorldQA all show gains
- Trade-off on MME: perception dropped ~98 points and reasoning ~26 points. The largest subtask drops were counting (-26.7), position (-26.7), and code reasoning (-27.5), attributed to the model becoming more conservative
Bias, Risks, and Limitations
- Counting and spatial reasoning degraded: The DPO alignment made the model more conservative, reducing performance on fine-grained counting and positional reasoning tasks (reflected in MME scores).
- Small model capacity: At 500M parameters, the model has inherent limitations on complex reasoning tasks.
- English only: The model was trained and evaluated only on English-language tasks.
- Training data bias: RLAIF-V preferences are AI-generated, which may introduce systematic biases.
Recommendations
- Best suited for binary object recognition tasks ("Is there an X?") and general scene description
- For tasks requiring precise counting or spatial reasoning, consider using the base model or a larger VLM
- Always validate outputs in safety-critical applications
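For the binary object recognition use case, a prompt can be built in the same chat-message format used in the getting-started code. The helper below is a hypothetical sketch: the function name `existence_question` and the exact question wording are assumptions, not a prompt format the model was evaluated with.

```python
def existence_question(object_name):
    """Build a chat message asking a yes/no question about one object (hypothetical helper)."""
    # Simple article choice based on the first letter; good enough for common nouns.
    starts_with_vowel = object_name.lower().startswith(("a", "e", "i", "o", "u"))
    article = "an" if starts_with_vowel else "a"
    question = f"Is there {article} {object_name} in this image? Answer yes or no."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = existence_question("umbrella")
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template(messages, add_generation_prompt=True)` in place of the description prompt, with a small `max_new_tokens` since only a short answer is expected.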
Environmental Impact
- Hardware Type: NVIDIA L4 (24GB)
- Hours used: ~9 hours
- Cloud Provider: Lightning AI
- Compute Region: US
Technical Specifications
Model Architecture and Objective
- Architecture: SmolVLM2 (ViT vision encoder + LLM decoder with multi-modal projector)
- Parameters: ~500M total
- Objective: Direct Preference Optimization (DPO) — learns to prefer accurate descriptions over hallucinated ones
Compute Infrastructure
Hardware
NVIDIA L4 GPU (24GB VRAM) on Lightning AI
Software
- Transformers
- TRL (DPO Trainer)
- PEFT (QLoRA)
- BitsAndBytes (4-bit quantization)
Citation
BibTeX:
@misc{solari2026,
title={Solari: Hallucination-Reduced Vision Language Model via QLoRA DPO on RLAIF-V},
author={Cubex11},
year={2026},
url={https://huggingface.co/Cubex11/Solari}
}
Acknowledgments
- Hugging Face for SmolVLM2 and the RLAIF-V formatted dataset
- OpenBMB for the RLAIF-V and RLHF-V research
- Lightning AI for compute resources
- OpenCompass for the VLMEvalKit evaluation toolkit