Instructions to use Cubex11/Solari with libraries and local apps.

Libraries

How to use Cubex11/Solari with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Cubex11/Solari")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Cubex11/Solari")
model = AutoModelForImageTextToText.from_pretrained("Cubex11/Solari")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Cubex11/Solari with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Cubex11/Solari"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cubex11/Solari",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
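
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000, and the helper names `build_payload` and `describe_image` are illustrative, not part of vLLM or any client library.

```python
import json
import urllib.request

# Assumption: the server from the command above is reachable at this address.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "Cubex11/Solari"


def build_payload(prompt: str, image_url: str, max_tokens: int = 128) -> dict:
    """Build a chat-completions body with one text part and one image part."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def describe_image(prompt: str, image_url: str, base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the first reply's text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, image_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `describe_image("Describe this image in one sentence.", "<image URL>")` performs the request. For an SGLang server on port 30000, pass `base_url="http://localhost:30000/v1"`.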

Use Docker

docker model run hf.co/Cubex11/Solari

SGLang

How to use Cubex11/Solari with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Cubex11/Solari" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cubex11/Solari",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Cubex11/Solari" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cubex11/Solari",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Cubex11/Solari with Docker Model Runner:
```
docker model run hf.co/Cubex11/Solari
```

Solari: Hallucination-Reduced Vision Language Model

Solari is a 500M-parameter vision-language model fine-tuned to hallucinate less on real-world images. Built on SmolVLM2-500M-Video-Instruct, it is trained with QLoRA and Direct Preference Optimization (DPO) on the RLAIF-V dataset to align the model toward more faithful visual descriptions.

Model Details

Model Description

Solari targets hallucination reduction in vision-language tasks, with a focus on improving reliability for accessibility applications (e.g., assisting visually impaired users). The model was trained with parameter-efficient fine-tuning (QLoRA) and DPO to prefer accurate image descriptions over hallucinated ones. It scores higher than the base model on hallucination benchmarks while largely preserving general VLM capabilities.

Developed by: Cubex11
Model type: Vision-Language Model (Image-Text-to-Text)
Language(s): English
License: Apache-2.0
Finetuned from: HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Model Sources

Base Model: SmolVLM2-500M-Video-Instruct
Training Dataset: RLAIF-V (Formatted), ~72K AI-generated preference pairs for hallucination reduction

Uses

Direct Use

Solari can be used for image understanding tasks where factual accuracy is critical:

Describing real-world scenes for visually impaired users
Visual question answering with reduced hallucination
Image captioning with improved object recognition reliability

Out-of-Scope Use

Tasks requiring strong mathematical reasoning or code understanding (performance on these degraded relative to the base model)
Non-English language tasks
Medical or safety-critical applications without additional validation

How to Get Started with the Model

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import requests

model_id = "Cubex11/Solari"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image (replace with your own image path or URL)
image = Image.open("your_image.jpg").convert("RGB")

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=[[image]], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
trimmed = output[0][len(inputs.input_ids[0]):]
print(processor.decode(trimmed, skip_special_tokens=True))

Training Details

Training Data

RLAIF-V (Formatted) — a large-scale multimodal preference dataset containing ~72K preference pairs. Each sample includes an image, a prompt, a chosen response (more accurate), and a rejected response (more hallucinated). Preferences are generated by open-source AI models following the RLAIF-V methodology.

Training Procedure

Method: QLoRA + Direct Preference Optimization (DPO)

The base model was quantized to 4-bit (NF4) and fine-tuned using Low-Rank Adaptation (LoRA) with DPO to learn preferences between accurate and hallucinated responses.
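
To make the low-rank idea concrete: LoRA keeps a weight matrix of shape d_out x d_in frozen and learns two small matrices of rank r whose product is scaled by alpha / r. The sketch below counts the extra parameters for one layer. The 960 x 960 layer shape is a hypothetical example, not a figure from this card, and the sketch covers plain LoRA only (DoRA, which this card reports as enabled, adds a learned magnitude vector on top).

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before adding it to W."""
    return alpha / r


# r and alpha are the values reported in this card; the layer shape is hypothetical.
R, ALPHA = 16, 16
D_IN, D_OUT = 960, 960

added = lora_added_params(D_IN, D_OUT, R)
full = D_IN * D_OUT
print(f"LoRA adds {added} params to a layer with {full} frozen params")
print(f"update scaling alpha / r = {lora_scaling(ALPHA, R)}")
```

With alpha equal to r, the scaling factor is 1, so the low-rank update is added to the frozen weight unscaled.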

Training Hyperparameters

Parameter	Value
Training regime	bf16 mixed precision
Quantization	4-bit NF4 (double quantization)
LoRA rank (r)	16
LoRA alpha	16
LoRA dropout	0.1
DoRA	Enabled
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable params	~1.9% of total
Learning rate	5e-5
DPO beta	0.1
Batch size	8 (per device)
Gradient accumulation	4 (effective batch = 32)
Epochs	2 (best checkpoint at ~1 epoch / step 2500)
Warmup ratio	0.1
Optimizer	AdamW
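
The batch-size and checkpoint figures in the table can be cross-checked with a few lines of arithmetic. This sketch assumes a single GPU and that all ~72K pairs were used for training; the card does not state the exact size of the training split, so the step counts are approximate.

```python
import math

PER_DEVICE_BATCH = 8      # from the table
GRAD_ACCUM_STEPS = 4      # from the table
NUM_DEVICES = 1           # assumption: one NVIDIA L4
APPROX_TRAIN_PAIRS = 72_000  # assumption: every pair is used for training
BEST_CHECKPOINT_STEP = 2500  # from the table

# Samples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES

# Optimizer steps needed to see every pair once.
steps_per_epoch = math.ceil(APPROX_TRAIN_PAIRS / effective_batch)

# Position of the best checkpoint, measured in epochs.
epochs_at_best = BEST_CHECKPOINT_STEP / steps_per_epoch

print(f"effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"best checkpoint at about {epochs_at_best:.2f} epochs")
```

Under these assumptions, step 2500 falls slightly past the end of the first epoch, which is consistent with the "~1 epoch" note in the table.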

Speeds, Sizes, Times

Training time: ~9 hours on NVIDIA L4 (24GB)
Best checkpoint: Step 2500 (selected by lowest validation loss)
Model size: ~1 GB (bf16 safetensors)
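
The ~1 GB figure follows from the parameter count and the storage type. A rough sketch, assuming exactly 500M parameters and ignoring file metadata and any quantization constants:

```python
def weight_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 500e6  # assumption: the card's "~500M" taken as exactly 500M

print(f"bf16 weights:  {weight_size_gb(NUM_PARAMS, 16):.2f} GB")
print(f"4-bit weights: {weight_size_gb(NUM_PARAMS, 4):.2f} GB")
```

The 4-bit line is only an estimate of the footprint of the quantized base weights during QLoRA training; the released checkpoint is stored in bf16.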

Evaluation

Testing Data, Factors & Metrics

Evaluated using VLMEvalKit on 8 standard benchmarks covering hallucination, general VLM capability, and real-world understanding.

Metrics

POPE: F1 score across random/popular/adversarial splits (object hallucination)
AMBER: Attribute, Existence, Relation accuracy (multi-dimensional hallucination)
HallusionBench: aAcc, fAcc, qAcc (hallucination detection)
A-OKVQA: Accuracy on outside-knowledge VQA
MME: Perception and Reasoning scores
MMStar: Multi-modal reasoning accuracy
MMBench: General multi-modal understanding
RealWorldQA: Real-world image understanding accuracy
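
For the yes/no style of POPE, the reported numbers are standard binary classification metrics. The sketch below shows how accuracy, precision, recall, and F1 relate, using a small made-up set of labels; it is an illustration of the definitions, not VLMEvalKit's implementation.

```python
def binary_metrics(labels, predictions):
    """Compute accuracy, precision, recall, and F1, treating 'yes' (True) as positive."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # object present, model says yes
    fp = sum(1 for y, p in pairs if not y and p)      # object absent, model says yes
    fn = sum(1 for y, p in pairs if y and not p)      # object present, model says no
    tn = sum(1 for y, p in pairs if not y and not p)  # object absent, model says no

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: True means the object is (or is claimed to be) in the image.
labels = [True, True, True, False, False]
predictions = [True, True, False, False, True]
print(binary_metrics(labels, predictions))
```

A false positive here is a hallucinated object, while a false negative is a missed one, so precision and recall capture the two failure modes separately.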

Results

Benchmark	Metric	Base Model	Solari	Change
POPE	Overall	82.67	85.08	+2.41
POPE	Recall	76.73	85.33	+8.60
AMBER	Avg ACC	79.38	79.77	+0.39
AMBER	Relation	72.36	75.42	+3.06
HallusionBench	Overall	27.58	28.14	+0.56
A-OKVQA	Overall	68.12	69.00	+0.88
MMStar	Overall	38.33	39.60	+1.27
MMBench	Test	53.14	53.42	+0.28
RealWorldQA	Overall	49.80	50.59	+0.78
MME	Perception	1216.19	1118.51	-97.68
MME	Reasoning	237.50	211.79	-25.71
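
As a quick cross-check of the table, the sketch below recomputes the deltas and counts how many benchmarks improved. It uses one headline row per benchmark; picking those rows (for example, MME Perception for MME) is a choice made for this illustration.

```python
# (benchmark, headline metric, base score, Solari score), copied from the table above.
RESULTS = [
    ("POPE", "Overall", 82.67, 85.08),
    ("AMBER", "Avg ACC", 79.38, 79.77),
    ("HallusionBench", "Overall", 27.58, 28.14),
    ("A-OKVQA", "Overall", 68.12, 69.00),
    ("MMStar", "Overall", 38.33, 39.60),
    ("MMBench", "Test", 53.14, 53.42),
    ("RealWorldQA", "Overall", 49.80, 50.59),
    ("MME", "Perception", 1216.19, 1118.51),
]


def summarize(results):
    """Split benchmark names into those where the fine-tuned score is higher and the rest."""
    improved, declined = [], []
    for name, _metric, base, tuned in results:
        (improved if tuned > base else declined).append(name)
    return improved, declined


improved, declined = summarize(RESULTS)
for name, metric, base, tuned in RESULTS:
    print(f"{name:15s} {metric:10s} {tuned - base:+.2f}")
print(f"{len(improved)} of {len(RESULTS)} improved; declined: {declined}")
```

Deltas recomputed from the rounded table values can differ from the Change column in the last digit (RealWorldQA is one such case), presumably because the table's deltas were taken from unrounded scores.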

Summary

Solari improves on 7 of the 8 benchmarks relative to the base model:

POPE recall rose by 8.60 points (76.73 to 85.33): the model misses far fewer objects that are actually present in the image
All three hallucination benchmarks improved: POPE, AMBER, and HallusionBench
General capabilities were preserved or slightly improved: A-OKVQA, MMStar, MMBench, and RealWorldQA all show gains of 0.28 to 1.27 points
Trade-off on MME: Perception dropped ~98 points and Reasoning ~26 points, with the largest subtask losses on counting (-26.7), position (-26.7), and code reasoning (-27.5), attributed to the model becoming more conservative

Bias, Risks, and Limitations

Counting and spatial reasoning degraded: The DPO alignment made the model more conservative, reducing performance on fine-grained counting and positional reasoning tasks (reflected in MME scores).
Small model capacity: At 500M parameters, the model has inherent limitations on complex reasoning tasks.
English only: The model was trained and evaluated only on English-language tasks.
Training data bias: RLAIF-V preferences are AI-generated, which may introduce systematic biases.

Recommendations

Best suited for binary object recognition tasks ("Is there an X?") and general scene description
For tasks requiring precise counting or spatial reasoning, consider using the base model or a larger VLM
Always validate outputs in safety-critical applications
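
For the binary object-recognition use case recommended above, a small wrapper can build the question and turn the free-text reply into a boolean. This is a hedged sketch: the prompt wording and both helper names are assumptions made for illustration, and the model is not guaranteed to start its answer with "yes" or "no".

```python
import string


def presence_question(obj: str) -> str:
    """Build an 'Is there a/an X?' question for a non-empty object name."""
    obj = obj.strip()
    if not obj:
        raise ValueError("object name must not be empty")
    article = "an" if obj[0].lower() in "aeiou" else "a"
    # Assumption: asking for a yes/no answer makes the reply easier to parse.
    return f"Is there {article} {obj} in the image? Answer yes or no."


def parse_yes_no(answer: str):
    """Return True for 'yes', False for 'no', and None when the reply is unclear."""
    words = answer.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(string.punctuation)
    return {"yes": True, "no": False}.get(first)


print(presence_question("umbrella"))
print(parse_yes_no("Yes, there is an umbrella."))
```

Returning `None` for unclear replies lets the caller fall back to a human check instead of guessing, which fits the recommendation to validate outputs in safety-critical settings.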

Environmental Impact

Hardware Type: NVIDIA L4 (24GB)
Hours used: ~9 hours
Cloud Provider: Lightning AI
Compute Region: US

Technical Specifications

Model Architecture and Objective

Architecture: SmolVLM2 (ViT vision encoder + LLM decoder with multi-modal projector)
Parameters: ~500M total
Objective: Direct Preference Optimization (DPO) — learns to prefer accurate descriptions over hallucinated ones
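
The DPO objective for a single preference pair can be written in a few lines. The sketch below implements the standard DPO loss from summed response log-probabilities under the policy and the frozen reference model. It is an illustration of the formula, not the TRL trainer's code, and the example log-probabilities are made up.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy favors the chosen answer more than the reference does.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1))
# When policy and reference agree exactly, the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0, beta=0.1))
```

The loss falls as the policy raises the likelihood of the chosen (more accurate) response relative to the rejected (more hallucinated) one, with beta (0.1 in this card) controlling how strongly deviations from the reference model are rewarded.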

Compute Infrastructure

Hardware

NVIDIA L4 GPU (24GB VRAM) on Lightning AI

Software

Transformers
TRL (DPO Trainer)
PEFT (QLoRA)
BitsAndBytes (4-bit quantization)

Citation

BibTeX:

@misc{solari2026,
    title={Solari: Hallucination-Reduced Vision Language Model via QLoRA DPO on RLAIF-V},
    author={Cubex11},
    year={2026},
    url={https://huggingface.co/Cubex11/Solari}
}

Acknowledgments

Hugging Face for SmolVLM2 and the RLAIF-V formatted dataset
OpenBMB for the RLAIF-V and RLHF-V research
Lightning AI for compute resources
OpenCompass for the VLMEvalKit evaluation toolkit
