
Solari: Hallucination-Reduced Vision Language Model

Solari is a 500M-parameter vision-language model fine-tuned to hallucinate less on real-world images. Built on SmolVLM2-500M-Video-Instruct, Solari is trained with QLoRA and Direct Preference Optimization (DPO) on the RLAIF-V dataset to align the model toward more faithful visual descriptions.

Model Details

Model Description

Solari targets hallucination reduction in vision-language tasks, with a focus on improving reliability for accessibility applications (e.g., assisting visually impaired users). The model was trained with parameter-efficient fine-tuning (QLoRA) and DPO to learn preferences between accurate and hallucinated image descriptions. It improves scores on hallucination benchmarks while largely preserving general VLM capabilities; the main regression is on MME (see Evaluation).

Model Sources

Uses

Direct Use

Solari can be used for image understanding tasks where factual accuracy is critical:

  • Describing real-world scenes for visually impaired users
  • Visual question answering with reduced hallucination
  • Image captioning with improved object recognition reliability

Out-of-Scope Use

  • Tasks requiring strong mathematical reasoning or code understanding (degraded from base model)
  • Non-English language tasks
  • Medical or safety-critical applications without additional validation

How to Get Started with the Model

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "Cubex11/Solari"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Load a local image (replace the path with your own file)
image = Image.open("your_image.jpg").convert("RGB")

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Render the chat messages into a prompt string that ends at the assistant turn
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=[[image]], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens so that only the newly generated text is decoded
trimmed = output[0][len(inputs.input_ids[0]):]
print(processor.decode(trimmed, skip_special_tokens=True))

Training Details

Training Data

RLAIF-V (Formatted) — a large-scale multimodal preference dataset containing ~72K preference pairs. Each sample includes an image, a prompt, a chosen response (more accurate), and a rejected response (more hallucinated). Preferences are generated by open-source AI models following the RLAIF-V methodology.
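As an illustration of the sample layout described above, one preference pair could be represented as in the sketch below. The field names (`image_path`, `prompt`, `chosen`, `rejected`) and the example values are assumptions made for this sketch; they may not match the column names or contents of the published dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training sample (hypothetical field names)."""
    image_path: str  # path to the image both responses describe
    prompt: str      # question or instruction about the image
    chosen: str      # response judged more accurate
    rejected: str    # response judged more hallucinated


def is_usable(pair: PreferencePair) -> bool:
    """A pair carries a preference signal only if both responses exist and differ."""
    chosen, rejected = pair.chosen.strip(), pair.rejected.strip()
    return bool(chosen) and bool(rejected) and chosen != rejected


# Made-up example, not an actual dataset entry
example = PreferencePair(
    image_path="street.jpg",
    prompt="What is on the table?",
    chosen="A laptop and a coffee mug.",
    rejected="A laptop, a coffee mug, and a cat.",
)
print(is_usable(example))
```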

Training Procedure

Method: QLoRA + Direct Preference Optimization (DPO)

The base model was quantized to 4-bit (NF4) and fine-tuned using Low-Rank Adaptation (LoRA) with DPO to learn preferences between accurate and hallucinated responses.
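To make the DPO objective concrete, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. It is a plain-Python illustration of the published DPO formula, not the training code used for Solari; the log-probability values are made up, and only `beta = 0.1` comes from the hyperparameters listed in this card.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy raises the chosen response and lowers the rejected one: the loss falls
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss only depends on how far the policy has moved relative to the frozen reference model, which is why a larger `beta` penalizes drifting from the reference more strongly.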

Training Hyperparameters

Parameter              Value
Training regime        bf16 mixed precision
Quantization           4-bit NF4 (double quantization)
LoRA rank (r)          16
LoRA alpha             16
LoRA dropout           0.1
DoRA                   Enabled
Target modules         q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable params       ~1.9% of total
Learning rate          5e-5
DPO beta               0.1
Batch size             8 (per device)
Gradient accumulation  4 (effective batch = 32)
Epochs                 2 (best checkpoint at ~1 epoch / step 2500)
Warmup ratio           0.1
Optimizer              AdamW
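The hyperparameters above could be expressed with the Transformers and PEFT configuration classes roughly as follows. This is a sketch reconstructed from the table, not the original training script: the bf16 compute dtype for the quantized layers is an assumption inferred from the training regime, and the keys of `dpo_hyperparameters` follow common `TrainingArguments`/`DPOConfig` naming that should be checked against the installed TRL version.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: matches the bf16 regime
)

# LoRA adapters with DoRA enabled on the attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    use_dora=True,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Remaining values from the table, kept as a plain dict (key names are assumptions)
dpo_hyperparameters = {
    "beta": 0.1,
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "warmup_ratio": 0.1,
    "bf16": True,
}
```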

Speeds, Sizes, Times

  • Training time: ~9 hours on NVIDIA L4 (24GB)
  • Best checkpoint: Step 2500 (selected by lowest validation loss)
  • Model size: ~1 GB (bf16 safetensors)
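The step and size figures above can be sanity-checked with a little arithmetic. The sketch below assumes a single GPU and that all ~72K preference pairs were used for training; a held-out validation split would lower the number of steps per epoch and place step 2500 somewhat later in training.

```python
per_device_batch = 8
grad_accum_steps = 4
num_devices = 1         # assumption: a single L4 GPU
train_pairs = 72_000    # assumption: all ~72K pairs are in the training split

effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = train_pairs // effective_batch
epochs_at_best = 2500 / steps_per_epoch

params = 500_000_000    # ~500M parameters
bytes_per_param = 2     # bf16 stores each weight in 2 bytes
size_gb = params * bytes_per_param / 1e9

print(effective_batch)            # 32
print(steps_per_epoch)            # 2250
print(round(epochs_at_best, 2))   # 1.11
print(size_gb)                    # 1.0
```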

Evaluation

Testing Data, Factors & Metrics

Evaluated using VLMEvalKit on 8 standard benchmarks covering hallucination, general VLM capability, and real-world understanding.

Metrics

  • POPE: F1 score across random/popular/adversarial splits (object hallucination)
  • AMBER: Attribute, Existence, Relation accuracy (multi-dimensional hallucination)
  • HallusionBench: aAcc, fAcc, qAcc (hallucination detection)
  • A-OKVQA: Accuracy on outside-knowledge VQA
  • MME: Perception and Reasoning scores
  • MMStar: Multi-modal reasoning accuracy
  • MMBench: General multi-modal understanding
  • RealWorldQA: Real-world image understanding accuracy
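For the POPE-style metrics, precision, recall, and F1 are computed over yes/no answers with "yes" as the positive class. The function below is a minimal sketch of that computation on toy data; VLMEvalKit's own implementation may differ in how it parses model answers.

```python
from typing import Sequence


def binary_scores(predictions: Sequence[bool], labels: Sequence[bool]) -> dict:
    """Precision, recall, and F1 for yes/no answers ('yes' is the positive class)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)        # said yes, object present
    fp = sum(1 for p, l in pairs if p and not l)    # said yes, object absent
    fn = sum(1 for p, l in pairs if not p and l)    # said no, object present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: an overly cautious model that misses half of the present objects
labels = [True, True, True, True, False, False, False, False]
predictions = [True, True, False, False, False, False, False, False]
print(binary_scores(predictions, labels))
```

In this toy case precision is perfect but recall is only 0.5, which is the failure mode that a recall gain on POPE indicates has been reduced.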

Results

Benchmark       Metric      Base Model  Solari   Change
POPE            Overall     82.67       85.08    +2.41
POPE            Recall      76.73       85.33    +8.60
AMBER           Avg ACC     79.38       79.77    +0.39
AMBER           Relation    72.36       75.42    +3.06
HallusionBench  Overall     27.58       28.14    +0.56
A-OKVQA         Overall     68.12       69.00    +0.88
MMStar          Overall     38.33       39.60    +1.27
MMBench         Test        53.14       53.42    +0.28
RealWorldQA     Overall     49.80       50.59    +0.78
MME             Perception  1216.19     1118.51  -97.68
MME             Reasoning   237.50      211.79   -25.71
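The Change column is an absolute difference in score points, not a relative percentage. The sketch below recomputes it from the table values and also reports the relative change for comparison. The 0.015 tolerance is an assumption that allows for a last-digit difference, which can arise when a reported change was computed from unrounded scores.

```python
# (benchmark and metric, base score, Solari score, reported change) from the table above
ROWS = [
    ("POPE Overall", 82.67, 85.08, 2.41),
    ("POPE Recall", 76.73, 85.33, 8.60),
    ("AMBER Avg ACC", 79.38, 79.77, 0.39),
    ("AMBER Relation", 72.36, 75.42, 3.06),
    ("HallusionBench Overall", 27.58, 28.14, 0.56),
    ("A-OKVQA Overall", 68.12, 69.00, 0.88),
    ("MMStar Overall", 38.33, 39.60, 1.27),
    ("MMBench Test", 53.14, 53.42, 0.28),
    ("RealWorldQA Overall", 49.80, 50.59, 0.78),
    ("MME Perception", 1216.19, 1118.51, -97.68),
    ("MME Reasoning", 237.50, 211.79, -25.71),
]


def summarize(rows, tolerance=0.015):
    """Return (name, absolute change, relative change in %, matches reported change)."""
    results = []
    for name, base, solari, reported in rows:
        delta = solari - base
        relative_pct = 100.0 * delta / base
        consistent = abs(delta - reported) < tolerance
        results.append((name, delta, relative_pct, consistent))
    return results


for name, delta, relative_pct, consistent in summarize(ROWS):
    note = "" if consistent else "  <- differs from reported change"
    print(f"{name:24s} {delta:+8.2f} pts  {relative_pct:+6.2f}%{note}")
```

For example, the POPE recall gain of 8.60 points corresponds to a relative improvement of roughly 11% over the base model.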

Summary

Solari improves on 7 out of 8 benchmarks compared to the base model:

  • POPE recall: +8.60 points (76.73 to 85.33), so the model more often recognizes objects that are actually present in images
  • All hallucination benchmarks improved: POPE, AMBER, and HallusionBench
  • General capabilities preserved or improved: A-OKVQA, MMStar, MMBench, and RealWorldQA all show small gains (+0.28 to +1.27)
  • Trade-off on MME: the perception score dropped ~98 points, with the largest losses on counting (-26.7) and position (-26.7), and the reasoning score dropped ~26 points, driven by code reasoning (-27.5). A likely cause is that the model became more conservative.

Bias, Risks, and Limitations

  • Counting and spatial reasoning degraded: The DPO alignment made the model more conservative, reducing performance on fine-grained counting and positional reasoning tasks (reflected in MME scores).
  • Small model capacity: At 500M parameters, the model has inherent limitations on complex reasoning tasks.
  • English only: The model was trained and evaluated only on English-language tasks.
  • Training data bias: RLAIF-V preferences are AI-generated, which may introduce systematic biases.

Recommendations

  • Best suited for binary object recognition tasks ("Is there an X?") and general scene description
  • For tasks requiring precise counting or spatial reasoning, consider using the base model or a larger VLM
  • Always validate outputs in safety-critical applications
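For the binary object-recognition use case recommended above, a prompt builder and an answer parser might look like the sketch below. Both helper names (`build_existence_messages`, `parse_yes_no`) and the exact question wording are hypothetical; the message layout mirrors the getting-started example in this card.

```python
import re
from typing import Optional


def build_existence_messages(object_name: str) -> list:
    """Chat messages asking whether an object is present in the attached image."""
    name = object_name.strip()
    if not name:
        raise ValueError("object_name must not be empty")
    article = "an" if name[0].lower() in "aeiou" else "a"
    question = f"Is there {article} {name} in the image? Answer yes or no."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map a free-form reply to True/False, or None when it does not start with yes/no."""
    words = re.findall(r"[a-z]+", answer.lower())
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


print(build_existence_messages("umbrella")[0]["content"][1]["text"])
print(parse_yes_no("Yes, there is an umbrella."))
```

Returning `None` for ambiguous replies lets a calling application fall back to a human check instead of guessing, which matters in the safety-critical settings mentioned above.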

Environmental Impact

  • Hardware Type: NVIDIA L4 (24GB)
  • Hours used: ~9 hours
  • Cloud Provider: Lightning AI
  • Compute Region: US
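A rough upper bound on GPU energy use can be derived from the figures above. The sketch assumes the GPU drew its full rated board power (72 W for the L4) for the entire run and ignores CPU, memory, and datacenter overhead. The grid carbon intensity is a placeholder value for illustration only, not a measured figure for the region used.

```python
def gpu_energy_kwh(hours: float, gpu_power_watts: float) -> float:
    """Energy in kWh if the GPU ran at the given power for the whole duration."""
    return hours * gpu_power_watts / 1000.0


def emissions_kg(energy_kwh: float, grid_kg_co2e_per_kwh: float) -> float:
    """CO2-equivalent emissions for a given energy use and grid carbon intensity."""
    return energy_kwh * grid_kg_co2e_per_kwh


# 9 hours at 72 W (rated L4 board power, treated as an upper bound)
energy = gpu_energy_kwh(hours=9, gpu_power_watts=72)
print(round(energy, 3))  # 0.648

# Placeholder intensity, illustrative only: look up the real value for the region
PLACEHOLDER_GRID_INTENSITY = 0.4  # kg CO2e per kWh
print(emissions_kg(energy, PLACEHOLDER_GRID_INTENSITY))
```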

Technical Specifications

Model Architecture and Objective

  • Architecture: SmolVLM2 (ViT vision encoder + LLM decoder with multi-modal projector)
  • Parameters: ~500M total
  • Objective: Direct Preference Optimization (DPO) — learns to prefer accurate descriptions over hallucinated ones

Compute Infrastructure

Hardware

NVIDIA L4 GPU (24GB VRAM) on Lightning AI

Software

  • Transformers
  • TRL (DPO Trainer)
  • PEFT (QLoRA)
  • BitsAndBytes (4-bit quantization)

Citation

BibTeX:

@misc{solari2026,
    title={Solari: Hallucination-Reduced Vision Language Model via QLoRA DPO on RLAIF-V},
    author={Cubex11},
    year={2026},
    url={https://huggingface.co/Cubex11/Solari}
}

Acknowledgments
