📊 alan-vlm: ChartQA Vision Language Model

A finetuned version of Qwen2-VL-2B-Instruct trained to answer natural-language questions about charts and graphs.

Finetuned on the ChartQA dataset with Unsloth on a free Google Colab T4 GPU.


🧠 Model Details

| Setting | Value |
|---|---|
| Base Model | Qwen2-VL-2B-Instruct |
| Finetuning Method | LoRA (r=8, alpha=8) |
| Training Data | 2,000 chart QA pairs |
| Training Steps | 500 |
| Batch Size | 8 (2 per device × 4 gradient accumulation steps) |
| Trainable Parameters | 9,232,384 (0.42% of total) |
| Precision | fp16 |
| Hardware | Google Colab T4 (15 GB VRAM) |
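
As a sanity check, the effective batch size and the number of passes over the data follow directly from the numbers above (back-of-the-envelope arithmetic, not taken from training logs):

```python
# Effective batch size = per-device batch × gradient accumulation steps
per_device_batch = 2
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps   # 8, as listed above

# 500 optimizer steps at an effective batch of 8 = 4,000 samples,
# i.e. roughly 2 passes over the 2,000 training pairs
steps = 500
samples_seen = effective_batch * steps
epochs = samples_seen / 2000

print(effective_batch, samples_seen, epochs)  # 8 4000 2.0
```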

🚀 Quick Start

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch

# Load model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "alanjoshua2005/alan-vlm",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

# Run inference
def ask(image_path, question):
    image = Image.open(image_path).convert("RGB")

    # Chat message with an image placeholder followed by the question
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]

    text_prompt = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    inputs = processor(
        text=text_prompt,
        images=image,
        return_tensors="pt",
    ).to(model.device)  # follow whatever device device_map chose

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)

    # Strip the prompt tokens so only the generated answer is decoded
    input_len = inputs["input_ids"].shape[1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

# Example
answer = ask("chart.png", "What is the value of the highest bar?")
print(answer)
```

🎛️ Gradio Demo

```python
import gradio as gr
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "alanjoshua2005/alan-vlm",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

def answer_chart_question(image, question):
    if image is None or not question.strip():
        return "Please provide both an image and a question."
    image = image.convert("RGB")  # gr.Image(type="pil") yields a PIL image
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=text_prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    input_len = inputs["input_ids"].shape[1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

gr.Interface(
    fn=answer_chart_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="📊 ChartQA - alan-vlm",
).launch()
```

📦 Dataset

Trained on weijiezz/chartqa_split_test, a 2,000-row dataset of chart images paired with questions and answers. It contains two types of questions:

- `human_test`: questions written by human annotators
- `augmented_test`: questions generated via data augmentation

๐Ÿ‹๏ธ Training Details

Training was done using Unsloth for optimized LoRA finetuning:

```python
from unsloth import FastVisionModel

# Load the base model in 4-bit to fit the T4's 15 GB of VRAM
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-2B-Instruct",
    load_in_4bit=True,
)

# Attach LoRA adapters to both the vision and language layers
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,
    lora_alpha=8,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```
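
The trainer setup itself isn't shown above. The following is a sketch of an Unsloth/TRL SFTTrainer configuration consistent with the Model Details table (batch 2 with 4 gradient-accumulation steps, 500 steps, fp16), based on Unsloth's standard vision-finetuning recipe; the learning rate, optimizer, and `train_dataset` preparation are assumptions, not values recovered from the actual run.

```python
# Sketch only: hyperparameters below (learning rate, optimizer, seq length)
# are assumptions, not taken from the actual training run.
from trl import SFTTrainer, SFTConfig
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator

FastVisionModel.for_training(model)  # switch adapters into training mode

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  # handles images
    train_dataset=train_dataset,  # chart QA pairs in chat-message format
    args=SFTConfig(
        per_device_train_batch_size=2,   # from the table above
        gradient_accumulation_steps=4,   # effective batch size 8
        max_steps=500,
        learning_rate=2e-4,              # assumption
        fp16=True,                       # T4 has no bf16 support
        optim="adamw_8bit",              # assumption
        logging_steps=10,
        output_dir="outputs",
        remove_unused_columns=False,     # required for vision inputs
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
    ),
)
trainer.train()
```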

⚠️ Limitations

- Trained on only 2,000 samples; this is a learning/experimental project
- May struggle with complex multi-series charts or heavily annotated graphs
- Not yet evaluated on the full ChartQA benchmark
- Best suited for simple bar, pie, and line chart questions
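
For anyone who does want to benchmark the model, ChartQA is conventionally scored with relaxed accuracy: numeric answers count as correct within 5% relative error, and other answers must match exactly. A minimal sketch (the helper name and exact normalization are our own, not from an official scorer):

```python
def relaxed_match(pred: str, target: str, tol: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy: numeric answers may differ by up to
    `tol` relative error; non-numeric answers must match case-insensitively."""
    pred, target = pred.strip(), target.strip()
    try:
        # Normalize thousands separators and percent signs before comparing
        p = float(pred.replace(",", "").rstrip("%"))
        t = float(target.replace(",", "").rstrip("%"))
        if t == 0:
            return p == 0
        return abs(p - t) / abs(t) <= tol
    except ValueError:
        return pred.lower() == target.lower()

# relaxed_match("102", "100")  -> True  (2% relative error)
# relaxed_match("110", "100")  -> False (10% relative error)
# relaxed_match("Asia", "asia") -> True (case-insensitive string match)
```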

๐Ÿ™ Acknowledgements
