Instructions for using hhenryz/LOVA3-llava-v1.5-7b with libraries and local apps.

Libraries

How to use hhenryz/LOVA3-llava-v1.5-7b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="hhenryz/LOVA3-llava-v1.5-7b")

# Load model directly
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("hhenryz/LOVA3-llava-v1.5-7b")
model = AutoModelForCausalLM.from_pretrained("hhenryz/LOVA3-llava-v1.5-7b")

Local Apps

vLLM

How to use hhenryz/LOVA3-llava-v1.5-7b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "hhenryz/LOVA3-llava-v1.5-7b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hhenryz/LOVA3-llava-v1.5-7b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
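The same request can be sent from a Python script. This is a minimal sketch using only the standard library. It assumes a server started with one of the commands in this section is reachable at `base_url`, and, like the curl example, it sends a text-only prompt (image inputs are not covered). `build_completion_payload` and `complete` are illustrative helper names, not part of vLLM or SGLang.

```python
import json
import urllib.request

MODEL_ID = "hhenryz/LOVA3-llava-v1.5-7b"


def build_completion_payload(prompt: str, max_tokens: int = 512, temperature: float = 0.5) -> dict:
    """Build the same JSON body that the curl example sends to /v1/completions."""
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to an OpenAI-compatible server and return the first choice's text.

    Assumption: a vLLM (default port 8000) or SGLang (port 30000 in the
    commands here) server is already running at base_url.
    """
    body = json.dumps(build_completion_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # OpenAI-style completion responses carry generated text in choices[i]["text"].
    return result["choices"][0]["text"]
```

Usage: `print(complete("Once upon a time,"))`. For the SGLang server, pass `base_url="http://localhost:30000"`.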

Use Docker

docker model run hf.co/hhenryz/LOVA3-llava-v1.5-7b

SGLang

How to use hhenryz/LOVA3-llava-v1.5-7b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "hhenryz/LOVA3-llava-v1.5-7b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hhenryz/LOVA3-llava-v1.5-7b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "hhenryz/LOVA3-llava-v1.5-7b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hhenryz/LOVA3-llava-v1.5-7b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'


💡Key Contributions:

LOVA3 - To the best of our knowledge, LOVA3 is the first effort to instill question-asking and answer-assessment abilities when training a robust and intelligent MLLM, inspired by human learning mechanisms.
EvalQABench - We build EvalQABench, a new benchmark for VQA correctness evaluation, as a first effort to advance future research in this direction.
Performance Improvement - Models trained with our proposed LOVA3 framework show consistent improvements on 10 representative benchmarks.

Model weight

Pretrained weights: LOVA3-llava-v1.5-7b

Download them using the following commands (Git LFS is needed to fetch the large weight files):

git lfs install
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
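As an alternative to `git clone`, the files can be fetched with the `huggingface_hub` library. The sketch below assumes `huggingface_hub` is installed and places the weights under `checkpoints/LOVA3-llava-v1.5-7b`, the location the Evaluation steps use. `checkpoint_dir` and `download_checkpoint` are hypothetical helper names.

```python
from pathlib import Path

REPO_ID = "hhenryz/LOVA3-llava-v1.5-7b"


def checkpoint_dir(repo_id: str, root: str = "checkpoints") -> Path:
    """Map a Hub repo id to checkpoints/<model-name>."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id: str = REPO_ID, root: str = "checkpoints") -> Path:
    """Download every file of the repo into checkpoints/<model-name>."""
    # Imported lazily so checkpoint_dir works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = checkpoint_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_checkpoint()` from the repository root downloads the full snapshot; it needs network access and enough disk space for a 7B checkpoint.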

Training Data

We provide the training, evaluation, and test sets of EvalQABench under the folder EvalQABench.
Training data: Mixed_VQA_GenQA_EvalQA_1.5M.jsonl.
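To sanity-check the training file after downloading, the sketch below streams it line by line and counts records. The schema is an assumption, not documented here: it expects LLaVA-style instruction data, with an optional `image` key and a `conversations` list per record. Adjust the keys if the file differs.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def summarize_lines(lines: Iterable[str]) -> dict:
    """Count records, records with an "image" key, and conversation lengths.

    Assumed (LLaVA-style) schema: optional "image" key, "conversations" list.
    """
    total = 0
    with_image = 0
    turn_counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if "image" in record:
            with_image += 1
        turn_counts[len(record.get("conversations", []))] += 1
    return {"records": total, "with_image": with_image, "turns": dict(turn_counts)}


def summarize_jsonl(path: str) -> dict:
    """Stream a .jsonl file so the 1.5M-record file is never fully loaded into memory."""
    with Path(path).open(encoding="utf-8") as handle:
        return summarize_lines(handle)
```

Usage: `summarize_jsonl("data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl")`, assuming the file sits in the `data` folder used by the Training steps.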

Image Datasets

Please download the images from the constituent datasets:

COCO: train2014
GQA: images
OCR-VQA: download script; we save all files as .jpg
AOKVQA: download script
TextVQA: train_val_images
VisualGenome: part1, part2
LLaVA-Instruct: huggingface
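If some downloaded OCR-VQA images are not .jpg files, one way to match the note above is to re-save them as JPEG. This is a sketch, not the authors' script. It assumes Pillow is installed and that the folder holds only image files, and it leaves the original files in place.

```python
from pathlib import Path


def jpg_target(src: Path) -> Path:
    """Path under which an image should be stored: same name, .jpg extension."""
    return src.with_suffix(".jpg")


def convert_to_jpg(image_dir: str) -> int:
    """Re-save every non-.jpg file in image_dir as an RGB JPEG; return the number converted."""
    from PIL import Image  # Pillow, imported lazily so jpg_target works without it

    converted = 0
    # sorted() materializes the listing first, so newly written .jpg files are not revisited.
    for src in sorted(Path(image_dir).iterdir()):
        if not src.is_file() or src.suffix.lower() == ".jpg":
            continue
        with Image.open(src) as img:
            # JPEG has no alpha or palette mode, so convert to RGB before saving.
            img.convert("RGB").save(jpg_target(src), "JPEG")
        converted += 1
    return converted
```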

💃 Evaluation

Download LOVA3-llava-v1.5-7b under the folder checkpoints.
Download the CLIP vision encoder clip-vit-large-patch14-336 under the folder checkpoints.
Run the evaluation scripts under the folder scripts/v1_5/eval. They cover 12 multimodal datasets and benchmarks.

Taking VizWiz as an example, the commands are as follows:

modelname=LOVA3-llava-v1.5-7b

# Step 1: generate answers for the VizWiz test questions.
# Set --image-folder to the directory that holds the VizWiz test images.
python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /yourpath/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

# Step 2: convert the answers into the VizWiz submission format.
python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
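The second command joins the generated answers with the test questions to produce the file uploaded for VizWiz scoring. The sketch below illustrates that join in simplified form. It assumes LLaVA-style field names (`question_id` and `image` in `llava_test.jsonl`, `question_id` and `text` in the answers file) and only strips whitespace, whereas the actual `convert_vizwiz_for_submission.py` may also normalize answers. All function names are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path: str) -> list[dict]:
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def build_vizwiz_upload(questions: list[dict], results: list[dict]) -> list[dict]:
    """Pair each test question's image with its generated answer (empty if missing)."""
    answers = {r["question_id"]: r["text"] for r in results}
    return [
        {"image": q["image"], "answer": answers.get(q["question_id"], "").strip()}
        for q in questions
    ]


def write_upload(annotation_file: str, result_file: str, upload_file: str) -> None:
    """File-based wrapper mirroring the three arguments of the second command."""
    upload = build_vizwiz_upload(read_jsonl(annotation_file), read_jsonl(result_file))
    out = Path(upload_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(upload), encoding="utf-8")
```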

Training

Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder checkpoints.
Download the model weights clip-vit-large-patch14-336 under the folder checkpoints.
Download the model weights vicuna-7b-v1.5 under the folder checkpoints.
Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under the folder data.
Run the training script.

bash scripts/v1_5/finetune.sh
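Before launching `finetune.sh`, a quick check that the inputs listed in the training steps are in place can avoid a failed run. This sketch only tests that the paths exist relative to the repository root; that `finetune.sh` reads them from exactly these locations is an assumption not verified here. `missing_paths` is a hypothetical helper.

```python
from pathlib import Path

# Inputs named in the training steps, relative to the repository root.
REQUIRED_PATHS = (
    "checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5",
    "checkpoints/clip-vit-large-patch14-336",
    "checkpoints/vicuna-7b-v1.5",
    "data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl",
)


def missing_paths(root: str = ".", required: tuple[str, ...] = REQUIRED_PATHS) -> list[str]:
    """Return the required paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).exists()]
```

Usage: run `missing_paths()` from the repository root; an empty list means every listed input was found.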

🙏 Acknowledgement

LLaVA: The codebase we built upon.
LAVIS: We downloaded some datasets using its scripts.

🎓 Citation

If you find LOVA3 useful, please cite using this BibTeX:

@inproceedings{
    zhao2024lova,
    title={{LOVA}3: Learning to Visual Question Answering, Asking and Assessment},
    author={Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024},
    url={https://openreview.net/forum?id=vIOKLMl6wu}
}
