Instructions to use qnguyen3/nanoLLaVA with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use qnguyen3/nanoLLaVA with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="qnguyen3/nanoLLaVA", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("qnguyen3/nanoLLaVA", trust_remote_code=True, dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use qnguyen3/nanoLLaVA with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "qnguyen3/nanoLLaVA"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "qnguyen3/nanoLLaVA",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/qnguyen3/nanoLLaVA

SGLang

How to use qnguyen3/nanoLLaVA with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "qnguyen3/nanoLLaVA" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "qnguyen3/nanoLLaVA",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "qnguyen3/nanoLLaVA" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "qnguyen3/nanoLLaVA",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use qnguyen3/nanoLLaVA with Docker Model Runner:
```
docker model run hf.co/qnguyen3/nanoLLaVA
```

Run on Macbook without flash_attn?

by palebluewanders - opened Apr 12, 2024

Discussion

palebluewanders

Apr 12, 2024

Supposedly CPU is supported right? I'm trying to run this on Apple silicon, but cannot get past the flash_attn requirement which is for NVIDIA GPUs. How can I get around this?

Traceback (most recent call last):
  File "/Users/demospace/Desktop/cbd/nanoLLaVA/nanoLLaVA.py", line 16, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/demospace/cbd/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 550, in from_pretrained
    model_class = get_class_from_dynamic_module(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/demospace/cbd/lib/python3.12/site-packages/transformers/dynamic_module_utils.py", line 489, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/demospace/cbd/lib/python3.12/site-packages/transformers/dynamic_module_utils.py", line 315, in get_cached_module_file
    modules_needed = check_imports(resolved_module_file)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/demospace/cbd/lib/python3.12/site-packages/transformers/dynamic_module_utils.py", line 180, in check_imports
    raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn

qnguyen3

Owner Apr 12, 2024

Hi @palebluewanders , can you try this and see what it gives back? FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation

CoderCowMoo

Apr 16, 2024

Hi I'm on the windows platform, facing the same issue. Doing the above, still doesn't let me install the flash-attn module.

Error limit reached.
100 errors detected in the compilation of "csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu".
Compilation terminated.
error: command 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe' failed with exit code 4294967295
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

Is there a way to use a different attention implementation, e.g. SDPA? My script fails when it tries to run the AutoModelForCausalLM line, so I wouldn't be able to include SDPA, as per this documentation, before it errors.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment