Issues running model
Since the model_basename is not originally provided in the example code, I tried this:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse
model_name_or_path = "TheBloke/starcoderplus-GPTQ"
model_basename = "gptq_model-4bit--1g.safetensors"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=True,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton,
                                           quantize_config=None)
print("\n\n*** Generate:")
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda:0")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
But I always get the following:
FileNotFoundError: could not find model TheBloke/starcoderplus-GPTQ
When I remove the model_basename parameter, it downloads, but I get the following error with generate:
The safetensors archive passed at ~/.cache/huggingface/hub/models--TheBloke--starcoderplus-GPTQ/snapshots/aa67ff4fad65fc88f6281f3a2bcc0d648105ef96/gptq_model-4bit--1g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
*** Generate:
TypeError: generate() takes 1 positional argument but 2 were given
I am using the original code provided, with no other alterations. I am able to load other models from your HF repos with AutoGPTQ, but not this one.
Hmm, you shouldn't need model_basename for this. Maybe that's an AutoGPTQ bug.
When it is required, you leave the .safetensors off the end, so it's model_basename="gptq_model-4bit--1g"
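A minimal sketch of that rule, in case it helps: the helper name to_model_basename is hypothetical and is not part of AutoGPTQ. It only strips the final extension from the weights file name, which is what the advice above asks for.

```python
import os


def to_model_basename(weights_filename: str) -> str:
    """Hypothetical helper: drop the final extension from a weights file name."""
    basename, _extension = os.path.splitext(weights_filename)
    return basename


# File name taken from the snapshot path in the warning message.
model_basename = to_model_basename("gptq_model-4bit--1g.safetensors")
print(model_basename)  # gptq_model-4bit--1g
```

The resulting string is what would be passed as model_basename, if the parameter turns out to be needed at all.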
Thank you for the insight. Do you have any idea why the generate() issue occurs when I remove model_basename? Here is my code without it:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse
model_name_or_path = "TheBloke/starcoderplus-GPTQ"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           use_safetensors=True,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton,
                                           quantize_config=None)
print("\n\n*** Generate:")
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda:0")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
As discussed on Discord:
This is caused by this bug: https://github.com/PanQiWei/AutoGPTQ/pull/135
The workaround is to call model.generate(inputs=inputs)
Fix: I just needed to pass inputs=inputs to generate(). TheBloke submitted a fix, but it has not been merged into the AutoGPTQ GitHub repo yet.
Make sure the script uses the version without model_basename that I posted above.
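For anyone wondering why the keyword form works, here is a rough sketch that reproduces the same TypeError without a GPU or the real library. It assumes the wrapper's generate() is declared to accept keyword arguments only, which is consistent with the error message above but is an assumption about the internals. InnerModel and KeywordOnlyWrapper are hypothetical stand-ins, not AutoGPTQ classes.

```python
class InnerModel:
    """Hypothetical stand-in for the wrapped model."""

    def generate(self, inputs=None, **kwargs):
        # Echo the inputs back so the call path is easy to inspect.
        return [inputs]


class KeywordOnlyWrapper:
    """Hypothetical stand-in for a wrapper whose generate() takes keywords only."""

    def __init__(self, model):
        self.model = model

    def generate(self, **kwargs):
        # No positional parameter besides self, so generate(x) raises TypeError.
        return self.model.generate(**kwargs)


wrapper = KeywordOnlyWrapper(InnerModel())
inputs = [1, 2, 3]

try:
    wrapper.generate(inputs)  # positional call, as in the original script
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

outputs = wrapper.generate(inputs=inputs)  # keyword call, the workaround
print(positional_error)
print(outputs)
```

The positional call fails before the inner model is ever reached, while the keyword call is forwarded unchanged. That is why switching to inputs=inputs is enough and nothing else in the script has to change.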
