Instructions for using IQuestLab/IQuest-Coder-V1-40B-Instruct with libraries and local apps.

Libraries

How to use IQuestLab/IQuest-Coder-V1-40B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="IQuestLab/IQuest-Coder-V1-40B-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("IQuestLab/IQuest-Coder-V1-40B-Instruct", trust_remote_code=True, dtype="auto")
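
Loading the model is only the first step; generating a reply also needs the tokenizer and the model's chat template. The following is a minimal sketch, assuming a recent Transformers release (one that accepts `dtype=`), the `accelerate` package for `device_map="auto"`, and enough GPU memory for a 40B model. The function name `generate_reply` and the `max_new_tokens` default are illustrative choices, not something this model page prescribes.

```python
MODEL_ID = "IQuestLab/IQuest-Coder-V1-40B-Instruct"


def generate_reply(prompt, model_id=MODEL_ID, max_new_tokens=256):
    """Load the model and return its reply to a single user prompt."""
    # Imported lazily so this module can be imported without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, dtype="auto", device_map="auto"
    )

    # Render the conversation with the model's chat template and tokenize it.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("Who are you?")` downloads the weights on first use, so expect a long initial load.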

Local Apps

vLLM

How to use IQuestLab/IQuest-Coder-V1-40B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "IQuestLab/IQuest-Coder-V1-40B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "IQuestLab/IQuest-Coder-V1-40B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
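
The same request can be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "IQuestLab/IQuest-Coder-V1-40B-Instruct"


def build_chat_request(prompt, model=MODEL_ID):
    """Build the JSON payload for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://localhost:8000/v1"):
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("What is the capital of France?"))
```

Pointing `base_url` at `http://localhost:30000/v1` should work the same way against an SGLang server, since it exposes the same OpenAI-compatible route.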

Use Docker

docker model run hf.co/IQuestLab/IQuest-Coder-V1-40B-Instruct

SGLang

How to use IQuestLab/IQuest-Coder-V1-40B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "IQuestLab/IQuest-Coder-V1-40B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "IQuestLab/IQuest-Coder-V1-40B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "IQuestLab/IQuest-Coder-V1-40B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "IQuestLab/IQuest-Coder-V1-40B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use IQuestLab/IQuest-Coder-V1-40B-Instruct with Docker Model Runner:
```
docker model run hf.co/IQuestLab/IQuest-Coder-V1-40B-Instruct
```

IQuest-Coder-V1 inference support and benchmarks

#10

by Geodd - opened Feb 5

We’re DeployPad, a team focused on high-performance inference for open-weight models.

We’ve added official support for the IQuest-Coder-V1 family. Our goal is to make the models easier and more cost-efficient to run for the IQuestLab community, without changing how you interact with them.

Current performance
~50–80 tokens/sec
Batch size: 32
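
One way to sanity-check a throughput figure like this on your own deployment is to divide the number of generated tokens by wall-clock time. The sketch below is an assumption about how such a check could be done, not the methodology behind the numbers in this post. It relies on the `usage.completion_tokens` field that OpenAI-compatible chat completion responses include; the helper names and the stand-in response are hypothetical.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Throughput for one request: generated tokens / wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed(call):
    """Run call() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


# Stand-in values for illustration only; a real response comes from the server.
fake_response = {"usage": {"completion_tokens": 130}}
rate = tokens_per_second(fake_response["usage"]["completion_tokens"], 2.0)
print(f"{rate:.1f} tokens/sec")  # 65.0 tokens/sec
```

Timing a whole request this way includes prompt processing, so it tends to understate pure decode speed. With several requests in flight, aggregate throughput is the sum of tokens across requests divided by the same wall-clock window.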

We also plan to add lower-cost GPU options, including RTX Pro 6000 and L40s, both running at FP8 precision, to further reduce deployment cost while maintaining performance.
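
Part of the appeal of FP8 on smaller GPUs is weight memory: FP8 stores one byte per parameter, half of what BF16 or FP16 needs. Here is a rough estimate, assuming exactly 40 billion parameters (taken from the "40B" in the model name) and counting weights only, with no KV cache, activations, or runtime overhead:

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 40e9  # assumption: read off the "40B" in the model name

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 16-bit weights: 2 bytes each
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)   # 8-bit weights: 1 byte each

print(f"BF16 weights: ~{bf16_gb:.0f} GB")  # BF16 weights: ~80 GB
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")   # FP8 weights:  ~40 GB
```

Real deployments need more than this once the KV cache and activations are included, so treat both numbers as lower bounds.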

Benchmarks and methodology are publicly available here:
https://github.com/geoddllc/large-llm-inference-benchmarks

If you want to try it yourself, you can deploy via the console:
https://console.geodd.io/

(Top up and run; there are no platform fees beyond compute.)

Feedback, questions, and requests from the community are welcome. Feel free to leave a comment below.
