GLM-4.7-Q8_0.gguf capabilities with Ollama
$ ollama show GLM-4.7-Q8_0:latest
Model
architecture glm4moe
parameters 358.3B
context length 202752
embedding length 5120
quantization Q8_0
Capabilities
completion
The tools and thinking capabilities are missing. I used llama.cpp to merge all the GGUF files. Any ideas or feedback, guys?
./llama-gguf-split --merge ../GLM-4.7-Q8_0-00001-of-00008.gguf ../GLM-4.7-Q8_0.gguf
echo "FROM GLM-4.7-Q8_0.gguf" > "GLM-4.7-Q8_0.model"
ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0.model
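To compare Modelfile revisions, it can help to check the Capabilities section of the `ollama show` output programmatically. Below is a minimal sketch that parses the text layout shown above; the function names and the set of section headings are assumptions for illustration, not part of Ollama.

```python
def parse_capabilities(show_output: str) -> list[str]:
    """Return the entries listed under the 'Capabilities' heading."""
    # Headings assumed from the `ollama show` output quoted in this thread.
    section_headings = {"Model", "Capabilities", "Parameters", "System", "License"}
    capabilities = []
    in_section = False
    for raw_line in show_output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line in section_headings:
            # Entering or leaving the Capabilities section.
            in_section = line == "Capabilities"
            continue
        if in_section:
            capabilities.append(line)
    return capabilities


def missing_capabilities(show_output: str, wanted=("tools", "thinking")) -> list[str]:
    """Return the wanted capabilities that the model does not report."""
    present = set(parse_capabilities(show_output))
    return [name for name in wanted if name not in present]
```

The text to parse could be captured with `subprocess.run(["ollama", "show", "GLM-4.7-Q8_0:latest"], capture_output=True, text=True).stdout`.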
I managed to make it work by using the model file below:
FROM GLM-4.7-Q8_0.gguf
SYSTEM """You are a reasoning-focused assistant.
Use ... for internal reasoning.
Provide a concise final answer after thinking.
"""
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>
{{- if and $.IsThinkSet (and $last .Thinking) -}}
{{ .Thinking }}
{{- end }}{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>
{{- if and $.IsThinkSet (not $.Think) -}}
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.01
PARAMETER repeat_penalty 1
PARAMETER num_predict 16384
PARAMETER num_ctx 16384
PARAMETER num_gpu -1
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
$ ollama show GLM-4.7-Q8_0:latest
Model
architecture glm4moe
parameters 358.3B
context length 202752
embedding length 5120
quantization Q8_0
Capabilities
completion
thinking
Parameters
min_p 0.01
num_ctx 16384
num_gpu -1
num_predict 16384
repeat_penalty 1
stop "<|end▁of▁sentence|>"
stop "<|User|>"
temperature 0.6
top_p 0.95
System
You are a reasoning-focused assistant.
Use ... for internal reasoning.
...
I don't use tools with these models, but in my tests Q8 was so-so; I need to try BF16, which is almost the original. Frankly, I have 768 GB of RAM on an ancient 2014 Xeon motherboard. GLM-4.5 in BF16 ran perfectly for me in the past in oobabooga (in LM Studio, models of this size always crashed). With a pure llama.cpp server, all chats can be lost if the computer reboots, which is a very common problem with such large models filling 99% of memory.
Prompt hacks do not help with Q8. The GLM-4.5 Q8 from Unsloth produced much better code.
Guide to using oobabooga:
1. Download the latest portable release of text-generation-webui from GitHub (it is distributed as a single package, like ComfyUI for Windows) and unzip the file.
2. Drop your models into the models folder in user_data (for GGUF, the file size is roughly the amount of RAM (RAM+VRAM) needed). Very large models like Kimi K2 or DeepSeek Speciale can obviously only be used from an external drive, so for those you need to write the path in the CMD_FLAGS.txt file (in the user_data folder), like --model-dir /drive/your/model/folder
3. Launch with start_linux (or the equivalent for your OS) and open the UI in a web browser (do not use RAM-hungry browsers like Chrome).
4. In the Model section, choose your model, then tune the launch settings:
4.1 gpu-layers: set this if you want to use GPU+CPU, or put 0 for CPU only.
4.2 ctx-size is important: it is the context size of the conversation, and a larger value needs more RAM.
4.3 cpu-moe and streaming-llm are up to you.
4.4 Other options are important: threads is the number of your CPU cores, and threads_batch is the number of CPU threads.
4.5 batch_size can be tuned afterwards; this number affects how quickly the answer to a prompt loads.
4.6 no-mmap and numa can be useful for some setups; as I remember, no-mmap prevents the storage drive from being used as model space, and numa is for non-uniform memory.
4.7 Many other settings can be tuned.
5. Click the Load button above and wait for confirmation that the model has loaded into RAM (or RAM+VRAM). It is useful to check used RAM with any system resource monitor. With very big models, usually all RAM is used, leaving only the minimum for the OS itself, so RAM-eating apps need to be closed if the model does not load or is extremely slow (usually because Linux has started using the SSD as model space).
The user_data folder with all models, chats, and settings can be migrated to any newer version of oobabooga.
Oobabooga is also distributed as a Docker container, which may suit corporate environments: http://github.com/ashleykleynhans/text-generation-docker
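The rule of thumb from step 2 (GGUF file size is roughly the RAM+VRAM needed) can be checked before loading. A rough sketch, assuming all shards of one model sit in a single folder; the function names and the 8 GiB OS reserve are illustrative assumptions, and the context (ctx-size) needs additional memory on top of the file size.

```python
from pathlib import Path


def total_gguf_bytes(model_dir: str, pattern: str = "*.gguf") -> int:
    """Sum the sizes of all GGUF files (e.g. split shards) in a folder."""
    return sum(p.stat().st_size for p in Path(model_dir).glob(pattern) if p.is_file())


def fits_in_memory(model_bytes: int, memory_bytes: int,
                   os_reserve_bytes: int = 8 * 1024**3) -> bool:
    """Check whether the model plus an assumed OS reserve fits in RAM+VRAM.

    The default reserve is an arbitrary placeholder, not a measured value.
    """
    return model_bytes + os_reserve_bytes <= memory_bytes
```

For example, `fits_in_memory(total_gguf_bytes("/drive/your/model/folder"), 768 * 1024**3)` would compare the shards in that folder against 768 GiB of RAM.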
I would probably use the chat template from https://ollama.com/MichelRosselli/GLM-4.6:latest/blobs/e683b5dab156 for Ollama. They also use our quants, so I'm assuming this chat template works for Ollama.
FROM GLM-4.7-Q8_0.gguf
[gMASK]
{{- if .Tools }}<|system|>
Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within XML tags:
{{- range .Tools }}
{"function": {{ .Function }}}
{{- end }}
For each function call, return a json object with function name and arguments within XML tags:
{"name": , "arguments": }
{{- end -}}
{{- $lastUserIdx := -1 }}
{{- range $i, $_ := .Messages }}
{{- if eq .Role "user" }}{{- $lastUserIdx = $i }}{{ end }}
{{- end -}}
{{- $prevWasTool := false -}}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- $curIsTool := eq .Role "tool" -}}
{{- $startToolBlock := and $curIsTool (not $prevWasTool) -}}
{{- if eq .Role "user" }}<|user|>
{{ .Content }}
{{- if and $.IsThinkSet (not $.Think) -}}
/nothink
{{- end -}}
{{- else if eq .Role "assistant" }}<|assistant|>
{{- if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) }}
{{ .Thinking }}
{{- else if $.IsThinkSet }}
{{- end }}
{{- if .Content }}
{{ .Content }}
{{- end -}}
{{ if .ToolCalls }}
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}
{{- end }}
{{- else if $curIsTool -}}
{{ if not $prevWasTool }}<|observation|>
{{- end }}
{{ .Content }}
{{- $prevWasTool = true -}}
{{- else if eq .Role "system" -}}<|system|>
{{ .Content }}
{{- end }}
{{- if and (ne .Role "assistant") $last }}<|assistant|>
{{- if and $.IsThinkSet (not $.Think) }}
{{- end -}}
{{- end }}
{{- $prevWasTool = $curIsTool -}}
{{- end }}
$ ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0-3.model
Error: (line 3): command must be one of "from", "license", "template", "system", "adapter", "renderer", "parser", "parameter", or "message"
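The error lists the commands Ollama accepts at the start of a Modelfile line, so template text that is not wrapped in TEMPLATE """...""" is read as an unknown command. Below is a minimal sketch of a pre-check that flags such lines before running `ollama create`. The function name is hypothetical and the handling of triple-quoted blocks is a simplification, not Ollama's actual parser.

```python
# Commands copied from the error message above.
VALID_COMMANDS = {
    "from", "license", "template", "system", "adapter",
    "renderer", "parser", "parameter", "message",
}


def find_invalid_commands(modelfile_text: str) -> list[tuple[int, str]]:
    """Return (line number, first word) for lines that start with an unknown command."""
    problems = []
    inside_block = False  # True while inside a """...""" multi-line value
    for number, line in enumerate(modelfile_text.splitlines(), start=1):
        if inside_block:
            # Simplification: any line containing """ closes the block.
            if '"""' in line:
                inside_block = False
            continue
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        command = stripped.split(None, 1)[0]
        if command.lower() not in VALID_COMMANDS:
            problems.append((number, command))
            continue
        # An odd number of """ on a command line opens a multi-line value.
        if stripped.count('"""') % 2 == 1:
            inside_block = True
    return problems
```

An empty result only means every top-level line starts with a known command; it does not validate the template itself.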
$ cat GLM-4.7-Q8_0-3.model
FROM GLM-4.7-Q8_0.gguf
SYSTEM """You are a reasoning-focused assistant with tool-calling capabilities.
- Use ... for internal reasoning.
- If a tool is needed, use it.
- When you receive a tool observation, incorporate it into your final concise answer.
"""
TEMPLATE """[gMASK]{{- if .System }}<|system|>
{{ .System }}
{{- if .Tools }}
Available tools:
{{- range .Tools }}
{{ . }}
{{- end }}
{{- end }}
{{- end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|user|>
{{ .Content }}
{{- else if eq .Role "assistant" }}<|assistant|>
{{- if .Thinking }}
{{ .Thinking }}
{{ end }}
{{- if .ToolCalls }}<|observation|>
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}
{{- else }}
{{ .Content }}<|endoftext|>
{{- end }}
{{- else if eq .Role "tool" }}<|observation|>
{{ .Content }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|assistant|>
{{- end }}
{{- end }}"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|observation|>"
$ ollama show GLM-4.7-Q8_0:latest
Model
architecture glm4moe
parameters 358.3B
context length 202752
embedding length 5120
quantization Q8_0
Capabilities
completion
tools
thinking
Parameters
stop "<|user|>"
stop "<|assistant|>"
stop "<|endoftext|>"
stop "<|observation|>"
temperature 0.6
top_p 0.95
System
You are a reasoning-focused assistant with tool-calling capabilities.
1. Use ... for internal reasoning.
...
The latest model file works, but when I tested tool calling with VS Code it did not work well. I'd appreciate input from others.
@Danielhanchen, should I download the latest GLM-4.7-Q8_0-00001-of-00008.gguf and retry?
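To separate template problems from VS Code integration problems, tool calling can be tested directly against Ollama's /api/chat endpoint. The sketch below assumes the default local address http://localhost:11434; the get_weather tool and the function names are hypothetical and exist only for this test.

```python
import json
import urllib.request


def build_tool_chat_request(model: str, prompt: str) -> dict:
    """Build an /api/chat payload that offers one hypothetical tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for testing only
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
        "stream": False,
    }


def send_chat(payload: dict, base_url: str = "http://localhost:11434") -> dict:
    """POST the payload to a running Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def extract_tool_calls(response: dict) -> list:
    """Return the tool calls in a chat response, or an empty list if there are none."""
    return response.get("message", {}).get("tool_calls") or []
```

If `extract_tool_calls(send_chat(build_tool_chat_request("GLM-4.7-Q8_0", "What is the weather in Paris?")))` comes back empty, the problem is more likely in the template than in VS Code.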
Yes please do!