GLM-4.7-Q8_0.gguf capabilities with Ollama

by myfenris - opened Dec 25, 2025

Discussion

myfenris

Dec 25, 2025

$ ollama show GLM-4.7-Q8_0:latest
Model
architecture glm4moe
parameters 358.3B
context length 202752
embedding length 5120
quantization Q8_0

Capabilities
completion

The tools and thinking capabilities are missing. I used llama.cpp to merge all the GGUF files. Any ideas or feedback?

myfenris

Dec 25, 2025

./llama-gguf-split --merge ../GLM-4.7-Q8_0-00001-of-00008.gguf ../GLM-4.7-Q8_0.gguf

echo "FROM GLM-4.7-Q8_0.gguf" > "GLM-4.7-Q8_0.model"

ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0.model
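Before merging, it can be worth confirming that every shard finished downloading. The sketch below derives the expected shard names from the first shard's file name, assuming the `-00001-of-00008.gguf` naming pattern seen in the command above. The helper names `expected_shards` and `missing_shards` are hypothetical and not part of llama.cpp or Ollama.

```python
import re
from pathlib import Path

# Assumed naming pattern: <prefix>-<5-digit index>-of-<5-digit total>.gguf
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def expected_shards(first_shard_name: str) -> list[str]:
    """List every shard file name implied by the first shard's name."""
    match = SHARD_RE.match(first_shard_name)
    if match is None:
        raise ValueError(f"not a split GGUF file name: {first_shard_name}")
    total = int(match["total"])
    return [
        f"{match['prefix']}-{index:05d}-of-{total:05d}.gguf"
        for index in range(1, total + 1)
    ]


def missing_shards(directory: str, first_shard_name: str) -> list[str]:
    """Return the expected shard names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected_shards(first_shard_name)
            if not (base / name).is_file()]
```

Calling `missing_shards("..", "GLM-4.7-Q8_0-00001-of-00008.gguf")` returns an empty list when all eight files are in place.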

myfenris

Dec 25, 2025

Managed to make it work by using the model file below:

FROM GLM-4.7-Q8_0.gguf

SYSTEM """You are a reasoning-focused assistant.
Use ... for internal reasoning.
Provide a concise final answer after thinking.
"""

TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<｜User｜>{{ .Content }}
{{- else if eq .Role "assistant" }}<｜Assistant｜>
{{- if and $.IsThinkSet (and $last .Thinking) -}}
<think>
{{ .Thinking }}
</think>
{{- end }}{{ .Content }}{{- if not $last }}<｜end▁of▁sentence｜>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<｜Assistant｜>
{{- if and $.IsThinkSet (not $.Think) -}}
<think>

</think>

{{ end }}
{{- end -}}
{{- end }}"""

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.01
PARAMETER repeat_penalty 1
PARAMETER num_predict 16384
PARAMETER num_ctx 16384
PARAMETER num_gpu -1

PARAMETER stop <｜end▁of▁sentence｜>
PARAMETER stop <｜User｜>

myfenris

Dec 25, 2025

$ ollama show GLM-4.7-Q8_0:latest
Model
architecture glm4moe
parameters 358.3B
context length 202752
embedding length 5120
quantization Q8_0

Capabilities
completion
thinking

Parameters
min_p 0.01
num_ctx 16384
num_gpu -1
num_predict 16384
repeat_penalty 1
stop "<｜end▁of▁sentence｜>"
stop "<｜User｜>"
temperature 0.6
top_p 0.95

System
You are a reasoning-focused assistant.
Use ... for internal reasoning.
...
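To compare the capabilities before and after a model file change without reading the output by eye, the `Capabilities` section can be pulled out of the `ollama show` text. This is a minimal sketch that assumes the layout shown above (a `Capabilities` heading followed by one capability per line and then a blank line); `parse_capabilities` is a hypothetical helper, not an Ollama API.

```python
def parse_capabilities(show_output: str) -> list[str]:
    """Return the entries listed under the 'Capabilities' heading."""
    capabilities: list[str] = []
    in_section = False
    for raw_line in show_output.splitlines():
        line = raw_line.strip()
        if not in_section:
            # Skip everything until the heading is found.
            in_section = line == "Capabilities"
            continue
        if not line:
            if capabilities:
                break  # a blank line after the entries ends the section
            continue  # tolerate a blank line right after the heading
        capabilities.append(line)
    return capabilities
```

The text can come from `subprocess.run(["ollama", "show", "GLM-4.7-Q8_0:latest"], capture_output=True, text=True).stdout`; a model that supports tool calling should then include `tools` in the returned list.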

krustik

Dec 25, 2025

•

edited Dec 25, 2025

I don't use tools with these models, but Q8 was only so-so in my tests; I need to try BF16, which is almost the original. Frankly, I have 768 GB of RAM on an ancient 2014 Xeon motherboard, and GLM-4.5 BF16 ran perfectly in oobabooga in the past (in LM Studio, models of this size always crash). With the plain llama.cpp server, all chats can be lost if the computer reboots, which is a very frequent problem with such large models filling 99% of memory.
Prompt hacks do not help with Q8. The Q8 quant of GLM-4.5 from Unsloth produced much better code.
Guide to using oobabooga:

1. Download the latest release of text-gen portable from GitHub (it is distributed as a single package, like ComfyUI for Windows) and unzip the file.
2. Drop your models into the models folder in user_data. For GGUF, the file size equals the amount of RAM (RAM + VRAM) needed. Very large models such as Kimi K2 or DeepSeek Speciale can obviously only be used from an external drive, so for those the path needs to be written in the CMD_FLAGS.txt file (in the user_data folder), for example --model-dir /drive/your/model/folder
3. Launch with start_linux (or the equivalent for your OS) and open the UI in a web browser (do not use RAM-hungry browsers like Chrome).
4. In the Model section, choose your model, then tune the launch settings:
4.1 gpu-layers: set it if you want to use GPU + CPU, or put 0 for CPU only.
4.2 ctx-size is important: it is the context size of the discussed topic, and more context means more RAM.
4.3 cpu-moe and streaming-llm are your choice.
4.4 The other options are important: threads is the number of your CPU cores, and threads_batch is the number of CPU threads.
4.5 batch_size can be tuned afterwards; it affects how quickly the answer to a prompt loads.
4.6 no-mmap and numa can be useful for some; as I remember, no-mmap is against using the storage drive as model space, and numa is for non-uniform memory.
4.7 Many other settings can be experimented with.
5. Click the Load button above and wait for confirmation that the model has loaded into RAM (or RAM + VRAM). It is useful to check used RAM with any system resources app. With very big models, usually all RAM is used, leaving only a minimum for the OS itself, so RAM-eating apps need to be closed if the model does not load or is very slow (usually because Linux has started using the SSD as model space).
6. The user_data folder, with all models, chats, and settings, can be migrated into any newer version of oobabooga.
7. Oobabooga is also distributed as a Docker container, maybe for corporate environments: http://github.com/ashleykleynhans/text-generation-docker
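The rule of thumb from step 2 (GGUF file size is roughly the RAM + VRAM needed) can be checked before loading. The sketch below sums the sizes of the GGUF files in a folder and compares the total with the memory you say is available. The function names and the `headroom_gib` parameter are hypothetical, and the estimate ignores the extra memory that ctx-size adds on top of the weights.

```python
from pathlib import Path

GIB = 1024 ** 3  # bytes per GiB


def total_gguf_bytes(model_dir: str, pattern: str = "*.gguf") -> int:
    """Sum the on-disk size of all GGUF files (e.g. every shard) in a folder."""
    return sum(path.stat().st_size
               for path in Path(model_dir).glob(pattern)
               if path.is_file())


def fits_in_memory(model_bytes: int, available_gib: float,
                   headroom_gib: float = 0.0) -> bool:
    """Rough check: do the weights fit in RAM + VRAM, minus OS headroom?"""
    return model_bytes <= (available_gib - headroom_gib) * GIB
```

For example, `fits_in_memory(total_gguf_bytes("user_data/models/GLM-4.7"), available_gib=768, headroom_gib=16)` gives a quick yes or no for the machine described above; the folder path here is only an illustration.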

krustik

Dec 25, 2025

This comment has been hidden (marked as Off-Topic)

victor

Dec 27, 2025

Can you stop the AI spam @krustik ?

krustik

Dec 27, 2025

Excuse me, but I thought that was related.
Is there a "hide" function on this portal, like the one used in many forums, which shows content only if the user is interested? @victor

adamm-hf

Dec 28, 2025

Hi @krustik ! 🤗 If possible, please keep the AI-generated segments relatively short so they're not huge walls of text in each discussion. Some short examples should be perfect for showing the differences between models, quants, etc.

danielhanchen

Unsloth AI org Dec 28, 2025

I would probably use the chat template from https://ollama.com/MichelRosselli/GLM-4.6:latest/blobs/e683b5dab156 for Ollama. They also use our quants, so I'm assuming this chat template works for Ollama.

myfenris

Dec 28, 2025

FROM GLM-4.7-Q8_0.gguf

[gMASK]

{{- if .Tools }}<|system|>

Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within XML tags:

{{- range .Tools }}
{"function": {{ .Function }}}
{{- end }}

For each function call, return a json object with function name and arguments within XML tags:

{"name": , "arguments": }

{{- end -}}

{{- $lastUserIdx := -1 }}
{{- range $i, $_ := .Messages }}
{{- if eq .Role "user" }}{{- $lastUserIdx = $i }}{{ end }}
{{- end -}}

{{- $prevWasTool := false -}}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- $curIsTool := eq .Role "tool" -}}
{{- $startToolBlock := and $curIsTool (not $prevWasTool) -}}

{{- if eq .Role "user" }}<|user|>
{{ .Content }}

{{- if and $.IsThinkSet (not $.Think) -}}
/nothink
{{- end -}}

{{- else if eq .Role "assistant" }}<|assistant|>
{{- if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) }}
{{ .Thinking }}
{{- else if $.IsThinkSet }}

{{- end }}
{{- if .Content }}
{{ .Content }}
{{- end -}}

{{ if .ToolCalls }}

{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}

{{- end }}

{{- else if $curIsTool -}}
{{ if not $prevWasTool }}<|observation|>
{{- end }}

{{ .Content }}

{{- $prevWasTool = true -}}

{{- else if eq .Role "system" -}}<|system|>
{{ .Content }}
{{- end }}

{{- if and (ne .Role "assistant") $last }}<|assistant|>
{{- if and $.IsThinkSet (not $.Think) }}

{{- end -}}
{{- end }}

$ ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0-3.model

Error: (line 3): command must be one of "from", "license", "template", "system", "adapter", "renderer", "parser", "parameter", or "message"
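The error points at line 3 because the pasted template is not wrapped in a `TEMPLATE """..."""` instruction, so `[gMASK]` is read as a command. The sketch below is a rough pre-flight check under that reading: it uses the command names from the error message and skips over triple-quoted blocks. It is a simplified approximation, not Ollama's actual parser, and `find_invalid_lines` is a hypothetical helper.

```python
# Command names copied from the error message above.
VALID_COMMANDS = {
    "from", "license", "template", "system", "adapter",
    "renderer", "parser", "parameter", "message",
}


def find_invalid_lines(modelfile_text: str) -> list[tuple[int, str]]:
    """Return (line number, text) for lines that do not start with a command."""
    problems = []
    in_block = False  # True while inside a multi-line triple-quoted value
    for number, raw_line in enumerate(modelfile_text.splitlines(), start=1):
        line = raw_line.strip()
        if in_block:
            if '"""' in line:
                in_block = False  # closing quotes end the block
            continue
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        command = line.split(maxsplit=1)[0].lower()
        if command not in VALID_COMMANDS:
            problems.append((number, line))
            continue
        if line.count('"""') == 1:
            in_block = True  # value opens here and closes on a later line
    return problems
```

Run on the file above, the check flags `[gMASK]` on line 3 first, matching the reported error; the later template lines are flagged too, since none of them sit inside a `TEMPLATE` block.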

myfenris

Dec 28, 2025

$ cat GLM-4.7-Q8_0-3.model
FROM GLM-4.7-Q8_0.gguf

SYSTEM """You are a reasoning-focused assistant with tool-calling capabilities.

Use ... for internal reasoning.
If a tool is needed, use it.
When you receive a tool observation, incorporate it into your final concise answer.
"""

TEMPLATE """[gMASK]{{- if .System }}<|system|>
{{ .System }}
{{- if .Tools }}
Available tools:
{{- range .Tools }}
{{ . }}
{{- end }}
{{- end }}
{{- end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|user|>
{{ .Content }}
{{- else if eq .Role "assistant" }}<|assistant|>
{{- if .Thinking }}
{{ .Thinking }}
{{ end }}
{{- if .ToolCalls }}<|observation|>
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}
{{- else }}
{{ .Content }}<|endoftext|>
{{- end }}
{{- else if eq .Role "tool" }}<|observation|>
{{ .Content }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|assistant|>

{{- end }}
{{- end }}"""

$ ollama show GLM-4.7-Q8_0:latest
Model
architecture glm4moe
parameters 358.3B
context length 202752
embedding length 5120
quantization Q8_0

Capabilities
completion
tools
thinking

System
You are a reasoning-focused assistant with tool-calling capabilities.
1. Use ... for internal reasoning.
...

The latest model file works, but tool calling did not work well when I tested it with VS Code. I would appreciate input from others.
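One way to narrow this down is to test tool calling against Ollama directly, without VS Code in between. The sketch below builds a request for Ollama's `/api/chat` endpoint with a single tool definition and returns whatever `tool_calls` the model produces. The `get_weather` tool is a made-up example, the URL assumes Ollama's default local address, and the model name is the one created earlier in this thread.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default address


def build_tool_chat_request(model: str, prompt: str) -> dict:
    """Build a non-streaming chat request that offers one example tool."""
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for testing only
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                    },
                    "required": ["city"],
                },
            },
        }],
    }


def request_tool_calls(payload: dict, url: str = OLLAMA_CHAT_URL) -> list:
    """POST the payload and return the tool calls from the reply, if any."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body.get("message", {}).get("tool_calls", [])
```

If `request_tool_calls(build_tool_chat_request("GLM-4.7-Q8_0", "What is the weather in Paris?"))` comes back empty while the reply text contains raw JSON, the template is probably not emitting tool calls in a form Ollama can parse, which would point at the model file rather than VS Code.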

myfenris

Dec 28, 2025

@danielhanchen, should I download the latest GLM-4.7-Q8_0-00001-of-00008.gguf and retry?

danielhanchen

Unsloth AI org Dec 30, 2025

> @danielhanchen, should I download the latest GLM-4.7-Q8_0-00001-of-00008.gguf and retry?

Yes please do!
