Instructions to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF",
	filename="IQ2_XXS/GLM-4.7-PRISM-IQ2_XXS-00001-of-00002.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./llama-cli -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Use Docker

docker model run hf.co/AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

LM Studio
Jan
Ollama
How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with Ollama:
```
ollama run hf.co/AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL
```

Unsloth Studio new

How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF to start chatting

Pi new

How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Run Hermes

hermes

Docker Model Runner
How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with Docker Model Runner:
```
docker model run hf.co/AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL
```

Lemonade

How to use AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF:UD-Q4_K_XL

Run and chat with the model

lemonade run user.GLM-4.7-PRISM-Unsloth-GGUF-UD-Q4_K_XL

List all available models

lemonade list

Request for Q8 or Q6

by silverfangx - opened Jan 27

Discussion

silverfangx

Jan 27

Thank you so much for making Q4_K_XL. If you could make Q8 or Q6, I would really appreciate it.

AliceThirty

Owner Jan 28

I have the files ready, but I tried four times to upload them, and huggingface always raises an error:

Bad request for commit endpoint:
Your push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file: - UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf

And the faulty file is never the same, this is completely random. I will try to upload them by hand one by one, but this is driving me crazy. (Could it be that I didn't pay for huggingface-PRO?)

silverfangx

Jan 28

Have you try manual Re-tracking? I don't know if this will help or not.
git add --renormalize UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf
git commit -m "Fix LFS pointer"
git push

AliceThirty

Owner Jan 28

•

edited Jan 28

I don't use git, I use the huggingface python library

from huggingface_hub.hf_api import HfApi
api = HfApi()
api.upload_folder(
    repo_id="AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF",
    folder_path="quantization/GLM-4.7-PRISM/UD-Q8_K_XK",
    path_in_repo="UD-Q8_K_XK",
    token="abcdefgh",
)

with the python libraries hf-xet 1.2.0 and huggingface_hub 1.2.3

I "solved" this by uploading the files one by one and restarting when it fails... But this should be the role of huggingface to do this, not mine. Imagine if TCP requires an human verification for every packet

AliceThirty

Owner Jan 28

•

edited Jan 28

Let me know if it works for you. By the way, what is your hardware and your generation speed please? I have a 9950x3D and it's max capacity is 192GB of ram (6000MT/s), which is why I initially quantized the Q4 version, and my generation speed is 5 t/s. The Q4 version barely fits in my ram if I offload some layers on my 5090

silverfangx

Jan 28

My hardware is a potato, haha. I just use cloud computer. When I ran the Q8_K_XL on an AMD EPYC 64-core, I was getting [Prompt: 8.0 t/s | Generation: 2.0 t/s]. At that speed, it is unusable on the CPU. Maybe there are some optimizations that could improve the speed, but I’m planning to run the whole model on 8x RTX A6000s. Thanks again for uploading the Q8 and Q6 versions!

AliceThirty

Owner Jan 29

Ah yes, I see why yours is so slow. The Amd EPYC 64-core is compatible with 4800 MT/s ram. So the ram is slower, and the there are twice as much data to load per token (Q8 vs Q4), so 6000Hz/4800Hz * 8/4 = 2.5. My generation speed is 2.5 * faster than yours. If you offload on the GPUs, the CPU will have much more less memory to load and it will be much faster

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment