Request for Q8 or Q6
Thank you so much for making Q4_K_XL. If you could make Q8 or Q6, I would really appreciate it.
I have the files ready, but I tried four times to upload them, and huggingface always raises an error:
Bad request for commit endpoint:
Your push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file: - UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf
And the faulty file is never the same; it is completely random. I will try to upload them by hand one by one, but this is driving me crazy. (Could it be because I'm not paying for Hugging Face PRO?)
Have you tried manually re-tracking the file? I don't know whether this will help.
git add --renormalize UD-Q8_K_XL/GLM-4.7-PRISM-UD-Q8_K_XL-00001-of-00008.gguf
git commit -m "Fix LFS pointer"
git push
I don't use git; I use the Hugging Face Python library:
from huggingface_hub import HfApi

api = HfApi()
# Upload the whole quantization folder in one call.
api.upload_folder(
    repo_id="AliceThirty/GLM-4.7-PRISM-Unsloth-GGUF",
    folder_path="quantization/GLM-4.7-PRISM/UD-Q8_K_XL",
    path_in_repo="UD-Q8_K_XL",
    token="abcdefgh",  # placeholder, not a real token
)
This is with the Python libraries hf-xet 1.2.0 and huggingface_hub 1.2.3.
I "solved" this by uploading the files one by one and restarting whenever an upload failed... But handling this should be Hugging Face's job, not mine. Imagine if TCP required human verification for every packet.
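For anyone hitting the same error, the "one by one, restart on failure" workaround can be automated. The following is only a rough sketch, not what was actually run in this thread: `upload_with_retry` and `make_hf_uploader` are hypothetical helper names, the retry count and backoff delays are arbitrary assumptions, and it assumes the failures are transient so that simply retrying the same file eventually succeeds. The only Hugging Face call used is `HfApi.upload_file`.

```python
import time
from pathlib import Path


def upload_with_retry(upload_one, files, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Upload each file in turn, retrying failures with exponential backoff.

    `upload_one` is any callable taking a single path. Returns the list of
    files that still failed after `max_attempts` tries.
    """
    failed = []
    for f in files:
        for attempt in range(1, max_attempts + 1):
            try:
                upload_one(f)
                break  # this file is done, move on to the next one
            except Exception as exc:  # deliberately broad: the errors looked random
                print(f"{f}: attempt {attempt} failed ({exc})")
                if attempt == max_attempts:
                    failed.append(f)
                else:
                    # Wait 1x, 2x, 4x, ... the base delay before retrying.
                    sleep(base_delay * 2 ** (attempt - 1))
    return failed


def make_hf_uploader(repo_id, local_folder, token=None):
    """Hypothetical helper: build an `upload_one` callable around HfApi.upload_file."""
    from huggingface_hub import HfApi  # imported lazily so the sketch loads without it

    api = HfApi(token=token)
    folder = Path(local_folder)

    def upload_one(path):
        # Keep the folder name as the prefix inside the repository.
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=f"{folder.name}/{Path(path).name}",
            repo_id=repo_id,
        )

    return upload_one
```

Usage would look roughly like `upload_with_retry(make_hf_uploader(repo_id, folder), sorted(Path(folder).glob("*.gguf")))`, then checking that the returned list is empty. Uploading shard by shard means one failure only costs one file instead of the whole folder.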
Let me know if it works for you. By the way, what is your hardware and your generation speed, please? I have a 9950X3D, and its maximum capacity is 192 GB of RAM (6000 MT/s), which is why I initially quantized the Q4 version. My generation speed is 5 t/s. The Q4 version barely fits in my RAM if I offload some layers to my 5090.
My hardware is a potato, haha. I just use a cloud computer. When I ran the Q8_K_XL on an AMD EPYC 64-core, I was getting [Prompt: 8.0 t/s | Generation: 2.0 t/s]. At that speed, it is unusable on the CPU. Maybe there are some optimizations that could improve the speed, but I’m planning to run the whole model on 8x RTX A6000s. Thanks again for uploading the Q8 and Q6 versions!
Ah yes, I see why yours is so slow. The AMD EPYC 64-core supports 4800 MT/s RAM. So the RAM is slower, and there is twice as much data to load per token (Q8 vs Q4), so (6000 MT/s / 4800 MT/s) * (8 / 4) = 2.5. My generation speed is 2.5 times faster than yours. If you offload to the GPUs, the CPU will have much less memory to read and it will be much faster.
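The back-of-the-envelope estimate above can be written out as a small calculation. This is a sketch under strong assumptions that the thread does not verify: that generation is purely memory-bandwidth bound, that both machines have the same number of memory channels, and that Q8 and Q4 cost exactly 8 and 4 bits per weight. `relative_speed` is a hypothetical helper name.

```python
def relative_speed(rate_a_mts, rate_b_mts, bits_a, bits_b):
    """Estimated generation speed of system A relative to system B.

    Assumes tokens/s scales with memory transfer rate and inversely with
    the number of bits that must be read per weight.
    """
    return (rate_a_mts / rate_b_mts) * (bits_b / bits_a)


# System A: 6000 MT/s RAM running Q4. System B: 4800 MT/s RAM running Q8.
ratio = relative_speed(6000, 4800, bits_a=4, bits_b=8)

# Scale the reported 2.0 t/s (system B) by the estimated ratio.
predicted_a = 2.0 * ratio

print(f"estimated ratio: {ratio}")
print(f"predicted speed for system A: {predicted_a} t/s")
```

The result, 2.5 and a predicted 5 t/s, matches the two speeds reported in this thread, but real systems can differ a lot in channel count and effective bandwidth, so treat it as a rough sanity check only.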