AutoGLM-Phone-9B-Multilingual (GGUF Quantizations)
This repository contains GGUF quantizations of zai-org/AutoGLM-Phone-9B-Multilingual for local inference with llama.cpp.
It includes the vision encoder (mmproj), which is required for multimodal input and GUI agent tasks.
📦 Model Files
| File | Quantization | Size | VRAM | Description |
|---|---|---|---|---|
| AutoGLM-Phone-9B-Multilingual-q4_k_m.gguf | Q4_K_M | 5.7G | ~10GB | Performance balanced |
| AutoGLM-Phone-9B-Multilingual-q5_k_m.gguf | Q5_K_M | 6.6G | ~11GB | High quality |
| AutoGLM-Phone-9B-Multilingual-q6_k.gguf | Q6_K | 7.7G | ~12GB | Excellent quality |
| AutoGLM-Phone-9B-Multilingual-q8_0.gguf | Q8_0 | 9.4G | ~14GB | Best quality |
| mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf | F16 | 1.7G | - | Vision Encoder (required) |
Total storage: ~31GB (all quantizations + vision encoder)
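Because the full repository is roughly 31GB, it may be preferable to fetch only one quantization plus the vision encoder. The following is a minimal sketch of that, assuming the `huggingface_hub` Python package is installed. The helper names `files_for` and `download_subset` are illustrative and not part of any library.

```python
REPO_ID = "gannima/AutoGLM-Phone-9B-Multilingual-GGUF"
MMPROJ = "mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf"
QUANTS = ("q4_k_m", "q5_k_m", "q6_k", "q8_0")  # suffixes from the table above


def files_for(quant: str) -> list[str]:
    """Return the two files needed for one quantization: weights + vision encoder."""
    q = quant.lower()
    if q not in QUANTS:
        raise ValueError(f"unknown quantization {quant!r}; expected one of {QUANTS}")
    return [f"AutoGLM-Phone-9B-Multilingual-{q}.gguf", MMPROJ]


def download_subset(quant: str, local_dir: str = "./AutoGLM-Phone-9B-Multilingual") -> list[str]:
    """Download only the files for `quant` and return their local paths."""
    # Imported lazily so files_for() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in files_for(quant)
    ]

# Example (downloads about 7.4G): download_subset("Q4_K_M")
```

The vision encoder is always included because the table marks it as required.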
🚀 Quick Start
1. Install llama.cpp
Recent llama.cpp versions build with CMake rather than make. The -DGGML_CUDA=ON option enables CUDA; omit it for a CPU-only build.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-server
2. Download Model
huggingface-cli download gannima/AutoGLM-Phone-9B-Multilingual-GGUF \
--local-dir ./AutoGLM-Phone-9B-Multilingual
3. Run Server
./build/bin/llama-server \
-m AutoGLM-Phone-9B-Multilingual/AutoGLM-Phone-9B-Multilingual-q8_0.gguf \
--mmproj AutoGLM-Phone-9B-Multilingual/mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf \
-c 32768 \
-ngl 99 \
--flash-attn on \
--host 0.0.0.0 \
--port 8080
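Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below sends one screenshot plus a text prompt using only the Python standard library. It assumes the server is reachable at `http://localhost:8080` (the port used above) and that it was started with `--mmproj`. The file name `screenshot.png` and the helper names are hypothetical.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: same host/port as the server command


def build_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style chat request with one text part and one PNG image part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "AutoGLM-Phone-9B-Multilingual",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 256,
    }


def chat(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server and a local PNG file):
# with open("screenshot.png", "rb") as f:
#     print(chat(build_payload(f.read(), "Describe this screen.")))
```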
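4. Use with Open-AutoGLM
Run the agent from a local checkout of the Open-AutoGLM repository and point it at the llama-server endpoint. The task string 打开设置应用 means "Open the Settings app".
cd Open-AutoGLM
python main.py \
--base-url http://localhost:8080 \
--model "AutoGLM-Phone-9B-Multilingual" \
--apikey dummy \
"打开设置应用" \
--max-steps 20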
💻 Hardware Requirements
Quick Reference (Tested on RTX 4090)
| Quantization | Model Size | Vision Encoder | Total | Actual VRAM* | Quality |
|---|---|---|---|---|---|
| Q4_K_M | 5.7G | 1.7G | ~7.4G | ~10GB | Good |
| Q5_K_M | 6.6G | 1.7G | ~8.3G | ~11GB | Very Good |
| Q6_K | 7.7G | 1.7G | ~9.4G | ~12GB | Excellent |
| Q8_0 | 9.4G | 1.7G | ~11.1G | ~14GB | Best |
*VRAM usage measured with --flash-attn on and all layers on GPU (-ngl 99)
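The gap between the on-disk total and the measured VRAM is runtime overhead, presumably the KV cache and compute buffers at the tested settings. A short sketch of that arithmetic, using only the approximate figures from the table above (the function name is illustrative):

```python
# (model file, vision encoder, measured VRAM) in GB, copied from the table above.
TABLE = {
    "Q4_K_M": (5.7, 1.7, 10.0),
    "Q5_K_M": (6.6, 1.7, 11.0),
    "Q6_K": (7.7, 1.7, 12.0),
    "Q8_0": (9.4, 1.7, 14.0),
}


def runtime_overhead_gb(quant: str) -> float:
    """Measured VRAM minus the size of the files loaded onto the GPU."""
    model, encoder, measured = TABLE[quant]
    return round(measured - (model + encoder), 1)


for name in TABLE:
    print(f"{name}: ~{runtime_overhead_gb(name)} GB beyond the file sizes")
```

Since the measured values are rounded, treat the result as a rough budget of about 2.5 to 3 GB on top of the file sizes, not an exact figure.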
System Requirements
- OS: Linux (Ubuntu 22.04+ recommended), Windows 11 with WSL2
- RAM: 32GB+ system memory recommended
- Storage: SSD with sufficient space for model files
- CUDA: 12.0+ for GPU acceleration
- llama.cpp: A recent build that includes GLM4V support (PR #18042)
Performance Notes
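- Flash Attention: Enabled with --flash-attn on for better performance
- KV Cache: Can be quantized to Q8_0 with --cache-type-k q8_0 --cache-type-v q8_0 to reduce memory usage (the Quick Start command does not set these flags)
- Batch Size: Tuned on an RTX 4090; adjust for your GPU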
- Context: Supports up to 32K tokens with M-RoPE
- All layers on GPU: Set -ngl 99 to offload all transformer layers to the GPU
🎯 Recommended Usage
For GUI Agent Tasks (Recommended)
Use Q5_K_M or Q6_K for the best balance between quality and performance:
- Better reasoning accuracy than Q4_K_M
- Faster inference than Q8_0
- Lower VRAM usage than Q8_0 (~11-12GB versus ~14GB)
For Maximum Quality
Use Q8_0 when:
- You want the highest possible accuracy
- You are running on an RTX 4090 or better
- You are running complex multi-step GUI automation tasks
For Consumer GPUs
Use Q4_K_M when:
- VRAM is limited (12GB cards such as the RTX 4070)
- You need faster inference
- You are running on consumer gaming GPUs
📄 License
This model is governed by the GLM-4 License. Please refer to the original model repository for details: https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual
🙏 Acknowledgments
- Original Model: zai-org/AutoGLM-Phone-9B-Multilingual
- Conversion Tool: llama.cpp
- GLM4V Support: PR #18042
Conversion Date: 2025-12-29
llama.cpp Version: latest (with GLM4V support)
Tested Hardware: RTX 4090 24GB