AesSedai/GLM-5-GGUF with the Q4_K_M quant and 100K context on 8× RTX 3090
Just sharing a successful local inference report for AesSedai/GLM-5-GGUF using the Q4_K_M quant, listed at ~432.8 GiB (4.93 BPW) on the model card.
Machine
CPU: AMD EPYC 7532, 32 cores / 64 threads
RAM: 503 GiB DDR4 2933 MHz (≈78 GiB used, ≈424 GiB free during tests)
GPU: 8 × NVIDIA GeForce RTX 3090 (24 GiB each)
NVLink: GPU 1↔5 and GPU 3↔6 (2 NVLink bridges, 4 active links total)
OS: Fedora Linux 42, kernel 6.18.x
Serving stack: latest ik_llama.cpp (CUDA build)
NVIDIA driver: 580.126.09
Model files were downloaded via LM Studio only; all inference is done with ik_llama.cpp.
Command
~/ik_llama.cpp/build/bin/llama-server \
  --model /home/admin_ia/.cache/lm-studio/models/AesSedai/GLM-5-GGUF/GLM-5-Q4_K_M-00001-of-00011.gguf \
  --alias GLM-5-Q4_K_M \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 102400 --no-mmap \
  --threads 32 --threads-batch 64 \
  --batch-size 1024 --ubatch-size 1024 \
  --parallel 1 --flash-attn on \
  --n-gpu-layers 999 \
  --split-mode graph --split-mode-graph-scheduling \
  --tensor-split 0.65,1,1,1,0.9,1,1,0.75 \
  --cpu-moe --n-cpu-moe 248 \
  --cache-type-k q6_0 --cache-type-v q4_0 \
  --k-cache-hadamard --graph-reuse -muge \
  --cache-ram 32768 --jinja
Note: for this GGUF, ik_llama.cpp prints "Split mode 'graph' is not supported for this model => changing split mode to 'layer'", so in practice it falls back to layer parallelism, and --tensor-split has negligible impact on VRAM distribution.
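Even though the ratios end up being ignored here due to the layer-split fallback, normalizing the --tensor-split values shows the per-GPU weight share that was requested. A quick sketch (plain shell arithmetic, nothing ik_llama.cpp-specific):

```shell
# Normalize the --tensor-split ratios from the command above into per-GPU
# fractions (informational only: with the fallback to layer split these
# ratios have little effect for this model).
splits="0.65 1 1 1 0.9 1 1 0.75"
total=$(echo "$splits" | awk '{s=0; for (i=1; i<=NF; i++) s+=$i; print s}')
i=0
for w in $splits; do
  printf 'GPU%d share: %s\n' "$i" \
    "$(awk -v w="$w" -v t="$total" 'BEGIN {printf "%.3f", w/t}')"
  i=$((i+1))
done
# GPU0 share: 0.089, GPU1 share: 0.137, ..., GPU4: 0.123, GPU7: 0.103
```

The lower shares on GPU 0, 4, and 7 are presumably meant to leave headroom for the KV cache and compute buffers on those cards.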
Performance (measured via opencode)
Prompt phase: ~25.9 tokens/s
Generation phase: ~6.1 tokens/s
This is with --parallel 1 and long‑context prompts (~100k tokens) on the 8×3090 setup.
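For a sense of wall-clock time at these rates, here's a back-of-envelope estimate; the 2k-token completion length is my assumption, the tokens/s figures are the measured ones above:

```shell
# Estimated wall-clock time for one long-context request at the measured
# rates: 25.9 tok/s prefill, 6.1 tok/s generation.
awk 'BEGIN {
  prompt = 100000; gen = 2000   # assumed: 100k-token prompt, 2k-token reply
  pp = 25.9; tg = 6.1           # measured tokens/s from the report above
  printf "prefill: %.0f s, generation: %.0f s, total: ~%.1f min\n", prompt/pp, gen/tg, (prompt/pp + gen/tg)/60
}'
# prefill: 3861 s, generation: 328 s, total: ~69.8 min
```

So a single full-context turn is on the order of an hour, dominated by the prompt phase.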
Many thanks again to AesSedai for making this quant available – it’s impressive to see a 744B‑parameter MoE model with 40B active parameters running locally at this scale.
This test takes a long time to run: loading the LLM alone takes ~45 minutes (~410 GB in RAM), so the whole exercise took about a day. Next step is to test it with my dev team.
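That 45-minute load is consistent with the sheer volume of weights. A rough sanity check, assuming ~410 GB read into RAM over ~45 minutes:

```shell
# Effective weight-loading bandwidth implied by the report above:
# ~410 GB loaded in ~45 minutes.
awk 'BEGIN { printf "~%.0f MB/s\n", 410 * 1000 / (45 * 60) }'
# ~152 MB/s
```

Roughly 150 MB/s effective, which is why the load dominates the setup time rather than the inference itself.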
Thanks for testing this one! I've improved my methodology (and note-taking) since making this quant and I should come back to it and clean the model page up to my new standards :)
I'm revisiting K2.5 currently because it looks like the IQ3_S and IQ2_S quants are actually not performing correctly. So if you plan on testing those, I'd recommend sticking to the Q4_X (full quality) or IQ2_XXS, which doesn't show the same issues as the other two quants.
I'll likely revisit GLM-4.5 and GLM-4.6 similarly to clean them up too.
@martossien cool setup and a great test, thank you for sharing this. Any chance you could try Qwen3.5 397B on your setup too? I've wanted to build something like yours and I'm wondering how many t/s it would get with 8×3090s. I don't even have enough RAM to fit the model at Q4, but I tried mmap and streaming from NVMe and got ~3 t/s, which is unusable, but it was still exciting to be able to run the model at all.
nvm, I just saw you already shared Qwen test results a day ago; thanks for that too!
@havenoammo my Qwen3.5 quants (including 397B) use the newer fused up + gate conversion and if you're doing mixed RAM + VRAM offloading there is a pretty nasty TG speed penalty at the moment. I opened an issue about it (https://github.com/ggml-org/llama.cpp/issues/20883) and there is a PR that should fix it (https://github.com/ggml-org/llama.cpp/pull/20910).
Just something to keep in mind until that PR gets merged.
