Apr 11: Updated with Google chat template fixes + more

#16
by danielhanchen - opened
Unsloth AI org

Hey everyone, we’ve updated the quants again to include all of Google’s official chat template fixes (which fixed/improved tool-calling), along with the latest llama.cpp fixes.

We know there has been a lot of re-downloading lately, and we appreciate your patience. We push updates whenever fixes become available to make sure you always have the latest and best-performing quants.

NVIDIA is working on the CUDA 13.2 issue. Until it is fixed, do not use CUDA 13.2.

danielhanchen pinned discussion

do we need to use this parameter with llama.cpp:

--chat-template-file google-gemma-4-31B-it-interleaved.jinja

?

latest gemma-4-31B-it-UD-Q4_K_XL.gguf is literally twice as slow for me in LM Studio, compared to the one I downloaded a week or so ago.

I can't even RUN this version now... How can I download the old one???

/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The latest Q8XL UD - (11 Apr 4:47GMT) is far worse at reasoning/math than any of the prior downloads of this dynamic quant. It's worse than the current Q4XL UD (11 Apr 4:33GMT) when tested on AIME level math problems.

ollama Error: 500 Internal Server Error: unable to load model: C:\Users\xxx.ollama\models\blobs\sha256-xxxxxxxxxxxxxxxxxxx

I'm also experiencing slower inference speeds and less accurate responses, but maybe I need to do more testing.

do we need to use this parameter with llama.cpp:

--chat-template-file google-gemma-4-31B-it-interleaved.jinja

?

Yes, you will still need to use llama.cpp's interleaved chat template; I don't think Google has updated it.
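As an illustration, a llama-server launch with the template override might look like this (model and template filenames are taken from this thread; paths and context size are placeholders you should adjust):

```shell
# Pass the interleaved Jinja template explicitly so llama.cpp does not
# fall back to the chat template embedded in the GGUF.
llama-server \
  --model gemma-4-31B-it-UD-Q4_K_XL.gguf \
  --chat-template-file google-gemma-4-31B-it-interleaved.jinja \
  --ctx-size 8192
```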

latest gemma-4-31B-it-UD-Q4_K_XL.gguf is literally twice as slow for me in LM Studio, compared to the one I downloaded a week or so ago.

I'm also experiencing slower inference speeds and less accurate responses, but maybe I need to do more testing.

We didn't change anything with the layers, so it might be that llama.cpp or LM Studio hasn't been updated to the latest version.

I can't even RUN this version now... How can I download the old one???

/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

Did you update to the latest version of llama.cpp?

Unsloth AI org

ollama Error: 500 Internal Server Error: unable to load model: C:\Users\xxx.ollama\models\blobs\sha256-xxxxxxxxxxxxxxxxxxx

GGUFs with separate mmproj files are not supported in Ollama. Use a llama.cpp-supported backend instead.
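As a sketch, a recent llama.cpp build can load the model together with its separate mmproj file through the multimodal CLI (filenames here are illustrative, not the exact blob names from this repo):

```shell
# Load the text model GGUF plus the matching vision projector (mmproj)
# and run a single image prompt.
llama-mtmd-cli \
  -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image input.jpg \
  -p "Describe this image."
```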

Meanwhile, you can try to "remove" the vision model part:

Run this command to create a Modelfile:
ollama show --modelfile hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M > gemma_4_unsloth_modelfile

Then open the gemma_4_unsloth_modelfile and add a # character in front of the second FROM line:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M

FROM /root/.ollama/models/blobs/sha256-2f8672b0c2cca8dedfb8782815c2769ccdaa6512788f3ee87b32cf117f0dffc1
#FROM /root/.ollama/models/blobs/sha256-fc2ebf4c44528daa2cea7b39891712847ca5e4f87dcf578054a06c46bfe6da27
TEMPLATE "{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
"
PARAMETER stop <bos>
PARAMETER stop <|turn>
PARAMETER stop <turn|>
PARAMETER stop <|turn>user

Then run:

ollama create gemma-4-unsloth -f gemma_4_unsloth_modelfile

Now you can run the unsloth version without vision:

ollama run gemma-4-unsloth

From this [GitHub issue](https://github.com/ollama/ollama/issues/15235#issuecomment-4187108500).
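If you'd rather not edit the Modelfile by hand, the steps above can be automated: comment out every FROM line after the first one with a sed one-liner (GNU sed; the sample file below is a stand-in for the real Modelfile produced by `ollama show --modelfile`, with placeholder blob names):

```shell
# Sample Modelfile stand-in for demonstration; in practice this file comes
# from "ollama show --modelfile" and contains real sha256 blob paths.
printf 'FROM /blobs/model-blob\nFROM /blobs/mmproj-blob\n' > gemma_4_unsloth_modelfile

# Comment out every FROM line after the first: the first FROM is the model
# blob, any later FROM is the vision/mmproj blob we want Ollama to skip.
sed -i '0,/^FROM/!{/^FROM/s/^/#/}' gemma_4_unsloth_modelfile

cat gemma_4_unsloth_modelfile
```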

Built the latest llama.cpp b8779, and it still can't run this model on a V100.

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/bfleming/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The last upload of these models worked great.

Built the latest llama.cpp b8779, and it still can't run this model on a V100.

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/bfleming/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The last upload of these models worked great.

What's your Linux distro? What's your CUDA version?

Built the latest llama.cpp b8779, and it still can't run this model on a V100.

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/bfleming/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The last upload of these models worked great.

In my case, working fine.

(Screenshots: 2026-04-14 01.06.18 and 01.06.36)

Awesome, glad to know it's working for you, and thanks for sharing your setup.

I'm on the older 12.2/12.8 driver/library currently. I was under the impression that V100 and Compute 70 were no longer supported on the 13.x branch. But I was obviously misinformed and will update the OS and Driver to 13.0.

Just to be complete, can you share how you compiled your version of llama.cpp, if you know? This is what I had come up with and had been using:

(screenshot of build commands)

Awesome, glad to know it's working for you, and thanks for sharing your setup.

I'm on the older 12.2/12.8 driver/library currently. I was under the impression that V100 and Compute 70 were no longer supported on the 13.x branch. But I was obviously misinformed and will update the OS and Driver to 13.0.

Just to be complete, can you share how you compiled your version of llama.cpp, if you know? This is what I had come up with and had been using:

(screenshot of build commands)

My apologies, this is my CUDA/nvcc version.

(Screenshot: 2026-04-14 01.25.55)

nvcc 13 does not support the V100.
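For reference, a typical CUDA build of llama.cpp pinned to the V100's compute capability 7.0 might look like this (a sketch; it assumes a CUDA 12.x toolkit, since CUDA 13 dropped compute 7.0 support):

```shell
# Configure with CUDA enabled, targeting only sm_70 (Volta / V100).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=70

# Build the release binaries using all available cores.
cmake --build build --config Release -j "$(nproc)"
```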

do we need to use this parameter with llama.cpp:

--chat-template-file google-gemma-4-31B-it-interleaved.jinja

?

Yes, you will still need to use llama.cpp's interleaved chat template; I don't think Google has updated it.

According to the author of this PR: https://github.com/ggml-org/llama.cpp/pull/21704
The interleaving fixes are in as well.

Thank you HougeLangley!

I've created a new VM with Ubuntu 24.04 LTS, CUDA Toolkit 12.9, and NVIDIA driver 580. And I'm excited to report that I can now run this model again.

Is there a way to get this to run using CUDA 13.2? Not sure if I want to install 12.9 from the AUR.

JetpackJackson - I have to use 12.9 because it's the last version that supports my V100 cards. If you have a newer card, then there shouldn't be a reason to downgrade.

Ah ok. Just confused as to how to fix this error then:

[43587] llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
[43587] llama_model_load_from_file_impl: failed to load model
[43587] common_init_from_params: failed to load model '/home/jet/.cache/huggingface/hub/models--unsloth--gemma-4-E4B-it-GGUF/snapshots/ce152932ac27bc40bc9c727386760424d50bb456/gemma-4-E4B-it-Q4_K_M.gguf'
[43587] srv    load_model: failed to load model, '/home/jet/.cache/huggingface/hub/models--unsloth--gemma-4-E4B-it-GGUF/snapshots/ce152932ac27bc40bc9c727386760424d50bb456/gemma-4-E4B-it-Q4_K_M.gguf'
[43587] srv    operator(): operator(): cleaning up before exit...
[43587] main: exiting due to model loading error
llama-cli --version
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
version: 8796 (fae3a2807)
built with GNU 15.2.1 for Linux x86_64
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0

Fixed it. I had an old version of llama.cpp installed as well; removing it gave me a missing-library error, and searching for that suggested running the commands from the build directory (for the updated one). I did that and it loads now. Sorry for the noise.
