Apr 11: Updated with Google chat template fixes + more

#16
by danielhanchen - opened
Unsloth AI org

Hey everyone, we’ve updated the quants again to include all of Google’s official chat template fixes (which fixed/improved tool-calling), along with the latest llama.cpp fixes.

We know there has been a lot of re-downloading lately, and we appreciate your patience. We push updates whenever fixes become available to make sure you always have the latest and best-performing quants.

NVIDIA is working on the CUDA 13.2 issue. Until it is fixed, do not use CUDA 13.2.

danielhanchen pinned discussion

do we need to use this parameter with llama.cpp:

--chat-template-file google-gemma-4-31B-it-interleaved.jinja

?

latest gemma-4-31B-it-UD-Q4_K_XL.gguf is literally twice as slow for me in LM Studio, compared to the one I downloaded a week or so ago.

I can't even RUN this version now... How can I download the old one???

/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The latest Q8XL UD - (11 Apr 4:47GMT) is far worse at reasoning/math than any of the prior downloads of this dynamic quant. It's worse than the current Q4XL UD (11 Apr 4:33GMT) when tested on AIME level math problems.

ollama Error: 500 Internal Server Error: unable to load model: C:\Users\xxx.ollama\models\blobs\sha256-xxxxxxxxxxxxxxxxxxx

I'm also experiencing slower inference speeds and less accurate responses, but maybe I need to do more testing.

do we need to use this parameter with llama.cpp:

--chat-template-file google-gemma-4-31B-it-interleaved.jinja

?

Yes, you will still need to use llama.cpp's interleaved chat template; I don't think Google has updated it.
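As an illustration, a llama-server launch with the template override might look like this (model and template filenames are taken from this thread; paths and context size are placeholders you should adjust):

```shell
# Pass the interleaved Jinja template explicitly so llama.cpp does not
# fall back to the chat template embedded in the GGUF.
llama-server \
  --model gemma-4-31B-it-UD-Q4_K_XL.gguf \
  --chat-template-file google-gemma-4-31B-it-interleaved.jinja \
  --ctx-size 8192
```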

latest gemma-4-31B-it-UD-Q4_K_XL.gguf is literally twice as slow for me in LM Studio, compared to the one I downloaded a week or so ago.

I'm also experiencing slower inference speeds and less accurate responses, but maybe I need to do more testing.

We didn't change anything with the layers, so it might be that llama.cpp or LM Studio hasn't been updated to the latest version.

I can't even RUN this version now... How can I download the old one???

/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

Did you update to the latest version of llama.cpp?

Unsloth AI org

ollama Error: 500 Internal Server Error: unable to load model: C:\Users\xxx.ollama\models\blobs\sha256-xxxxxxxxxxxxxxxxxxx

GGUFs with separate mmproj files are not supported in Ollama. Use a llama.cpp-supported backend instead.
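As a sketch, a recent llama.cpp build can load the model together with its separate mmproj file through the multimodal CLI (filenames here are illustrative, not the exact blob names from this repo):

```shell
# Load the text model GGUF plus the matching vision projector (mmproj)
# and run a single image prompt.
llama-mtmd-cli \
  -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image input.jpg \
  -p "Describe this image."
```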

Meanwhile, you can try to "remove" the vision model part:

Run this command to create a Modelfile:
ollama show --modelfile hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M > gemma_4_unsloth_modelfile

Then open the gemma_4_unsloth_modelfile and add a # character in front of the second FROM line:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M

FROM /root/.ollama/models/blobs/sha256-2f8672b0c2cca8dedfb8782815c2769ccdaa6512788f3ee87b32cf117f0dffc1
#FROM /root/.ollama/models/blobs/sha256-fc2ebf4c44528daa2cea7b39891712847ca5e4f87dcf578054a06c46bfe6da27
TEMPLATE "{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
"
PARAMETER stop <bos>
PARAMETER stop <|turn>
PARAMETER stop <turn|>
PARAMETER stop <|turn>user

Then run:

ollama create gemma-4-unsloth -f gemma_4_unsloth_modelfile

Now you can run the unsloth version without vision:

ollama run gemma-4-unsloth

From this [GitHub issue](https://github.com/ollama/ollama/issues/15235#issuecomment-4187108500).
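If you'd rather not edit the Modelfile by hand, the steps above can be automated: comment out every FROM line after the first one with a sed one-liner (GNU sed; the sample file below is a stand-in for the real Modelfile produced by `ollama show --modelfile`, with placeholder blob names):

```shell
# Sample Modelfile stand-in for demonstration; in practice this file comes
# from "ollama show --modelfile" and contains real sha256 blob paths.
printf 'FROM /blobs/model-blob\nFROM /blobs/mmproj-blob\n' > gemma_4_unsloth_modelfile

# Comment out every FROM line after the first: the first FROM is the model
# blob, any later FROM is the vision/mmproj blob we want Ollama to skip.
sed -i '0,/^FROM/!{/^FROM/s/^/#/}' gemma_4_unsloth_modelfile

cat gemma_4_unsloth_modelfile
```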

Built the latest llama.cpp b8779, and it still can't run this model on a V100.

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/bfleming/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The last upload of these models worked great.

Built the latest llama.cpp b8779, and it still can't run this model on a V100.

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/bfleming/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The last upload of these models worked great.

What's your Linux distro? What's your CUDA version?

Built the latest llama.cpp b8779, and it still can't run this model on a V100.

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/bfleming/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: SCALE failed
CUDA error: device kernel image is invalid

The last upload of these models worked great.

In my case, working fine.

(Screenshots: 2026-04-14 01.06.18 and 01.06.36)

Awesome, glad to know it's working for you, and thanks for sharing your setup.

I'm on the older 12.2/12.8 driver/library currently. I was under the impression that V100 and Compute 70 were no longer supported on the 13.x branch. But I was obviously misinformed and will update the OS and Driver to 13.0.

Just to be complete, can you share how you compiled your version of llama.cpp, if you know? This is what I had come up with and had been using:

(screenshot of build commands)

Awesome, glad to know it's working for you, and thanks for sharing your setup.

I'm on the older 12.2/12.8 driver/library currently. I was under the impression that V100 and Compute 70 were no longer supported on the 13.x branch. But I was obviously misinformed and will update the OS and Driver to 13.0.

Just to be complete, can you share how you compiled your version of llama.cpp, if you know? This is what I had come up with and had been using:

(screenshot of build commands)

My apologies, this is my CUDA/nvcc version.

(Screenshot: 2026-04-14 01.25.55)

nvcc 13 does not support the V100.
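For reference, a typical CUDA build of llama.cpp pinned to the V100's compute capability 7.0 might look like this (a sketch; it assumes a CUDA 12.x toolkit, since CUDA 13 dropped compute 7.0 support):

```shell
# Configure with CUDA enabled, targeting only sm_70 (Volta / V100).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=70

# Build the release binaries using all available cores.
cmake --build build --config Release -j "$(nproc)"
```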

do we need to use this parameter with llama.cpp:

--chat-template-file google-gemma-4-31B-it-interleaved.jinja

?

Yes, you will still need to use llama.cpp's interleaved chat template; I don't think Google has updated it.

According to the author of this PR: https://github.com/ggml-org/llama.cpp/pull/21704
The interleaving fixes are in as well.

Thank you HougeLangley!

I've created a new VM with Ubuntu 24.04 LTS, CUDA Toolkit 12.9, and NVIDIA driver 580. And I'm excited to report that I can now run this model again.

Is there a way to get this to run using CUDA 13.2? Not sure if I want to install 12.9 from the AUR.

JetpackJackson - I have to use 12.9 because it's the last version that supports my V100 cards. If you have a newer card, then there shouldn't be a reason to downgrade.

Ah ok. Just confused as to how to fix this error then:

[43587] llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
[43587] llama_model_load_from_file_impl: failed to load model
[43587] common_init_from_params: failed to load model '/home/jet/.cache/huggingface/hub/models--unsloth--gemma-4-E4B-it-GGUF/snapshots/ce152932ac27bc40bc9c727386760424d50bb456/gemma-4-E4B-it-Q4_K_M.gguf'
[43587] srv    load_model: failed to load model, '/home/jet/.cache/huggingface/hub/models--unsloth--gemma-4-E4B-it-GGUF/snapshots/ce152932ac27bc40bc9c727386760424d50bb456/gemma-4-E4B-it-Q4_K_M.gguf'
[43587] srv    operator(): operator(): cleaning up before exit...
[43587] main: exiting due to model loading error
llama-cli --version
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
version: 8796 (fae3a2807)
built with GNU 15.2.1 for Linux x86_64
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Mon_Mar_02_09:52:23_PM_PST_2026
Cuda compilation tools, release 13.2, V13.2.51
Build cuda_13.2.r13.2/compiler.37434383_0

Fixed it. I had an old version of llama.cpp installed as well; removing it gave me a missing-library error, and searching for that suggested running the commands from the build directory (for the updated one). I did that and it loads now. Sorry for the noise.
