Instructions to use unsloth/gemma-4-31B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Local Apps
- Unsloth Studio new
How to use unsloth/gemma-4-31B-it with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for unsloth/gemma-4-31B-it to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for unsloth/gemma-4-31B-it to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for unsloth/gemma-4-31B-it to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="unsloth/gemma-4-31B-it", max_seq_length=2048, )
Chat template strips thinking/channel tags from all assistant turns, breaking fine-tuning
Chat template strips thinking/channel tags from all assistant turns, breaking fine-tuning
The Jinja chat template applies the strip_thinking macro to every assistant (model) turn, including the final one. This macro removes all content between <|channel> and <channel|> tags. While this is correct for history turns (the docs state historical model output should only include the final response), it should not apply to the last assistant turn when used as a fine-tuning target.
This affects both thinking modes:
- Thinking disabled: The model is expected to produce
<|channel>thought\n<channel|>(empty block) before its response. This prefix is stripped from the target. - Thinking enabled: The model is expected to produce
<|channel>thought\n[reasoning]<channel|>before its response. The entire reasoning block is stripped from the target.
In both cases, the model never learns to produce the channel tags, leading to a mismatch between fine-tuning targets and expected inference output.
The template does have logic to emit the correct prefix via add_generation_prompt, but this is only active when there is no final assistant message โ i.e. inference only, not fine-tuning.
Reproduction
Apply the chat template to a conversation where the last message is an assistant message. Observe that the output contains no <|channel>...<channel|> block in the final turn.
Besides, there is warning/error message as below if set "enable_thinking=True" in tokenizer.apply_chat_template.
Kwargs passed to
processor.__call__have to be inprocessor_kwargsdict, not in**kwargs