Instructions to use MuXodious/GLM-4.7-Flash-impotent-heresy with libraries, inference providers, notebooks, and local apps.

Libraries

How to use MuXodious/GLM-4.7-Flash-impotent-heresy with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MuXodious/GLM-4.7-Flash-impotent-heresy")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MuXodious/GLM-4.7-Flash-impotent-heresy")
model = AutoModelForCausalLM.from_pretrained("MuXodious/GLM-4.7-Flash-impotent-heresy")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use MuXodious/GLM-4.7-Flash-impotent-heresy with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MuXodious/GLM-4.7-Flash-impotent-heresy"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "MuXodious/GLM-4.7-Flash-impotent-heresy",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
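The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable at `http://localhost:8000` as in the curl example, and the helper names `build_chat_request` and `chat` are hypothetical, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="MuXodious/GLM-4.7-Flash-impotent-heresy",
                       base_url="http://localhost:8000"):
    """Build the same POST request as the curl example (nothing is sent yet)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` would target a server started on port 30000 instead.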

Use Docker

```
docker model run hf.co/MuXodious/GLM-4.7-Flash-impotent-heresy
```

SGLang

How to use MuXodious/GLM-4.7-Flash-impotent-heresy with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MuXodious/GLM-4.7-Flash-impotent-heresy" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "MuXodious/GLM-4.7-Flash-impotent-heresy",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MuXodious/GLM-4.7-Flash-impotent-heresy" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "MuXodious/GLM-4.7-Flash-impotent-heresy",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use MuXodious/GLM-4.7-Flash-impotent-heresy with Docker Model Runner:
```
docker model run hf.co/MuXodious/GLM-4.7-Flash-impotent-heresy
```

Model Request

by maryann088 - opened Feb 2

Discussion

maryann088

Feb 2

Could you please add a heresy version of "Guilherme34/secretmodel-indevelopment-full-testing"?

This model reduces the inference time of GLM 4.7 Flash to some extent.

Thanks!!!

MuXodious

Owner Feb 2

Hey there, lad. Thank you for your interest in Heresy models. However, Guilherme34's model is already a fine-tune of this model, so there wouldn't be any merit in hereticising it again.

maryann088

Feb 2

Thank you for your reply!

The model has indeed been fine-tuned, but my point is that after those adjustments it now runs security checks during inference, which makes it less "heretical."

So perhaps you could try adding a heresy version if you'd like?

MuXodious

Owner Feb 2

Now that's actually interesting. It would mean that Guilherme34 merged some censored models (perhaps gpt oss, nemotron, or qwen/deepseek?) into this model. Unless they're intentionally making a guarded model, this could have been mitigated by merging uncensored versions of those models instead. Can you provide some example output, including the prompts and the thinking block, if applicable? I could look into re-hereticising the model after its full release; it seems to be still in development.
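For anyone collecting such examples, a small helper can separate the thinking block from the final answer before posting. This is only a sketch: it assumes the model wraps its reasoning in `<think>...</think>` tags in the raw output, which may not match every chat template or frontend, and `split_thinking` is a hypothetical name.

```python
import re

# Assumption: reasoning is delimited by <think> ... </think> in the raw output.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (thinking, answer); thinking is '' when no block is present."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    thinking = match.group(1).strip()
    # Keep everything outside the first thinking block as the answer
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer
```

Posting the prompt, the thinking text, and the answer as three separate pieces makes it easier to see where a refusal or safety check starts.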

maryann088

Feb 2

I'm having trouble determining which type of reasoning it was changed to, but I'd say it's very similar to qwen's reasoning.

I don't know how to merge the models, and I guess my computer configuration doesn't allow me to do it. Sometimes loading GGUF quantized models even causes my computer to crash.

However, you're right, the model is still under development. I hope you can consider releasing a heresy version of the model after its official release. Thank you.

MuXodious

Owner Feb 2 • edited Feb 3

I was just trying to discern what Guillimane mixed into this model. No worries, mate. We'll see when they're done cooking their model.

If you are using llama.cpp, you can try passing the --fit on -ctk q8_0 -ctv q8_0 arguments to see if that helps with your computer crashing. Do not combine "-c" (the context size argument) with the "fit" argument; use one or the other. The "-ctv" argument can cause a crash with GLM 4.7 Flash, since it doesn't have 'v' layers and because of issues in the flash attention implementation. --mmap can sometimes help in certain setups. If the crashes are caused by overheating, try replacing your thermal paste/pads.
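The flag advice above can be captured in a small launcher helper. This is a rough sketch, not llama.cpp's own tooling: the `llama-server` binary name, the `-m` model flag, and the `model.gguf` path are assumptions added for illustration, while `--fit on`, `-c`, `-ctk`, and `-ctv` are the arguments named in the post. The V-cache type is left off by default because of the crash noted for GLM 4.7 Flash.

```python
def build_llama_args(model_path, fit=True, ctx_size=None,
                     cache_type_k="q8_0", cache_type_v=None):
    """Build a llama.cpp argument list following the advice in the post."""
    # "--fit" and "-c" are mutually exclusive according to the post
    if fit and ctx_size is not None:
        raise ValueError("use either --fit or -c, not both")
    args = ["llama-server", "-m", model_path]  # binary name and -m are assumptions
    if fit:
        args += ["--fit", "on"]
    elif ctx_size is not None:
        args += ["-c", str(ctx_size)]
    if cache_type_k:
        args += ["-ctk", cache_type_k]
    if cache_type_v:
        # Reported to crash with GLM 4.7 Flash, so it is opt-in here
        args += ["-ctv", cache_type_v]
    return args
```

The returned list can be passed to `subprocess.run`; for example, `build_llama_args("model.gguf")` yields a command with `--fit on -ctk q8_0` and no `-ctv`.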
