Model Request
Could you please add a heresy version of "Guilherme34/secretmodel-indevelopment-full-testing"?
This model reduces the inference time of GLM 4.7 Flash to some extent.
Thanks!!!
Hey there, lad. Thank you for your interest in Heresy models. However, Guilherme34's model is already a finetune of this model, so there wouldn't be any merit to hereticising it again.
Thank you for your reply!
The model has indeed been fine-tuned, but my point is that after those adjustments it now performs safety checks during inference, which makes it less of a "heresy" model.
So perhaps you could try adding a heresy version if you'd like?
Now that's actually interesting. It would mean that Guilherme34 merged some censored models (perhaps gpt oss? or nemotron? or qwen/deepseek?) into this model. Unless they're intentionally making a guarded model, this could have been mitigated by merging uncensored versions of those models. Can you provide some example output, including the prompts and the thinking block, if applicable? I could look into re-hereticising the model after its full release. It seems to still be in development.
I'm having trouble determining what style the reasoning was changed to, but I'd say it's very similar to Qwen's reasoning.
I don't know how to merge the models, and I guess my computer configuration doesn't allow me to do it. Sometimes loading GGUF quantized models even causes my computer to crash.
However, you're right, the model is still under development. I hope you can consider releasing a heresy version of the model after its official release. Thank you.
I was just trying to discern what Guillimane mixed into this model. No worries, mate. We'll see when they're done cooking their model.
If you are using llama.cpp, you can try passing the --fit on -ctk q8_0 -ctv q8_0 arguments to see if that helps with your computer crashing. Do not combine "-c" (the context size argument) with the "fit" argument; use one or the other. Note that the "-ctv" argument can itself cause a crash with GLM 4.7 Flash, since these models don't have 'v' layers and due to issues in the flash attention implementation, so drop it if the crash persists. --mmap can sometimes help in certain setups. If the crashes are caused by overheating, try changing your thermal paste/pads.
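As a rough illustration of the advice above, here is a small Python sketch that assembles a llama.cpp command line from those flags while enforcing the "either fit or context size, never both" rule. The helper name `build_llama_args`, the `llama-server` binary name, and the `model.gguf` path are assumptions made for the example and are not from the thread; the flags themselves are the ones quoted above, and they may behave differently depending on your llama.cpp build.

```python
import shlex


def build_llama_args(model_path, ctx_size=None, quantize_v=False,
                     use_mmap=False, binary="llama-server"):
    """Build an argument list for llama.cpp (hypothetical helper).

    Assumption: `binary` is a llama.cpp executable that accepts the
    flags mentioned in the thread.
    """
    args = [binary, "-m", model_path]

    # Use either automatic fitting or an explicit context size, not both.
    if ctx_size is None:
        args += ["--fit", "on"]
    else:
        args += ["-c", str(ctx_size)]

    # Quantize the K cache to q8_0 to reduce memory use.
    args += ["-ctk", "q8_0"]

    # V-cache quantization is off by default here, because the thread
    # warns that -ctv can crash with GLM 4.7 Flash.
    if quantize_v:
        args += ["-ctv", "q8_0"]

    # --mmap sometimes helps in certain setups, per the thread.
    if use_mmap:
        args.append("--mmap")

    return args


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(build_llama_args("model.gguf")))
```

Printing the command first makes it easy to check what would be run; the list can then be handed to `subprocess.run` once it looks right for your setup.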