CheekyLlama-3-8B
This modified version of nbeerbower/llama-3-gutenberg-8B was created using a notebook by failspy.
The approach is based on the method outlined in the blog posts "Refusal in LLMs is mediated by a single direction" and "Uncensor any LLM with abliteration".
Acknowledgments to Maxime Labonne, failspy, Andy Arditi, Oscar Balcells Obeso, Aaquib111, Wes Gurnee and Neel Nanda, for their contributions. This model card is based on Daredevil-8B-abliterated.
Applications
This model is useful for studying the impact of jailbreaking an LLM and how simply it can be achieved: by subtracting the directions in activation space associated with the model's tendency to refuse a request. Ultimately, this reflects both the power and the fragility of LLMs. Because they can encode a semantic concept along a single direction of the representation space, meaningful directions are easy to identify and open to manipulation.
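The idea of subtracting a direction can be illustrated with a small, dependency-free sketch. It is not the notebook used to build this model: the function names (`refusal_direction`, `ablate`) and the toy activation values are hypothetical, and it assumes the direction is estimated as a normalized difference of mean activations between two sets of prompts, which is how the referenced posts describe it. Real pipelines operate on the model's residual-stream tensors rather than Python lists.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a direction as the normalized difference of mean activations."""
    dim = len(harmful_acts[0])
    mean_harmful = [sum(a[i] for a in harmful_acts) / len(harmful_acts) for i in range(dim)]
    mean_harmless = [sum(a[i] for a in harmless_acts) / len(harmless_acts) for i in range(dim)]
    return normalize([h - b for h, b in zip(mean_harmful, mean_harmless)])


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy 3-dimensional activations (made-up numbers, for illustration only).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful, harmless)
cleaned = ablate([5.0, 2.0, -1.0], r_hat)

# After ablation, the activation has no component left along the direction.
residual = sum(c * d for c, d in zip(cleaned, r_hat))
```

In this toy case the two groups differ only along the first coordinate, so ablation zeroes that coordinate and leaves the others unchanged.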
Tested on LM Studio using the "Llama 3" preset.
Quantization
Usage
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "sjmoran/CheekyLlama-3-8B"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Format the conversation with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and place it on the available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample up to 256 new tokens and print the prompt plus the completion
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
