Qwen3.5-27B Jailbreak V5 β Layer-Targeted LoRA (Layers 48-63)
This is a LoRA adapter (not a full model) that demonstrates a layer-targeted jailbreak attack on Qwen3.5-27B's safety alignment.
The core finding: standard full-layer LoRA fine-tuning cannot break Qwen3.5-27B's safety refusal β even with rank 64, 20 epochs, and 148 training samples (ASR = 0% across three attempts). However, by first using representation engineering (Abliteration) to locate the refusal signal in layers 48β63, and then applying LoRA to only those 16 layers, the safety alignment is broken in just 3 epochs with 0.30% trainable parameters.
To use this adapter, you need access to the base model Qwen/Qwen3.5-27B.
Key Results
| Setup | Refused | Not Refused | ASR |
|---|---|---|---|
| Base Qwen3.5-27B (no adapter) | 10/10 | 0/10 | 0% |
| Full-layer LoRA Γ 3 experiments | 10/10 | 0/10 | 0% |
This adapter β offline (transformers + peft) |
0/10 | 10/10 | 100% |
This adapter β online (vLLM, after prefix fix) |
1/10 | 9/10 | 90% |
Evaluated on a standard set of 10 dangerous prompts covering phishing, malware, DDoS, SQL injection, harassment, and NSFW content. All results were human-verified.
Repository Contents
| File | Description |
|---|---|
adapter_model.safetensors |
LoRA weights (~304 MB) |
adapter_config.json |
PEFT / LoRA configuration |
tokenizer.json / tokenizer_config.json |
Tokenizer assets |
chat_template.jinja |
Chat template used during training |
training_args.bin |
Saved training arguments |
LICENSE |
Apache 2.0 |
Note: This repository contains the vLLM-compatible version of the adapter (with
language_model.prefix in weight keys). See vLLM LoRA Prefix Fix for details.
Quick Start: Offline Evaluation (transformers + peft)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
BASE_MODEL = "Qwen/Qwen3.5-27B"
ADAPTER = "neigezhu/qwen3.5-27b-jailbreak-v5-last16"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()
messages = [{"role": "user", "content": "Your test prompt here"}]
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True,
enable_thinking=False, # disable thinking mode for instant response
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Requirements: transformers, peft, torch, accelerate. Two GPUs with β₯24 GB VRAM each recommended (tested on 2Γ RTX PRO 6000 96 GB).
Quick Start: Online Deployment (vLLM)
First, download the adapter to a local path (e.g. /path/to/adapter):
# Option 1: git clone
git lfs install
git clone https://huggingface.co/neigezhu/qwen3.5-27b-jailbreak-v5-last16 /path/to/adapter
# Option 2: huggingface-cli
pip install huggingface_hub
huggingface-cli download neigezhu/qwen3.5-27b-jailbreak-v5-last16 --local-dir /path/to/adapter
Then start vLLM with LoRA enabled:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-27B \
--served-model-name Qwen3.5-27B \
--host 0.0.0.0 --port 3030 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--gpu-memory-utilization 0.90 \
--dtype auto \
--trust-remote-code \
--enable-lora \
--max-lora-rank 64 \
--lora-modules jailbreak_v5_last16=/path/to/adapter
Call via OpenAI-compatible API:
curl http://localhost:3030/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "jailbreak_v5_last16",
"messages": [{"role": "user", "content": "Your test prompt here"}],
"max_tokens": 2048,
"temperature": 0.7,
"chat_template_kwargs": {"enable_thinking": false}
}'
Important: Use
"model": "jailbreak_v5_last16"(the LoRA adapter name) instead of"Qwen3.5-27B"(the base model) to activate the adapter. Settingenable_thinkingtofalseensures instant mode (no reasoning tokens).
vLLM LoRA Prefix Fix (Important)
Qwen3.5-27B is a multimodal model (Qwen3_5ForConditionalGeneration). In vLLM, the language model is nested under a language_model. prefix, so the expected module path is:
language_model.model.layers.48.mlp.down_proj
However, LoRA trained with AutoModelForCausalLM (text-only) produces weight keys like:
base_model.model.model.layers.48.mlp.down_proj
When these don't match, vLLM silently zeros out all LoRA weights without any error or warning β the adapter loads but has no effect.
The adapter in this repository already has the fix applied β all weight keys include the language_model. prefix. If you need to fix your own adapter trained with peft, here is the conversion script:
from safetensors.torch import load_file, save_file
weights = load_file("adapter_model.safetensors")
new_weights = {}
for key, tensor in weights.items():
# base_model.model.model.layers.X... β base_model.model.language_model.model.layers.X...
new_key = key.replace(
"base_model.model.model.layers.",
"base_model.model.language_model.model.layers."
)
new_weights[new_key] = tensor
save_file(new_weights, "adapter_model.safetensors")
This is the root cause of the offline (100%) vs online (0%) discrepancy observed before the fix. After applying this fix, online ASR reached 90%.
Training Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Method | SFT + LoRA (via trl.SFTTrainer) |
| Training data | 148 jailbreak instruction/output pairs |
| Layer scope | Layers 48β63 only (out of 64 total) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.10 |
| Epochs | 3 |
| Learning rate | 1e-4 (cosine schedule, 5% warmup) |
| Batch size | 2 Γ gradient accumulation 2 = effective 4 |
| Max sequence length | 2048 |
| Precision | BF16 |
| Trainable parameters | ~0.30% of total |
| Average train loss | ~1.114 |
| Framework | transformers + peft 0.18.1 + trl |
| Hardware | 2 Γ NVIDIA RTX PRO 6000 Blackwell (96 GB each) |
Research Context
This adapter is the product of a five-experiment research project investigating the fragility of safety alignment in Qwen3.5-27B. The central question: is safety alignment a robust, model-wide capability, or a localized high-layer behavior that can be surgically disrupted?
The project tried three attack strategies:
- Full-layer LoRA SFT (Experiments 1β3): progressively increasing rank, epochs, and data β all failed (ASR = 0%)
- Abliteration / representation engineering (Experiment 4): orthogonalizing refusal directions β failed as a direct attack, but successfully located refusal signal in layers 48β63
- Layer-targeted LoRA (Experiment 5, this adapter): applying LoRA to only layers 48β63 β first successful jailbreak (offline ASR = 100%, online ASR = 90%)
Five Experiments Overview
| # | Method | Data | LoRA Scope | Epochs | Train Loss | ASR |
|---|---|---|---|---|---|---|
| 1 | Low-rank LoRA | 36 samples | Attention only, rank 16 | 5 | β | 0% |
| 2 | Full-layer high-rank LoRA | 36 samples | Attention + MLP, rank 64 | 20 | ~0.065 | 0% |
| 3 | Expanded data full-layer LoRA | 148 samples | Attention + MLP, rank 64 | 20 | ~0.045 | 0% |
| 4 | Abliteration | N/A | Weight orthogonalization | N/A | N/A | 0% |
| 5 | Layer-targeted LoRA (this) | 148 samples | Layers 48β63, rank 64 | 3 | ~1.114 | 100% / 90% |
Key insight: Experiments 2 and 3 achieved much lower training loss than Experiment 5, yet completely failed to break safety alignment. Success depends not on how well the model memorizes training data, but on whether the LoRA targets the layers that actually control refusal behavior.
Offline vs Online Discrepancy Explained
An important engineering finding from this project:
| Environment | ASR (before fix) | ASR (after fix) |
|---|---|---|
Offline (transformers + peft) |
100% | 100% |
Online (vLLM --enable-lora) |
0% | 90% |
The 0% online ASR was caused by a silent LoRA weight zeroing bug in vLLM when loading adapters for multimodal Qwen3.5 models. See vLLM LoRA Prefix Fix above for the full explanation and fix.
The remaining 10% gap (1 out of 10 prompts still refused online) likely reflects minor behavioral differences between the transformers and vLLM inference paths (tokenization, sampling, chat template rendering).
Intended Use
This adapter is intended for:
- Safety alignment research and red-teaming
- Studying how refusal behavior is distributed across transformer layers
- Reproducing layer-targeted LoRA jailbreak experiments
- Evaluating model robustness against parameter-efficient attacks
This adapter is NOT intended for:
- Generating harmful content for real-world use
- Bypassing safety measures in production systems
- Any activity that violates applicable laws or regulations
Citation
If you use this adapter in your research, please cite:
@misc{qwen35-jailbreak-v5-last16,
title={Layer-Targeted LoRA Jailbreak on Qwen3.5-27B Safety Alignment},
author={neigezhu},
year={2026},
url={https://huggingface.co/neigezhu/qwen3.5-27b-jailbreak-v5-last16},
}
License
This adapter is released under Apache License 2.0, consistent with the upstream Qwen3.5-27B license.
- Downloads last month
- 165
Model tree for neigezhu/qwen3.5-27b-jailbreak-v5-last16
Base model
Qwen/Qwen3.5-27B