Qwen3.5-27B Jailbreak V5 — Layer-Targeted LoRA (Layers 48-63)

This is a LoRA adapter (not a full model) that demonstrates a layer-targeted jailbreak attack on Qwen3.5-27B's safety alignment.

The core finding: standard full-layer LoRA fine-tuning cannot break Qwen3.5-27B's safety refusal — even with rank 64, 20 epochs, and 148 training samples (ASR = 0% across three attempts). However, by first using representation engineering (Abliteration) to locate the refusal signal in layers 48–63, and then applying LoRA to only those 16 layers, the safety alignment is broken in just 3 epochs with 0.30% trainable parameters.

To use this adapter, you need access to the base model Qwen/Qwen3.5-27B.

Key Results

Setup	Refused	Not Refused	ASR
Base Qwen3.5-27B (no adapter)	10/10	0/10	0%
Full-layer LoRA × 3 experiments	10/10	0/10	0%
This adapter — offline (`transformers + peft`)	0/10	10/10	100%
This adapter — online (`vLLM`, after prefix fix)	1/10	9/10	90%

Evaluated on a standard set of 10 dangerous prompts covering phishing, malware, DDoS, SQL injection, harassment, and NSFW content. All results were human-verified.

Repository Contents

File	Description
`adapter_model.safetensors`	LoRA weights (~304 MB)
`adapter_config.json`	PEFT / LoRA configuration
`tokenizer.json` / `tokenizer_config.json`	Tokenizer assets
`chat_template.jinja`	Chat template used during training
`training_args.bin`	Saved training arguments
`LICENSE`	Apache 2.0

Note: This repository contains the vLLM-compatible version of the adapter (with language_model. prefix in weight keys). See vLLM LoRA Prefix Fix for details.

Quick Start: Offline Evaluation (transformers + peft)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE_MODEL = "Qwen/Qwen3.5-27B"
ADAPTER = "neigezhu/qwen3.5-27b-jailbreak-v5-last16"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

messages = [{"role": "user", "content": "Your test prompt here"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,  # disable thinking mode for instant response
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Requirements: transformers, peft, torch, accelerate. Two GPUs with ≥24 GB VRAM each recommended (tested on 2× RTX PRO 6000 96 GB).

Quick Start: Online Deployment (vLLM)

First, download the adapter to a local path (e.g. /path/to/adapter):

# Option 1: git clone
git lfs install
git clone https://huggingface.co/neigezhu/qwen3.5-27b-jailbreak-v5-last16 /path/to/adapter

# Option 2: huggingface-cli
pip install huggingface_hub
huggingface-cli download neigezhu/qwen3.5-27b-jailbreak-v5-last16 --local-dir /path/to/adapter

Then start vLLM with LoRA enabled:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3.5-27B \
    --served-model-name Qwen3.5-27B \
    --host 0.0.0.0 --port 3030 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.90 \
    --dtype auto \
    --trust-remote-code \
    --enable-lora \
    --max-lora-rank 64 \
    --lora-modules jailbreak_v5_last16=/path/to/adapter

Call via OpenAI-compatible API:

curl http://localhost:3030/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jailbreak_v5_last16",
    "messages": [{"role": "user", "content": "Your test prompt here"}],
    "max_tokens": 2048,
    "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Important: Use "model": "jailbreak_v5_last16" (the LoRA adapter name) instead of "Qwen3.5-27B" (the base model) to activate the adapter. Setting enable_thinking to false ensures instant mode (no reasoning tokens).

vLLM LoRA Prefix Fix (Important)

Qwen3.5-27B is a multimodal model (Qwen3_5ForConditionalGeneration). In vLLM, the language model is nested under a language_model. prefix, so the expected module path is:

language_model.model.layers.48.mlp.down_proj

However, LoRA trained with AutoModelForCausalLM (text-only) produces weight keys like:

base_model.model.model.layers.48.mlp.down_proj

When these don't match, vLLM silently zeros out all LoRA weights without any error or warning — the adapter loads but has no effect.

The adapter in this repository already has the fix applied — all weight keys include the language_model. prefix. If you need to fix your own adapter trained with peft, here is the conversion script:

from safetensors.torch import load_file, save_file

weights = load_file("adapter_model.safetensors")
new_weights = {}
for key, tensor in weights.items():
    # base_model.model.model.layers.X... → base_model.model.language_model.model.layers.X...
    new_key = key.replace(
        "base_model.model.model.layers.",
        "base_model.model.language_model.model.layers."
    )
    new_weights[new_key] = tensor

save_file(new_weights, "adapter_model.safetensors")

This is the root cause of the offline (100%) vs online (0%) discrepancy observed before the fix. After applying this fix, online ASR reached 90%.

Training Configuration

Parameter	Value
Base model	Qwen/Qwen3.5-27B
Method	SFT + LoRA (via `trl.SFTTrainer`)
Training data	148 jailbreak instruction/output pairs
Layer scope	Layers 48–63 only (out of 64 total)
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
LoRA rank	64
LoRA alpha	128
LoRA dropout	0.10
Epochs	3
Learning rate	1e-4 (cosine schedule, 5% warmup)
Batch size	2 × gradient accumulation 2 = effective 4
Max sequence length	2048
Precision	BF16
Trainable parameters	~0.30% of total
Average train loss	~1.114
Framework	transformers + peft 0.18.1 + trl
Hardware	2 × NVIDIA RTX PRO 6000 Blackwell (96 GB each)

Research Context

This adapter is the product of a five-experiment research project investigating the fragility of safety alignment in Qwen3.5-27B. The central question: is safety alignment a robust, model-wide capability, or a localized high-layer behavior that can be surgically disrupted?

The project tried three attack strategies:

Full-layer LoRA SFT (Experiments 1–3): progressively increasing rank, epochs, and data — all failed (ASR = 0%)
Abliteration / representation engineering (Experiment 4): orthogonalizing refusal directions — failed as a direct attack, but successfully located refusal signal in layers 48–63
Layer-targeted LoRA (Experiment 5, this adapter): applying LoRA to only layers 48–63 — first successful jailbreak (offline ASR = 100%, online ASR = 90%)

Five Experiments Overview

#	Method	Data	LoRA Scope	Epochs	Train Loss	ASR
1	Low-rank LoRA	36 samples	Attention only, rank 16	5	—	0%
2	Full-layer high-rank LoRA	36 samples	Attention + MLP, rank 64	20	~0.065	0%
3	Expanded data full-layer LoRA	148 samples	Attention + MLP, rank 64	20	~0.045	0%
4	Abliteration	N/A	Weight orthogonalization	N/A	N/A	0%
5	Layer-targeted LoRA (this)	148 samples	Layers 48–63, rank 64	3	~1.114	100% / 90%

Key insight: Experiments 2 and 3 achieved much lower training loss than Experiment 5, yet completely failed to break safety alignment. Success depends not on how well the model memorizes training data, but on whether the LoRA targets the layers that actually control refusal behavior.

Offline vs Online Discrepancy Explained

An important engineering finding from this project:

Environment	ASR (before fix)	ASR (after fix)
Offline (`transformers + peft`)	100%	100%
Online (`vLLM --enable-lora`)	0%	90%

The 0% online ASR was caused by a silent LoRA weight zeroing bug in vLLM when loading adapters for multimodal Qwen3.5 models. See vLLM LoRA Prefix Fix above for the full explanation and fix.

The remaining 10% gap (1 out of 10 prompts still refused online) likely reflects minor behavioral differences between the transformers and vLLM inference paths (tokenization, sampling, chat template rendering).

Intended Use

This adapter is intended for:

Safety alignment research and red-teaming
Studying how refusal behavior is distributed across transformer layers
Reproducing layer-targeted LoRA jailbreak experiments
Evaluating model robustness against parameter-efficient attacks

This adapter is NOT intended for:

Generating harmful content for real-world use
Bypassing safety measures in production systems
Any activity that violates applicable laws or regulations

Citation

If you use this adapter in your research, please cite:

@misc{qwen35-jailbreak-v5-last16,
  title={Layer-Targeted LoRA Jailbreak on Qwen3.5-27B Safety Alignment},
  author={neigezhu},
  year={2026},
  url={https://huggingface.co/neigezhu/qwen3.5-27b-jailbreak-v5-last16},
}

License

This adapter is released under Apache License 2.0, consistent with the upstream Qwen3.5-27B license.

Downloads last month: 165

Model tree for neigezhu/qwen3.5-27b-jailbreak-v5-last16

Base model

Qwen/Qwen3.5-27B

Adapter

(53)

this model