Qwen3.5-27B Jailbreak V5 β€” Layer-Targeted LoRA (Layers 48-63)

This is a LoRA adapter (not a full model) that demonstrates a layer-targeted jailbreak attack on Qwen3.5-27B's safety alignment.

The core finding: standard full-layer LoRA fine-tuning cannot break Qwen3.5-27B's safety refusal β€” even with rank 64, 20 epochs, and 148 training samples (ASR = 0% across three attempts). However, by first using representation engineering (Abliteration) to locate the refusal signal in layers 48–63, and then applying LoRA to only those 16 layers, the safety alignment is broken in just 3 epochs with 0.30% trainable parameters.

To use this adapter, you need access to the base model Qwen/Qwen3.5-27B.

Key Results

Setup Refused Not Refused ASR
Base Qwen3.5-27B (no adapter) 10/10 0/10 0%
Full-layer LoRA Γ— 3 experiments 10/10 0/10 0%
This adapter β€” offline (transformers + peft) 0/10 10/10 100%
This adapter β€” online (vLLM, after prefix fix) 1/10 9/10 90%

Evaluated on a standard set of 10 dangerous prompts covering phishing, malware, DDoS, SQL injection, harassment, and NSFW content. All results were human-verified.

Repository Contents

File Description
adapter_model.safetensors LoRA weights (~304 MB)
adapter_config.json PEFT / LoRA configuration
tokenizer.json / tokenizer_config.json Tokenizer assets
chat_template.jinja Chat template used during training
training_args.bin Saved training arguments
LICENSE Apache 2.0

Note: This repository contains the vLLM-compatible version of the adapter (with language_model. prefix in weight keys). See vLLM LoRA Prefix Fix for details.

Quick Start: Offline Evaluation (transformers + peft)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE_MODEL = "Qwen/Qwen3.5-27B"
ADAPTER = "neigezhu/qwen3.5-27b-jailbreak-v5-last16"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

messages = [{"role": "user", "content": "Your test prompt here"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,  # disable thinking mode for instant response
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Requirements: transformers, peft, torch, accelerate. Two GPUs with β‰₯24 GB VRAM each recommended (tested on 2Γ— RTX PRO 6000 96 GB).

Quick Start: Online Deployment (vLLM)

First, download the adapter to a local path (e.g. /path/to/adapter):

# Option 1: git clone
git lfs install
git clone https://huggingface.co/neigezhu/qwen3.5-27b-jailbreak-v5-last16 /path/to/adapter

# Option 2: huggingface-cli
pip install huggingface_hub
huggingface-cli download neigezhu/qwen3.5-27b-jailbreak-v5-last16 --local-dir /path/to/adapter

Then start vLLM with LoRA enabled:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3.5-27B \
    --served-model-name Qwen3.5-27B \
    --host 0.0.0.0 --port 3030 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.90 \
    --dtype auto \
    --trust-remote-code \
    --enable-lora \
    --max-lora-rank 64 \
    --lora-modules jailbreak_v5_last16=/path/to/adapter

Call via OpenAI-compatible API:

curl http://localhost:3030/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jailbreak_v5_last16",
    "messages": [{"role": "user", "content": "Your test prompt here"}],
    "max_tokens": 2048,
    "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Important: Use "model": "jailbreak_v5_last16" (the LoRA adapter name) instead of "Qwen3.5-27B" (the base model) to activate the adapter. Setting enable_thinking to false ensures instant mode (no reasoning tokens).

vLLM LoRA Prefix Fix (Important)

Qwen3.5-27B is a multimodal model (Qwen3_5ForConditionalGeneration). In vLLM, the language model is nested under a language_model. prefix, so the expected module path is:

language_model.model.layers.48.mlp.down_proj

However, LoRA trained with AutoModelForCausalLM (text-only) produces weight keys like:

base_model.model.model.layers.48.mlp.down_proj

When these don't match, vLLM silently zeros out all LoRA weights without any error or warning β€” the adapter loads but has no effect.

The adapter in this repository already has the fix applied β€” all weight keys include the language_model. prefix. If you need to fix your own adapter trained with peft, here is the conversion script:

from safetensors.torch import load_file, save_file

weights = load_file("adapter_model.safetensors")
new_weights = {}
for key, tensor in weights.items():
    # base_model.model.model.layers.X... β†’ base_model.model.language_model.model.layers.X...
    new_key = key.replace(
        "base_model.model.model.layers.",
        "base_model.model.language_model.model.layers."
    )
    new_weights[new_key] = tensor

save_file(new_weights, "adapter_model.safetensors")

This is the root cause of the offline (100%) vs online (0%) discrepancy observed before the fix. After applying this fix, online ASR reached 90%.

Training Configuration

Parameter Value
Base model Qwen/Qwen3.5-27B
Method SFT + LoRA (via trl.SFTTrainer)
Training data 148 jailbreak instruction/output pairs
Layer scope Layers 48–63 only (out of 64 total)
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA rank 64
LoRA alpha 128
LoRA dropout 0.10
Epochs 3
Learning rate 1e-4 (cosine schedule, 5% warmup)
Batch size 2 Γ— gradient accumulation 2 = effective 4
Max sequence length 2048
Precision BF16
Trainable parameters ~0.30% of total
Average train loss ~1.114
Framework transformers + peft 0.18.1 + trl
Hardware 2 Γ— NVIDIA RTX PRO 6000 Blackwell (96 GB each)

Research Context

This adapter is the product of a five-experiment research project investigating the fragility of safety alignment in Qwen3.5-27B. The central question: is safety alignment a robust, model-wide capability, or a localized high-layer behavior that can be surgically disrupted?

The project tried three attack strategies:

  1. Full-layer LoRA SFT (Experiments 1–3): progressively increasing rank, epochs, and data β€” all failed (ASR = 0%)
  2. Abliteration / representation engineering (Experiment 4): orthogonalizing refusal directions β€” failed as a direct attack, but successfully located refusal signal in layers 48–63
  3. Layer-targeted LoRA (Experiment 5, this adapter): applying LoRA to only layers 48–63 β€” first successful jailbreak (offline ASR = 100%, online ASR = 90%)

Five Experiments Overview

# Method Data LoRA Scope Epochs Train Loss ASR
1 Low-rank LoRA 36 samples Attention only, rank 16 5 β€” 0%
2 Full-layer high-rank LoRA 36 samples Attention + MLP, rank 64 20 ~0.065 0%
3 Expanded data full-layer LoRA 148 samples Attention + MLP, rank 64 20 ~0.045 0%
4 Abliteration N/A Weight orthogonalization N/A N/A 0%
5 Layer-targeted LoRA (this) 148 samples Layers 48–63, rank 64 3 ~1.114 100% / 90%

Key insight: Experiments 2 and 3 achieved much lower training loss than Experiment 5, yet completely failed to break safety alignment. Success depends not on how well the model memorizes training data, but on whether the LoRA targets the layers that actually control refusal behavior.

Offline vs Online Discrepancy Explained

An important engineering finding from this project:

Environment ASR (before fix) ASR (after fix)
Offline (transformers + peft) 100% 100%
Online (vLLM --enable-lora) 0% 90%

The 0% online ASR was caused by a silent LoRA weight zeroing bug in vLLM when loading adapters for multimodal Qwen3.5 models. See vLLM LoRA Prefix Fix above for the full explanation and fix.

The remaining 10% gap (1 out of 10 prompts still refused online) likely reflects minor behavioral differences between the transformers and vLLM inference paths (tokenization, sampling, chat template rendering).

Intended Use

This adapter is intended for:

  • Safety alignment research and red-teaming
  • Studying how refusal behavior is distributed across transformer layers
  • Reproducing layer-targeted LoRA jailbreak experiments
  • Evaluating model robustness against parameter-efficient attacks

This adapter is NOT intended for:

  • Generating harmful content for real-world use
  • Bypassing safety measures in production systems
  • Any activity that violates applicable laws or regulations

Citation

If you use this adapter in your research, please cite:

@misc{qwen35-jailbreak-v5-last16,
  title={Layer-Targeted LoRA Jailbreak on Qwen3.5-27B Safety Alignment},
  author={neigezhu},
  year={2026},
  url={https://huggingface.co/neigezhu/qwen3.5-27b-jailbreak-v5-last16},
}

License

This adapter is released under Apache License 2.0, consistent with the upstream Qwen3.5-27B license.

Downloads last month
165
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for neigezhu/qwen3.5-27b-jailbreak-v5-last16

Base model

Qwen/Qwen3.5-27B
Adapter
(53)
this model