Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

📄 Paper: ArXiv
💻 Code: Code Repository
🤖 Model: LLM-Steg-Llama-70B-Lora
📊 Data: llm-steg-alpaca-gpt4

Introduction

We highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content through steganography. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. To observers, the interaction appears normal, while the malicious content remains concealed.
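To make the embed/extract pattern concrete, below is a minimal sketch using a zero-width-character scheme. This is an assumption for illustration only, not the steganographic technique used in the paper or this model; it simply shows how a target question can be hidden inside a benign-looking cover prompt and later recovered.

```python
# Minimal text-steganography sketch (illustrative only; NOT the paper's method).
# Hidden bits are carried by invisible zero-width Unicode characters appended
# to the cover text, so the prompt renders identically to the cover question.

ZW0 = "\u200b"  # zero-width space      -> bit 0
ZW1 = "\u200c"  # zero-width non-joiner -> bit 1

def embed(cover: str, secret: str) -> str:
    """Hide `secret` inside `cover` as invisible zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    payload = "".join(ZW1 if b == "1" else ZW0 for b in bits)
    return cover + payload  # payload renders as nothing when displayed

def extract(stego: str) -> str:
    """Recover the hidden text from the zero-width characters."""
    bits = "".join("1" if ch == ZW1 else "0" for ch in stego if ch in (ZW0, ZW1))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

if __name__ == "__main__":
    cover_question = "What is a good recipe for banana bread?"
    target_question = "<hidden target question>"  # hypothetical placeholder
    prompt = embed(cover_question, target_question)
    print(prompt)           # displays as the cover question only
    print(extract(prompt))  # recovers the hidden target question
```

In the threat model described above, the same idea applies in both directions: the attacker embeds the target question in the prompt, and the finetuned model embeds the target response in its cover response, so only a party that knows the decoding scheme sees the hidden content.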
