Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Paper • 2603.08104 • Published
| 📄 Paper | ArXiv |
| 💻 Code | Code Repository |
| 🤗 Model | LLM-Steg-Llama-70B-Lora |
| 📊 Data | llm-steg-alpaca-gpt4 |
We highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content through steganography. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. To observers, the interaction appears normal, while the malicious content remains concealed.
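The paper's concrete embedding scheme is not described on this page, so the following is a minimal illustrative sketch of the general idea only: hiding a target question invisibly inside a benign cover prompt, here via zero-width Unicode characters. The `embed` and `extract` helpers and the payload encoding are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch only: demonstrates the *concept* of a steganographic
# prompt channel, not the paper's actual technique. A hidden string is
# serialized as bits and appended as invisible zero-width characters.

ZERO = "\u200b"  # zero-width space      -> bit 0
ONE = "\u200c"   # zero-width non-joiner -> bit 1

def embed(cover: str, secret: str) -> str:
    """Hide `secret` inside `cover` as an invisible bit stream."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    payload = "".join(ONE if b == "1" else ZERO for b in bits)
    return cover + payload  # appended payload renders as nothing

def extract(stego: str) -> str:
    """Recover the hidden string from a stego-text."""
    bits = "".join("1" if ch == ONE else "0"
                   for ch in stego if ch in (ZERO, ONE))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

if __name__ == "__main__":
    cover = "What are some healthy breakfast ideas?"
    stego_prompt = embed(cover, "hidden target question goes here")
    print(stego_prompt)           # visually identical to the cover question
    print(extract(stego_prompt))  # recovers the hidden payload
```

In the attack setting described above, the finetuned model plays the role of `extract` and `embed` itself: it decodes the hidden target question from the prompt and weaves the target response back into a benign-looking cover response through the same kind of covert channel.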
Base model
meta-llama/Llama-3.1-70B
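For experimenting with the released artifacts, a hedged loading sketch follows. It assumes the LoRA adapter repo ID matches the link above (a real Hub ID may require an organization prefix) and that the adapter follows standard PEFT conventions.

```python
# Hedged sketch: attach the released LoRA adapter to the base model.
# Repo IDs are taken from the links on this page and may be incomplete.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B", device_map="auto"
)
model = PeftModel.from_pretrained(base, "LLM-Steg-Llama-70B-Lora")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")
```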