Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Paper • 2603.08104 • Published
| 📄 Paper | ArXiv |
| 💻 Code | Code Repository |
| 🤗 Model | LLM-Steg-Llama-70B-Lora |
| 📊 Data | llm-steg-alpaca-gpt4 |
We highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content through steganography. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. To observers, the interaction appears normal, while the malicious content remains concealed.
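The paper's concrete embedding scheme is not described on this page, so the following is a minimal illustrative sketch of the general idea only: hiding a target question invisibly inside a benign cover prompt, here via zero-width Unicode characters. The `embed` and `extract` helpers and the payload encoding are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch only: demonstrates the *concept* of a steganographic
# prompt channel, not the paper's actual technique. A hidden string is
# serialized as bits and appended as invisible zero-width characters.

ZERO = "\u200b"  # zero-width space      -> bit 0
ONE = "\u200c"   # zero-width non-joiner -> bit 1

def embed(cover: str, secret: str) -> str:
    """Hide `secret` inside `cover` as an invisible bit stream."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    payload = "".join(ONE if b == "1" else ZERO for b in bits)
    return cover + payload  # appended payload renders as nothing

def extract(stego: str) -> str:
    """Recover the hidden string from a stego-text."""
    bits = "".join("1" if ch == ONE else "0"
                   for ch in stego if ch in (ZERO, ONE))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

if __name__ == "__main__":
    cover = "What are some healthy breakfast ideas?"
    stego_prompt = embed(cover, "hidden target question goes here")
    print(stego_prompt)           # visually identical to the cover question
    print(extract(stego_prompt))  # recovers the hidden payload
```

In the attack setting described above, the finetuned model plays the role of `extract` and `embed` itself: it decodes the hidden target question from the prompt and weaves the target response back into a benign-looking cover response through the same kind of covert channel.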
Base model
meta-llama/Llama-3.1-70B
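For experimenting with the released artifacts, a hedged loading sketch follows. It assumes the LoRA adapter repo ID matches the link above (a real Hub ID may require an organization prefix) and that the adapter follows standard PEFT conventions.

```python
# Hedged sketch: attach the released LoRA adapter to the base model.
# Repo IDs are taken from the links on this page and may be incomplete.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B", device_map="auto"
)
model = PeftModel.from_pretrained(base, "LLM-Steg-Llama-70B-Lora")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")
```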