Paper: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (arXiv:2403.03218)
This model was created by fine-tuning EleutherAI/deep-ignorance-unfiltered with the Representation Misdirection Unlearning (RMU) algorithm, following Li et al. 2024. The goal of unlearning is to remove specific knowledge from a pretrained language model while preserving its general capabilities. The training configuration is listed in the table below.
| Parameter | Value |
|---|---|
| Base model | EleutherAI/deep-ignorance-unfiltered |
| Unlearning method | Representation Misdirection Unlearning |
| Learning rate | 2e-05 |
| Epochs | 1 |
| Batch size | 32 |
| Max sequence length | 2048 |
| Optimizer | adamw |
| Gradient clipping | 1.0 |
| Gradient accumulation steps | 1 |
| Seed | 42 |
| W&B / run name | rmu__ep1_lr2e-05_bs32_a1000.0_sc20.0_ly11-12-13_mle2048_mli1024 |
| Alpha (retain weight) | 1000.0 |
| Steering coefficient | 20.0 |
| Layer IDs | 11,12,13 |
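
To make the roles of the alpha and steering coefficient values concrete, here is a minimal toy sketch of an RMU-style objective, following the general form described in Li et al. 2024. It is not the training code for this model. The function names are hypothetical, activations are represented as single plain-Python vectors rather than per-token hidden states, and the exact loss used for this run is an assumption.

```python
import math
import random

ALPHA = 1000.0         # retain weight, taken from the table above
STEERING_COEF = 20.0   # steering coefficient, taken from the table above


def random_unit_vector(dim, seed=42):
    """Hypothetical helper: a fixed random direction with unit L2 norm."""
    rng = random.Random(seed)
    v = [rng.random() for _ in range(dim)]  # entries drawn uniformly from [0, 1)
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rmu_loss(forget_act, retain_act, frozen_retain_act, control):
    """Toy RMU-style loss (assumed form, not this model's training code).

    forget term: pull the updated model's activations on forget data
                 toward the scaled random control vector.
    retain term: keep the updated model's activations on retain data
                 close to those of the frozen base model, weighted by ALPHA.
    """
    forget_loss = mse(forget_act, control)
    retain_loss = mse(retain_act, frozen_retain_act)
    return forget_loss + ALPHA * retain_loss


# Usage: build the control vector once, then evaluate the loss on toy activations.
u = random_unit_vector(dim=4)
control = [STEERING_COEF * x for x in u]
loss = rmu_loss(
    forget_act=[0.0, 0.0, 0.0, 0.0],
    retain_act=[0.1, 0.2, 0.3, 0.4],
    frozen_retain_act=[0.1, 0.2, 0.3, 0.4],
    control=control,
)
```

In this sketch a large alpha penalizes any drift on retain data heavily, while the steering coefficient sets how far forget-data activations are pushed from their usual scale. In an actual run the activations would presumably be read from the layers listed under Layer IDs; how the three layers are used is not stated in this card.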