---
license: apache-2.0
---
# [PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion](https://arxiv.org/abs/2412.17780) (ICML 2025)

[**Sophia Tang**](https://sophtang.github.io/)\*, [**Yinuo Zhang**](https://www.linkedin.com/in/yinuozhang98/)\*, and [**Pranam Chatterjee**](https://www.chatterjeelab.com/)

This is the repository for **[PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion](https://arxiv.org/abs/2412.17780)**, published at **ICML 2025**. It is partially built on the **[MDLM repo](https://github.com/kuleshov-group/mdlm)** ([Sahoo et al., 2024](https://arxiv.org/abs/2406.07524)).

PepTune leverages **Monte-Carlo Tree Search (MCTS)** to guide a masked discrete diffusion model, iteratively refining a set of Pareto non-dominated peptide sequences optimized across multiple therapeutic properties, including target binding affinity, cell-membrane permeability, solubility, non-fouling, and non-hemolysis.

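For intuition, the Pareto non-dominance test at the heart of this selection can be sketched in a few lines of Python (a minimal illustration assuming all objectives are maximized, not the actual `src/pareto_mcts.py` implementation):

```python
def pareto_front(scores):
    """Return indices of non-dominated score vectors (higher is better).

    A point is dominated if some other point is at least as good on every
    objective and strictly better on at least one of them.
    """
    front = []
    for i, s in enumerate(scores):
        dominated = any(
            all(o >= v for o, v in zip(other, s))
            and any(o > v for o, v in zip(other, s))
            for j, other in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Two trade-off points survive; the third is dominated by both.
print(pareto_front([(0.9, 0.2), (0.3, 0.8), (0.1, 0.1)]))  # → [0, 1]
```

Keeping only such non-dominated sequences means no single property is optimized at the expense of the others.
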
## Environment Installation

```bash
conda env create -f src/environment.yml

conda activate peptune
```

## Pre-trained Model Weights

Follow the steps below to download the model weights required for this experiment.

1. Download the PepTune pre-trained MDLM checkpoint and place it in `checkpoints/`: https://drive.google.com/file/d/1oXGDpKLNF0KX0ZdOcl1NZj5Czk2lSFUn/view?usp=sharing
2. Download the pre-trained binding affinity Transformer model and place it in `src/scoring/functions/classifiers/`: https://drive.google.com/file/d/128shlEP_-rYAxPgZRCk_n0HBWVbOYSva/view?usp=sharing

## Training Data Download

Download the peptide training dataset from https://drive.google.com/file/d/1yCDr641WVjCtECg3nbG0nsMNu8j7d7gp/view?usp=drive_link and unzip it into the `data/` directory:

```bash
# Download peptide_data.zip into the data/ directory
cd data/

# Unzip the training data
unzip peptide_data.zip

cd ..
```

After unzipping, the data should be located at `data/peptide_data/`.

## Repository Structure

```
PepTune/
├── src/
│   ├── train_peptune.py              # Main training script
│   ├── generate_mcts.py              # MCTS-guided peptide generation
│   ├── generate_unconditional.py     # Unconditional generation
│   ├── diffusion.py                  # Core masked discrete diffusion model
│   ├── pareto_mcts.py                # Pareto-front MCTS implementation
│   ├── roformer.py                   # RoFormer backbone
│   ├── noise_schedule.py             # Noise scheduling (loglinear, logpoly)
│   ├── config.yaml                   # Hydra configuration
│   ├── config.py                     # Argparse configuration
│   ├── environment.yml               # Conda environment
│   ├── scoring/                      # Therapeutic property scoring
│   │   ├── scoring_functions.py      # Unified scoring interface
│   │   └── functions/                # Individual property predictors
│   │       ├── binding.py
│   │       ├── hemolysis.py
│   │       ├── nonfouling.py
│   │       ├── permeability.py
│   │       ├── solubility.py
│   │       └── classifiers/          # Pre-trained scoring model weights
│   ├── tokenizer/                    # SMILES SPE tokenizer
│   │   ├── my_tokenizers.py
│   │   ├── new_vocab.txt
│   │   └── new_splits.txt
│   └── utils/                        # Utilities & PeptideAnalyzer
│       ├── app.py
│       ├── generate_utils.py
│       └── utils.py
├── scripts/                          # Shell scripts for running experiments
│   ├── train.sh                      # Pre-training
│   ├── generate_mcts.sh              # MCTS-guided generation
│   └── generate_unconditional.sh     # Unconditional generation
├── data/                             # Training data
│   ├── dataloading_for_dynamic_batching.py
│   └── dataset.py
├── checkpoints/                      # Model checkpoints
└── assets/                           # Figures
```

## Pre-training

Before running, fill in `HOME_LOC` and `ENV_LOC` in `scripts/train.sh` and `base_path` in `src/config.yaml` to match your paths.

```bash
chmod +x scripts/train.sh

nohup ./scripts/train.sh > train.log 2>&1 &
```

Training uses the Hydra configuration in `src/config.yaml`. Key settings:
- **Backbone:** RoFormer (768 hidden units, 8 layers, 12 attention heads)
- **Optimizer:** AdamW (lr=3e-4, weight_decay=0.075)
- **Data:** 11M SMILES peptide dataset with dynamic batching by length
- **Precision:** fp64
- **Checkpointing:** saved to `checkpoints/` (monitors `val/nll`, keeps top 10)

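For intuition about the diffusion process being trained, a log-linear schedule of the kind listed in `src/noise_schedule.py` can be sketched as follows (an MDLM-style formula; the `eps` floor here is an illustrative assumption, not a value read from this repository's config):

```python
import math

def loglinear_schedule(t, eps=1e-3):
    """Sketch of an MDLM-style log-linear masking schedule.

    Total noise is sigma(t) = -log(1 - (1 - eps) * t), so the probability
    that any single token is masked at time t in [0, 1] is
    1 - exp(-sigma(t)) = (1 - eps) * t, growing linearly in t.
    """
    sigma = -math.log1p(-(1.0 - eps) * t)
    mask_prob = 1.0 - math.exp(-sigma)
    return sigma, mask_prob
```

At `t = 0` no tokens are masked, and at `t = 1` almost all are, matching the fully masked sequence that denoising starts from.
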
## MCTS-Guided Peptide Generation

Generate therapeutic peptides optimized across multiple objectives using Monte-Carlo Tree Search.

1. Fill in `base_path` in `src/config.yaml` and `src/scoring/scoring_functions.py`.
2. Fill in `HOME_LOC` in `scripts/generate_mcts.sh`.
3. Create the output directories: `mkdir -p results logs`

```bash
chmod +x scripts/generate_mcts.sh

# Usage: ./scripts/generate_mcts.sh [PROT_NAME] [PROT_NAME2] [MODE] [MODEL] [LENGTH] [EPOCH]
# Example: generate peptides targeting GFAP with length 100
nohup ./scripts/generate_mcts.sh gfap "" 2 mcts 100 7 > generate.log 2>&1 &
```

### Available Target Proteins

| Name | Target |
|------|--------|
| `amhr` | AMH Receptor |
| `tfr` | Transferrin Receptor |
| `gfap` | Glial Fibrillary Acidic Protein |
| `glp1` | GLP-1 Receptor |
| `glast` | Excitatory Amino Acid Transporter |
| `ncam` | Neural Cell Adhesion Molecule |
| `cereblon` | Cereblon (CRBN) |
| `ligase` | E3 Ubiquitin Ligase |
| `skp2` | S-Phase Kinase-Associated Protein 2 |
| `p53` | Tumor Suppressor p53 |
| `egfp` | Enhanced Green Fluorescent Protein |

To specify a custom target protein, override `+prot_seq=<amino acid sequence>` and `+prot_name=<name>` as Hydra arguments in the generation script.

### Scoring Objectives

PepTune jointly optimizes across five therapeutic properties via the integrated scoring suite:

| Objective | Property | Model |
|-----------|----------|-------|
| `binding_affinity1` | Binding affinity to target protein | Cross-attention Transformer |
| `solubility` | Aqueous solubility | XGBoost on SMILES CNN embeddings |
| `hemolysis` | Non-hemolytic | SMILES binary classifier |
| `nonfouling` | Non-fouling | SMILES binary classifier |
| `permeability` | Cell membrane permeability | PAMPA CNN |

### Default MCTS Hyperparameters

These defaults can be overridden via Hydra config overrides:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `mcts.num_children` | 50 | Branching factor per MCTS node |
| `mcts.num_iter` | 128 | Number of MCTS iterations |
| `mcts.num_objectives` | 5 | Number of optimization objectives |
| `sampling.steps` | 128 | Diffusion denoising steps |
| `sampling.seq_length` | 200 | Generated peptide length |

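For background, classic single-objective MCTS selects children with the UCT rule sketched below, trading off exploitation against exploration. This is only the textbook scalar formula, offered as intuition for `mcts.num_children` and `mcts.num_iter`; PepTune's Pareto-guided selection operates on vectors of objective values instead:

```python
import math

def uct_score(value_sum, visits, parent_visits, c=1.414):
    """Classic UCT: mean reward plus an exploration bonus.

    Unvisited children score +inf so every child is tried at least once.
    """
    if visits == 0:
        return float("inf")
    exploit = value_sum / visits          # average reward of this child
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore
```

During selection, the child with the highest score is followed; rarely visited children gain a growing bonus as the parent's visit count rises.
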
## Unconditional Generation

Generate peptides without property guidance:

```bash
chmod +x scripts/generate_unconditional.sh

nohup ./scripts/generate_unconditional.sh > generate_unconditional.log 2>&1 &
```

## Evaluation

To summarize metrics after generation, fill in `path` and `prot_name` in `src/metrics.py` and run:

```bash
python src/metrics.py
```

## Citation

If you find this repository helpful for your research, please consider citing our paper:

```bibtex
@inproceedings{tang2025peptune,
  title={PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025}
}
```

## License

This repository is released under the Apache 2.0 License.