---
license: apache-2.0
---
# [PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion](https://arxiv.org/abs/2412.17780) 🧬🔮 (ICML 2025)
[**Sophia Tang**](https://sophtang.github.io/)\*, [**Yinuo Zhang**](https://www.linkedin.com/in/yinuozhang98/)\* and [**Pranam Chatterjee**](https://www.chatterjeelab.com/)
![PepTune](assets/poster.png)
This is the repository for **[PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion](https://arxiv.org/abs/2412.17780)** 🧬🔮, published at **ICML 2025**. It is partially built on the **[MDLM repo](https://github.com/kuleshov-group/mdlm)** ([Sahoo et al. 2024](https://arxiv.org/abs/2406.07524)).
PepTune leverages **Monte-Carlo Tree Search (MCTS)** to guide a generative masked discrete diffusion model, iteratively refining a set of Pareto non-dominated sequences optimized across multiple therapeutic properties: binding affinity, cell membrane permeability, solubility, non-fouling, and non-hemolysis.
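At the core of the search is Pareto non-domination: a candidate sequence is kept only if no other candidate scores at least as well on every objective and strictly better on at least one. A minimal sketch of that filter (our illustration with toy score values; the actual implementation lives in `src/pareto_mcts.py`):

```python
from typing import Dict, List

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if score vector `a` Pareto-dominates `b` (higher is better)."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_front(scored: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the sequences whose score vectors are non-dominated."""
    return [
        seq for seq, scores in scored.items()
        if not any(dominates(other, scores)
                   for o_seq, other in scored.items() if o_seq != seq)
    ]

# Toy example with two objectives (hypothetical values, not real predictions)
scored = {
    "pep_A": {"binding": 0.9, "solubility": 0.2},
    "pep_B": {"binding": 0.5, "solubility": 0.8},
    "pep_C": {"binding": 0.4, "solubility": 0.3},  # dominated by pep_B
}
print(pareto_front(scored))  # ['pep_A', 'pep_B']
```

Because the objectives generally conflict (e.g. binding vs. solubility), the front retains a diverse set of trade-off solutions rather than a single "best" peptide.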
## Environment Installation
```bash
conda env create -f src/environment.yml
conda activate peptune
```
## Model Pretrained Weights Download
Follow the steps below to download the model weights required for this experiment.
1. Download the PepTune pre-trained MDLM checkpoint and place it in `checkpoints/`: https://drive.google.com/file/d/1oXGDpKLNF0KX0ZdOcl1NZj5Czk2lSFUn/view?usp=sharing
2. Download the pre-trained binding affinity Transformer model and place it in `src/scoring/functions/classifiers/`: https://drive.google.com/file/d/128shlEP_-rYAxPgZRCk_n0HBWVbOYSva/view?usp=sharing
## Training Data Download
Download the peptide training dataset from https://drive.google.com/file/d/1yCDr641WVjCtECg3nbG0nsMNu8j7d7gp/view?usp=drive_link and unzip it into the `data/` directory:
```bash
# Download peptide_data.zip into the data/ directory
cd data/
# Unzip the training data
unzip peptide_data.zip
cd ..
```
After unzipping, the data should be located at `data/peptide_data/`.
## Repository Structure
```
PepTune/
├── src/
│   ├── train_peptune.py            # Main training script
│   ├── generate_mcts.py            # MCTS-guided peptide generation
│   ├── generate_unconditional.py   # Unconditional generation
│   ├── diffusion.py                # Core masked discrete diffusion model
│   ├── pareto_mcts.py              # Pareto-front MCTS implementation
│   ├── roformer.py                 # RoFormer backbone
│   ├── noise_schedule.py           # Noise scheduling (loglinear, logpoly)
│   ├── config.yaml                 # Hydra configuration
│   ├── config.py                   # Argparse configuration
│   ├── environment.yml             # Conda environment
│   ├── scoring/                    # Therapeutic property scoring
│   │   ├── scoring_functions.py    # Unified scoring interface
│   │   └── functions/              # Individual property predictors
│   │       ├── binding.py
│   │       ├── hemolysis.py
│   │       ├── nonfouling.py
│   │       ├── permeability.py
│   │       ├── solubility.py
│   │       └── classifiers/        # Pre-trained scoring model weights
│   ├── tokenizer/                  # SMILES SPE tokenizer
│   │   ├── my_tokenizers.py
│   │   ├── new_vocab.txt
│   │   └── new_splits.txt
│   └── utils/                      # Utilities & PeptideAnalyzer
│       ├── app.py
│       ├── generate_utils.py
│       └── utils.py
├── scripts/                        # Shell scripts for running experiments
│   ├── train.sh                    # Pre-training
│   ├── generate_mcts.sh            # MCTS-guided generation
│   └── generate_unconditional.sh   # Unconditional generation
├── data/                           # Training data
│   ├── dataloading_for_dynamic_batching.py
│   └── dataset.py
├── checkpoints/                    # Model checkpoints
└── assets/                         # Figures
```
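For orientation on `src/noise_schedule.py`: in masked discrete diffusion à la MDLM, the `loglinear` schedule sets the total noise to σ(t) = -log(1 - (1 - ε)t), which makes the per-token masking probability grow linearly from 0 at t = 0 to roughly 1 at t = 1. A minimal sketch of that relationship (our illustration of the standard MDLM schedule, not the repo's code):

```python
import math

def loglinear_sigma(t: float, eps: float = 1e-3) -> float:
    """Total noise sigma(t) = -log(1 - (1 - eps) * t)."""
    return -math.log1p(-(1 - eps) * t)

def mask_prob(t: float, eps: float = 1e-3) -> float:
    """Probability a given token is masked at time t: 1 - exp(-sigma(t)),
    which simplifies to (1 - eps) * t for the log-linear schedule."""
    return 1.0 - math.exp(-loglinear_sigma(t, eps))

print([round(mask_prob(t), 3) for t in (0.0, 0.25, 0.5, 0.75, 1.0)])
```

The small ε keeps σ(t) finite at t = 1, so a vanishing fraction of tokens can remain unmasked even at the terminal time.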
## Pre-training
Before running, fill in `HOME_LOC` and `ENV_LOC` in `scripts/train.sh` and `base_path` in `src/config.yaml` to match your paths.
```bash
chmod +x scripts/train.sh
nohup ./scripts/train.sh > train.log 2>&1 &
```
Training uses Hydra configuration from `src/config.yaml`. Key settings:
- **Backbone:** RoFormer (768 hidden, 8 layers, 12 heads)
- **Optimizer:** AdamW (lr=3e-4, weight_decay=0.075)
- **Data:** 11M SMILES peptide dataset with dynamic batching by length
- **Precision:** fp64
- Checkpoints saved to `checkpoints/` (monitors `val/nll`, saves top 10)
## MCTS-Guided Peptide Generation
Generate therapeutic peptides optimized across multiple objectives using Monte-Carlo Tree Search.
1. Fill in `base_path` in `src/config.yaml` and `src/scoring/scoring_functions.py`.
2. Fill in `HOME_LOC` in `scripts/generate_mcts.sh`.
3. Create output directories: `mkdir -p results logs`
```bash
chmod +x scripts/generate_mcts.sh
# Usage: ./scripts/generate_mcts.sh [PROT_NAME] [PROT_NAME2] [MODE] [MODEL] [LENGTH] [EPOCH]
# Example: Generate peptides targeting GFAP with length 100
nohup ./scripts/generate_mcts.sh gfap "" 2 mcts 100 7 > generate.log 2>&1 &
```
### Available Target Proteins
| Name | Target |
|------|--------|
| `amhr` | AMH Receptor |
| `tfr` | Transferrin Receptor |
| `gfap` | Glial Fibrillary Acidic Protein |
| `glp1` | GLP-1 Receptor |
| `glast` | Excitatory Amino Acid Transporter |
| `ncam` | Neural Cell Adhesion Molecule |
| `cereblon` | Cereblon (CRBN) |
| `ligase` | E3 Ubiquitin Ligase |
| `skp2` | S-Phase Kinase-Associated Protein 2 |
| `p53` | Tumor Suppressor p53 |
| `egfp` | Enhanced Green Fluorescent Protein |
To specify a custom target protein, override `+prot_seq=<amino acid sequence>` and `+prot_name=<name>` as Hydra arguments in the generation script.
### Scoring Objectives
PepTune jointly optimizes across five therapeutic properties via the integrated scoring suite:
| Objective | Property | Model |
|-----------|----------|-------|
| `binding_affinity1` | Binding affinity to target protein | Cross-attention Transformer |
| `solubility` | Aqueous solubility | XGBoost on SMILES CNN embeddings |
| `hemolysis` | Non-hemolytic | SMILES binary classifier |
| `nonfouling` | Non-fouling | SMILES binary classifier |
| `permeability` | Cell membrane permeability | PAMPA CNN |
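Conceptually, the scoring suite in `src/scoring/scoring_functions.py` exposes these predictors behind a single interface: each generated SMILES string maps to one score vector with an entry per objective. A hypothetical sketch of that pattern (class and function names are ours, and the lambda predictors stand in for the trained models):

```python
from typing import Callable, Dict, List

# A score function maps a SMILES string to a value in [0, 1], higher = better.
ScoreFn = Callable[[str], float]

class MultiObjectiveScorer:
    def __init__(self, objectives: Dict[str, ScoreFn]):
        self.objectives = objectives

    def score(self, smiles: str) -> List[float]:
        """Return one score per objective, in a fixed order."""
        return [fn(smiles) for fn in self.objectives.values()]

# Toy predictors for illustration only (the real ones are trained models)
scorer = MultiObjectiveScorer({
    "binding_affinity1": lambda s: 0.7,
    "solubility": lambda s: min(1.0, len(s) / 100),
    "permeability": lambda s: 0.5,
})
print(scorer.score("CC(=O)NC"))  # one vector entry per objective
```

MCTS then compares these vectors via Pareto dominance rather than collapsing them into a single weighted score.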
### Default MCTS Hyperparameters
These can be overridden via Hydra config overrides:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `mcts.num_children` | 50 | Branching factor per MCTS node |
| `mcts.num_iter` | 128 | Number of MCTS iterations |
| `mcts.num_objectives` | 5 | Number of optimization objectives |
| `sampling.steps` | 128 | Diffusion denoising steps |
| `sampling.seq_length` | 200 | Generated peptide length |
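To build intuition for `sampling.steps` and `sampling.seq_length`: masked diffusion sampling starts from a fully masked sequence and commits a fraction of the remaining masked positions at each denoising step. A toy loop showing only this bookkeeping (a uniform-random "denoiser" over a canonical amino-acid alphabet stands in for the trained model, which actually operates on SMILES tokens):

```python
import random

def toy_unmasking(seq_length: int, steps: int,
                  vocab: str = "ACDEFGHIKLMNPQRSTVWY", seed: int = 0) -> str:
    """Start fully masked ('#'); unmask an even share of the remaining
    positions at each step so everything is decided after `steps` steps."""
    rng = random.Random(seed)
    seq = ["#"] * seq_length
    masked = list(range(seq_length))
    rng.shuffle(masked)
    for step in range(steps):
        n = len(masked) // (steps - step)  # even share of what remains
        for _ in range(n):
            i = masked.pop()
            seq[i] = rng.choice(vocab)     # trained denoiser would go here
    return "".join(seq)

peptide = toy_unmasking(seq_length=20, steps=8)
print(peptide)  # 20 characters, no '#' remaining
```

More steps mean fewer positions committed per step, trading compute for finer-grained refinement.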
## Unconditional Generation
Generate peptides without property guidance:
```bash
chmod +x scripts/generate_unconditional.sh
nohup ./scripts/generate_unconditional.sh > generate_unconditional.log 2>&1 &
```
## Evaluation
To summarize metrics after generation, fill in `path` and `prot_name` in `src/metrics.py` and run:
```bash
python src/metrics.py
```
## Citation
If you find this repository helpful in your research, please consider citing our paper:
```bibtex
@inproceedings{tang2025peptune,
  title={{PepTune}: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025}
}
```
## License
This repository is released under the Apache 2.0 License; by using it, you agree to abide by its terms.