GO-GPT is a decoder-only transformer model for predicting Gene Ontology (GO) terms from protein sequences. It combines ESM2 protein language model embeddings with an autoregressive decoder to generate GO term annotations across all three ontology aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).
```bash
git clone https://github.com/YOUR_ORG/gogpt
cd gogpt
```
```python
import sys
sys.path.insert(0, "src")

from gogpt import GOGPTPredictor

# Load from HuggingFace (downloads ~4GB on first run)
predictor = GOGPTPredictor.from_pretrained("armansa1/gogpt-dev")

# Predict GO terms
predictions = predictor.predict(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH",
    organism="Homo sapiens"
)

print(predictions)
# {'MF': ['GO:0003674', 'GO:0005488', ...],
#  'BP': ['GO:0008150', 'GO:0008152', ...],
#  'CC': ['GO:0005575', 'GO:0110165', ...]}
```
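The sample output begins each aspect with its ontology root (`GO:0003674` molecular_function, `GO:0008150` biological_process, `GO:0005575` cellular_component), which carries no specific information. If only informative terms are wanted, a small post-processing step along the following lines may help. This is a sketch, not part of the GO-GPT API: `drop_root_terms` is a hypothetical helper, and it assumes only the dict-of-lists shape shown in the output.

```python
# Root terms of the three GO aspects (standard Gene Ontology identifiers).
GO_ROOTS = {"GO:0003674", "GO:0008150", "GO:0005575"}


def drop_root_terms(predictions):
    """Return a copy of the predictions dict without the aspect root terms.

    Hypothetical helper; assumes `predictions` maps an aspect name to a
    list of GO term IDs, as in the sample output.
    """
    return {
        aspect: [term for term in terms if term not in GO_ROOTS]
        for aspect, terms in predictions.items()
    }


# Toy input mirroring the shape of the sample output (not real model output).
example = {
    "MF": ["GO:0003674", "GO:0005488"],
    "BP": ["GO:0008150", "GO:0008152"],
    "CC": ["GO:0005575", "GO:0110165"],
}
filtered = drop_root_terms(example)
print(filtered)
```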
| Component | Description |
|---|---|
| Protein Encoder | ESM2-3B (facebook/esm2_t36_3B_UR50D) |
| Decoder | 12-layer GPT with prefix causal attention |
| Embedding Dim | 900 |
| Attention Heads | 12 |
| Total Parameters | ~3.2B (3B ESM2 + 200M decoder) |
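To illustrate what "prefix causal attention" usually means, the sketch below builds the mask commonly used in prefix language models: positions in the prefix (here, presumably the protein embeddings) attend to the whole prefix, while later positions attend to the prefix and to earlier generated tokens only. Whether GO-GPT's mask matches this exactly is an assumption; `prefix_causal_mask` and its parameters are illustrative names, not taken from the repository.

```python
def prefix_causal_mask(prefix_len, total_len):
    """Build a prefix-LM attention mask as a list of boolean rows.

    mask[i][j] is True when query position i may attend to key position j.
    Illustrative only; the actual GO-GPT implementation may differ.
    """
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            key_in_prefix = j < prefix_len  # prefix keys are visible to every query
            key_not_future = j <= i         # ordinary causal constraint
            row.append(key_in_prefix or key_not_future)
        mask.append(row)
    return mask


# Two prefix positions followed by two generated tokens.
mask = prefix_causal_mask(prefix_len=2, total_len=4)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this layout the upper-left block is fully visible (bidirectional within the prefix) and the lower-right block is lower-triangular (causal over generated tokens).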
GO-GPT supports organism-conditioned prediction for 200 organisms plus an `<UNKNOWN>` category (201 total). See `organism_list.txt` for the full list.
Common organisms include:
For organisms not in the training set, the model falls back to the `<UNKNOWN>` organism embedding.
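The fallback described above can be pictured as a dictionary lookup with a default. The sketch below is a minimal illustration under assumptions: `resolve_organism_id` is a hypothetical function, and the toy mapping and its IDs are invented for the example, since the real layout of `organism_mapper.json` is not shown here.

```python
UNKNOWN = "<UNKNOWN>"


def resolve_organism_id(organism, mapper):
    """Return the ID for `organism`, or the <UNKNOWN> ID if it is not listed.

    Hypothetical helper; assumes `mapper` is a flat name-to-ID dict that
    contains an "<UNKNOWN>" entry.
    """
    return mapper.get(organism, mapper[UNKNOWN])


# Toy mapping for illustration only; the real IDs are not shown in this card.
toy_mapper = {"Homo sapiens": 0, "Mus musculus": 1, UNKNOWN: 2}

known_id = resolve_organism_id("Homo sapiens", toy_mapper)
fallback_id = resolve_organism_id("Some unlisted species", toy_mapper)
print(known_id, fallback_id)
```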
| File | Description |
|---|---|
| `model.ckpt` | Model weights (PyTorch Lightning checkpoint) |
| `config.yaml` | Model architecture configuration |
| `tokenizer_info.json` | Token vocabulary metadata |
| `go_tokenizer.json` | GO term to token ID mapping |
| `organism_mapper.json` | Organism name to ID mapping |
| `organism_list.txt` | Human-readable list of 201 supported organisms |