GO-GPT: Gene Ontology Prediction from Protein Sequences

GO-GPT is a decoder-only transformer model for predicting Gene Ontology (GO) terms from protein sequences. It combines ESM2 protein language model embeddings with an autoregressive decoder to generate GO term annotations across all three ontology aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).

Quick Start

Clone the repository:

git clone https://github.com/YOUR_ORG/gogpt
cd gogpt

Run the inference notebook or use Python directly:

import sys
sys.path.insert(0, "src")

from gogpt import GOGPTPredictor

# Load from HuggingFace (downloads ~4GB on first run)
predictor = GOGPTPredictor.from_pretrained("armansa1/gogpt-dev")

# Predict GO terms
predictions = predictor.predict(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH",
    organism="Homo sapiens"
)

print(predictions)
# {'MF': ['GO:0003674', 'GO:0005488', ...],
#  'BP': ['GO:0008150', 'GO:0008152', ...],
#  'CC': ['GO:0005575', 'GO:0110165', ...]}
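
To annotate several proteins, the single-sequence call above can be wrapped in a loop. The following is a minimal sketch, not part of the gogpt package: `predict_many` and `count_terms` are hypothetical helper names, and the only assumption about the predictor is that it exposes `predict(sequence=..., organism=...)` and returns a dict keyed by aspect, as in the example output above.

```python
from typing import Dict, Iterable, List, Tuple

# One prediction, shaped like the Quick Start output: aspect -> list of GO IDs.
Prediction = Dict[str, List[str]]


def predict_many(
    predictor, proteins: Iterable[Tuple[str, str, str]]
) -> Dict[str, Prediction]:
    """Call predictor.predict once per (name, sequence, organism) triple.

    `predictor` is assumed to behave like GOGPTPredictor in the Quick Start;
    this helper is hypothetical and not shipped with the repository.
    """
    results: Dict[str, Prediction] = {}
    for name, sequence, organism in proteins:
        results[name] = predictor.predict(sequence=sequence, organism=organism)
    return results


def count_terms(prediction: Prediction) -> Dict[str, int]:
    """Count the predicted GO terms in each aspect (MF, BP, CC)."""
    return {aspect: len(terms) for aspect, terms in prediction.items()}
```

Each protein is processed independently here, so the sketch makes no claim about whether the predictor supports true batched inference.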

Model Architecture

| Component | Description |
|---|---|
| Protein Encoder | ESM2-3B (`facebook/esm2_t36_3B_UR50D`) |
| Decoder | 12-layer GPT with prefix causal attention |
| Embedding Dim | 900 |
| Attention Heads | 12 |
| Total Parameters | ~3.2B (3B ESM2 + 200M decoder) |
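
The table lists "prefix causal attention" without defining it. The sketch below shows the usual prefix-LM reading of that term: every position can see the whole prefix, while positions after the prefix see only themselves and earlier positions. Whether GO-GPT builds its mask exactly this way, and what makes up the prefix, is an assumption; `prefix_causal_mask` is a hypothetical name, not a function from the repository.

```python
from typing import List


def prefix_causal_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """Build a prefix-LM attention mask.

    mask[i][j] is True when query position i may attend to key position j.
    Illustrative only: this follows the common prefix-LM definition and is
    not taken from the GO-GPT source.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask: List[List[bool]] = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            in_prefix = j < prefix_len  # the prefix is visible to every position
            causal = j <= i             # otherwise only current and earlier positions
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 2 prefix positions followed by 2 generated positions.
    for row in prefix_causal_mask(prefix_len=2, total_len=4):
        print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=0` this reduces to an ordinary causal mask, and with `prefix_len=total_len` it becomes fully bidirectional.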

Supported Organisms

GO-GPT supports organism-conditioned prediction for 200 organisms plus an <UNKNOWN> category (201 total). See organism_list.txt for the full list.

Common organisms include:

Homo sapiens
Mus musculus
Escherichia coli (various strains)
Saccharomyces cerevisiae
Arabidopsis thaliana
Drosophila melanogaster

For organisms not in the training set, predictions will use the <UNKNOWN> embedding.
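
The fallback described above amounts to a dictionary lookup with a default. This is a minimal sketch assuming the mapping is a flat name-to-ID dict that contains an `<UNKNOWN>` key; the actual layout of `organism_mapper.json` is not documented here, and `resolve_organism_id` and the toy IDs are hypothetical.

```python
from typing import Dict

UNKNOWN_ORGANISM = "<UNKNOWN>"


def resolve_organism_id(name: str, organism_to_id: Dict[str, int]) -> int:
    """Return the ID for `name`, or the <UNKNOWN> ID when `name` is not listed.

    Assumes `organism_to_id` contains an "<UNKNOWN>" entry; a KeyError is
    raised if it does not.
    """
    unknown_id = organism_to_id[UNKNOWN_ORGANISM]
    return organism_to_id.get(name.strip(), unknown_id)


if __name__ == "__main__":
    # Toy mapping with made-up IDs, for illustration only.
    toy_mapping = {"Homo sapiens": 0, "Mus musculus": 1, UNKNOWN_ORGANISM: 2}
    print(resolve_organism_id("Homo sapiens", toy_mapping))
    print(resolve_organism_id("An unlisted organism", toy_mapping))
```

The lookup is exact-match on the name, so spelling or strain variants would also fall through to `<UNKNOWN>` under this sketch.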

Files in This Repository

| File | Description |
|---|---|
| `model.ckpt` | Model weights (PyTorch Lightning checkpoint) |
| `config.yaml` | Model architecture configuration |
| `tokenizer_info.json` | Token vocabulary metadata |
| `go_tokenizer.json` | GO term to token ID mapping |
| `organism_mapper.json` | Organism name to ID mapping |
| `organism_list.txt` | Human-readable list of 201 supported organisms |
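
After downloading the repository by hand, a quick check that every listed file is present can save a failed load later. The sketch below uses only the standard library; `missing_files` is a hypothetical helper, and it assumes all six files sit at the top level of one local directory.

```python
from pathlib import Path
from typing import List

# File names taken from the table above.
EXPECTED_FILES = [
    "model.ckpt",
    "config.yaml",
    "tokenizer_info.json",
    "go_tokenizer.json",
    "organism_mapper.json",
    "organism_list.txt",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    absent = missing_files(".")
    if absent:
        print("Missing:", ", ".join(absent))
    else:
        print("All expected files are present.")
```

The check only confirms that the files exist; it does not validate their contents.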
