SaProtHub/EC-classification-35M


ysy20020107 committed on Mar 24

Commit 42dc529 (verified) · 1 parent: db34cae

Update README.md

Files changed (1)

README.md +49 -5


@@ -7,19 +7,64 @@ metrics:
 accuracy: 0.68
 ---
-# Model Card for Model-Demo-35M
-<slot name='description'>
 ## Task type
 Protein-level Classification
 ## Model input type
-AA Sequence
-## LoRA config
 - **r:** 8
 - **lora_dropout:** 0.1
 - **lora_alpha:** 16
@@ -27,7 +72,6 @@ AA Sequence
 - **modules_to_save:** ['classifier']
 ## Training config
 - **optimizer:**
   - **class:** AdamW
   - **betas:** (0.9, 0.98)

 accuracy: 0.68
 ---
+# Model Card for Model-Demo-35M
+## Description
+This is a protein **EC number classification model** based on SaProt_35M_AF2 and fine-tuned with LoRA. It classifies proteins into **6 major EC classes (EC1-EC6)**. Because the raw dataset contains only 31 EC7 samples, that class is excluded from training and prediction.
+Label mapping:
+- **Label 0**: Oxidoreductase (EC1)
+- **Label 1**: Transferase (EC2)
+- **Label 2**: Hydrolase (EC3)
+- **Label 3**: Lyase (EC4)
+- **Label 4**: Isomerase (EC5)
+- **Label 5**: Ligase (EC6)
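For downstream use, the label mapping above can be written as a small lookup table. The sketch below is illustrative only: `EC_LABELS`, `describe_label`, and `predict_label` are hypothetical names, and the logits are made-up numbers rather than real model output.

```python
# Label ids and class names are copied from the model card.
# All identifiers here are hypothetical helpers, not part of any SaProt API.
EC_LABELS = {
    0: ("EC1", "Oxidoreductase"),
    1: ("EC2", "Transferase"),
    2: ("EC3", "Hydrolase"),
    3: ("EC4", "Lyase"),
    4: ("EC5", "Isomerase"),
    5: ("EC6", "Ligase"),
}


def describe_label(label_id: int) -> str:
    """Turn a predicted label id into a readable class name."""
    ec_class, name = EC_LABELS[label_id]
    return f"{name} ({ec_class})"


def predict_label(logits: list[float]) -> int:
    """Return the index of the largest logit (argmax over the 6 classes)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up logits for illustration only.
example_logits = [0.1, 2.3, 0.4, -1.0, 0.0, 0.2]
print(describe_label(predict_label(example_logits)))
```

Because EC7 is excluded from training, any input is mapped to one of these six classes.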
+Training data is obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
+To address the **class imbalance problem** in the training set, we oversampled minority classes by duplicating their samples:
+- Label 4 (EC5) samples were duplicated **2 times**
+- Label 5 (EC6) samples were duplicated **1 time**
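The duplication step described above could look roughly like the following. This is a sketch under stated assumptions, not the authors' script: the card does not say whether "duplicated 2 times" counts the original sample, so here it is read as "2 extra copies are appended". The function name `oversample` and the toy sequences are hypothetical.

```python
from collections import Counter

# Assumption: "duplicated N times" means N extra copies on top of the original.
EXTRA_COPIES = {4: 2, 5: 1}  # label id -> number of extra copies


def oversample(samples, extra_copies):
    """Repeat each (sequence, label) pair 1 + extra_copies[label] times."""
    out = []
    for sequence, label in samples:
        out.extend([(sequence, label)] * (1 + extra_copies.get(label, 0)))
    return out


# Toy training split for illustration only (not real data).
toy_train = [("MKTAYIAK", 0), ("GAVLIPFW", 4), ("LLPQRSTV", 5)]
balanced = oversample(toy_train, EXTRA_COPIES)
counts = Counter(label for _, label in balanced)
print(counts)
```

The card describes this step for the training set only, so validation and test samples would be left untouched.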
 ## Task type
 Protein-level Classification
 ## Model input type
+Amino acid sequence (AA Sequence)
+## Dataset Distribution
+### Training set
+- Label 0: 1497 (28.5%)
+- Label 2: 1217 (23.2%)
+- Label 1: 1050 (20.0%)
+- Label 3: 512 (9.7%)
+- Label 4: 496 (9.4%)
+- Label 5: 483 (9.2%)
+Total: 5255 samples
+### Validation set
+- Label 0: 187 (32.0%)
+- Label 2: 152 (26.0%)
+- Label 1: 131 (22.4%)
+- Label 3: 64 (10.9%)
+- Label 4: 31 (5.3%)
+- Label 5: 20 (3.4%)
+Total: 585 samples
+### Test set
+- Label 0: 188 (31.8%)
+- Label 2: 153 (25.9%)
+- Label 1: 132 (22.3%)
+- Label 3: 65 (11.0%)
+- Label 4: 32 (5.4%)
+- Label 5: 21 (3.6%)
+Total: 591 samples
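As a quick sanity check, the split totals and per-class shares can be recomputed from the counts listed above. The counts are copied from the card; `SPLITS` and `class_shares` are hypothetical names used only for this sketch.

```python
# Per-class sample counts copied from the model card (label id -> count).
SPLITS = {
    "train": {0: 1497, 1: 1050, 2: 1217, 3: 512, 4: 496, 5: 483},
    "validation": {0: 187, 1: 131, 2: 152, 3: 64, 4: 31, 5: 20},
    "test": {0: 188, 1: 132, 2: 153, 3: 65, 4: 32, 5: 21},
}


def class_shares(counts):
    """Return each class's share of the split in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


for name, counts in SPLITS.items():
    print(name, sum(counts.values()), class_shares(counts))
```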
+## Performance (on test set)
+- **Accuracy: 0.68**
+## LoRA config
 - **r:** 8
 - **lora_dropout:** 0.1
 - **lora_alpha:** 16
 - **modules_to_save:** ['classifier']
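To make the LoRA settings above concrete, the sketch below computes two quantities that follow from them, assuming the standard LoRA formulation in which the low-rank update is scaled by `lora_alpha / r`. Only the four fields shown in the card are used. The layer width `HIDDEN` is a hypothetical value chosen for illustration, not a statement about the SaProt architecture.

```python
# Values copied from the LoRA config shown in the model card.
lora_config = {
    "r": 8,
    "lora_dropout": 0.1,
    "lora_alpha": 16,
    "modules_to_save": ["classifier"],
}


def lora_scaling(cfg):
    """Standard LoRA scales the low-rank update B @ A by lora_alpha / r."""
    return cfg["lora_alpha"] / cfg["r"]


def lora_params_per_matrix(d_in, d_out, r):
    """Parameters added to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


HIDDEN = 480  # hypothetical layer width, for illustration only
print(lora_scaling(lora_config))
print(lora_params_per_matrix(HIDDEN, HIDDEN, lora_config["r"]))
```

With `r = 8` and `lora_alpha = 16` the update is scaled by 2. The `modules_to_save` entry means the classifier head is trained and stored in full rather than through a low-rank adapter.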
 ## Training config
 - **optimizer:**
   - **class:** AdamW
   - **betas:** (0.9, 0.98)