ysy20020107 committed on
Commit 42dc529 · verified · 1 Parent(s): db34cae

Update README.md

Files changed (1)
  1. README.md +49 -5
README.md CHANGED
@@ -7,19 +7,64 @@ metrics:
7
  accuracy: 0.68
8
  ---
9
 
 
10
 
 
 
11
 
12
- # Model Card for Model-Demo-35M
13
- <slot name='description'>
14
 
15
  ## Task type
16
  Protein-level Classification
17
 
18
  ## Model input type
19
- AA Sequence
20
 
21
- ## LoRA config
22
 
23
  - **r:** 8
24
  - **lora_dropout:** 0.1
25
  - **lora_alpha:** 16
@@ -27,7 +72,6 @@ AA Sequence
27
  - **modules_to_save:** ['classifier']
28
 
29
  ## Training config
30
-
31
  - **optimizer:**
32
  - **class:** AdamW
33
  - **betas:** (0.9, 0.98)
 
7
  accuracy: 0.68
8
  ---
9
 
10
+ # Model Card for Model-Demo-35M
11
 
12
+ ## Description
13
+ This is a protein **EC number classification model** based on SaProt_35M_AF2 and fine-tuned with LoRA. The model classifies proteins into the **6 top-level EC classes (EC1-EC6)**. EC7 (translocases) has only 31 samples in the raw dataset, so it is excluded from both training and prediction.
14
 
15
+ Label mapping:
16
+ - **Label 0**: Oxidoreductase (EC1)
17
+ - **Label 1**: Transferase (EC2)
18
+ - **Label 2**: Hydrolase (EC3)
19
+ - **Label 3**: Lyase (EC4)
20
+ - **Label 4**: Isomerase (EC5)
21
+ - **Label 5**: Ligase (EC6)
22
+
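The label mapping above can be written down as a small lookup table. This is a minimal sketch, not code from the model repository: the names `EC_CLASSES`, `describe_label`, and `label_from_ec_number` are hypothetical, and the rule "label = top-level EC digit minus one" is simply read off the list.

```python
from typing import Optional

# Label index -> (EC class, enzyme family), copied from the label mapping list.
EC_CLASSES = {
    0: ("EC1", "Oxidoreductase"),
    1: ("EC2", "Transferase"),
    2: ("EC3", "Hydrolase"),
    3: ("EC4", "Lyase"),
    4: ("EC5", "Isomerase"),
    5: ("EC6", "Ligase"),
}


def describe_label(label: int) -> str:
    """Turn a predicted label index into a readable name (hypothetical helper)."""
    ec, name = EC_CLASSES[label]
    return f"{name} ({ec})"


def label_from_ec_number(ec_number: str) -> Optional[int]:
    """Map a full EC number such as '3.4.21.4' to a label index.

    Returns None when the top-level class is not one of the six covered
    classes, for example EC7, which the model card excludes.
    """
    top_level = int(ec_number.split(".")[0])
    label = top_level - 1
    return label if label in EC_CLASSES else None


print(describe_label(2))
print(label_from_ec_number("7.1.1.1"))
```

A helper like `label_from_ec_number` would be one way to derive training labels from annotated EC numbers while dropping EC7 entries; how the original dataset was actually labeled is not stated here.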
23
+ Training data was obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
24
+
25
+ To reduce **class imbalance** in the training set, we oversampled the minority classes by duplicating their samples:
26
+ - Label 4 (EC5) samples were duplicated **2 times**
27
+ - Label 5 (EC6) samples were duplicated **1 time**
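The duplication step can be sketched as follows. This is an illustration under stated assumptions, not the script used for this model: the function name, the toy sequences, and the reading of "duplicated N times" as "N extra copies per sample" are all assumptions, and the README does not say which reading applies.

```python
from collections import Counter


def oversample_by_duplication(samples, extra_copies):
    """Return a new list where each sample appears 1 + extra_copies[label] times.

    `samples` is a list of (sequence, label) pairs. Labels missing from
    `extra_copies` are kept exactly once.
    """
    out = []
    for sequence, label in samples:
        out.extend([(sequence, label)] * (1 + extra_copies.get(label, 0)))
    return out


# Hypothetical toy data, not real training sequences.
toy_train = [("MKT", 0), ("MGA", 0), ("MLS", 4), ("MPV", 5)]

# Assumption: "duplicated 2 times" / "duplicated 1 time" means 2 / 1 extra copies.
EXTRA_COPIES = {4: 2, 5: 1}

balanced = oversample_by_duplication(toy_train, EXTRA_COPIES)
counts = Counter(label for _, label in balanced)
print(counts)
```

Applying the duplication only to the training split, after the train/validation/test split has been made, keeps copied samples out of the evaluation sets.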
28
 
29
  ## Task type
30
  Protein-level Classification
31
 
32
  ## Model input type
33
+ Amino acid sequence (AA Sequence)
34
 
35
+ ## Dataset Distribution
36
+
37
+ ### Training set
38
+ - Label 0: 1497 (28.5%)
39
+ - Label 2: 1217 (23.2%)
40
+ - Label 1: 1050 (20.0%)
41
+ - Label 3: 512 (9.7%)
42
+ - Label 4: 496 (9.4%)
43
+ - Label 5: 483 (9.2%)
44
+ - **Total:** 5255 samples
45
 
46
+ ### Validation set
47
+ - Label 0: 187 (32.0%)
48
+ - Label 2: 152 (26.0%)
49
+ - Label 1: 131 (22.4%)
50
+ - Label 3: 64 (10.9%)
51
+ - Label 4: 31 (5.3%)
52
+ - Label 5: 20 (3.4%)
53
+ - **Total:** 585 samples
54
+
55
+ ### Test set
56
+ - Label 0: 188 (31.8%)
57
+ - Label 2: 153 (25.9%)
58
+ - Label 1: 132 (22.3%)
59
+ - Label 3: 65 (11.0%)
60
+ - Label 4: 32 (5.4%)
61
+ - Label 5: 21 (3.6%)
62
+ - **Total:** 591 samples
63
+
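The per-class shares and totals listed for the three splits can be recomputed from the raw counts. The sketch below copies the counts from the lists above; the helper names `shares` and `majority_baseline` are hypothetical. The majority-class share is included because, on an imbalanced test set, it is a useful reference point for reading the reported accuracy of 0.68.

```python
# Per-class sample counts copied from the dataset distribution lists.
SPLITS = {
    "train": {0: 1497, 2: 1217, 1: 1050, 3: 512, 4: 496, 5: 483},
    "validation": {0: 187, 2: 152, 1: 131, 3: 64, 4: 31, 5: 20},
    "test": {0: 188, 2: 153, 1: 132, 3: 65, 4: 32, 5: 21},
}


def shares(counts):
    """Return (total, {label: percentage rounded to one decimal})."""
    total = sum(counts.values())
    percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}
    return total, percentages


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class in this split."""
    return max(counts.values()) / sum(counts.values())


for name, counts in SPLITS.items():
    total, percentages = shares(counts)
    print(f"{name}: total={total}, shares={percentages}")

print(f"test majority-class baseline: {majority_baseline(SPLITS['test']):.3f}")
```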
64
+ ## Performance (on test set)
65
+ - **Accuracy: 0.68**
66
+
67
+ ## LoRA config
68
  - **r:** 8
69
  - **lora_dropout:** 0.1
70
  - **lora_alpha:** 16
 
72
  - **modules_to_save:** ['classifier']
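To make the LoRA values concrete, here is a small standard-library sketch of what they imply. It assumes the standard LoRA formulation, in which the low-rank update is scaled by `lora_alpha / r`. The class name `LoraSettings` and the layer size of 480 are hypothetical and not taken from this README, and the config field hidden by the diff is deliberately left out. With the Hugging Face `peft` library, values like these would typically be passed to `LoraConfig`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoraSettings:
    """Container for the LoRA values shown in the README (hypothetical class)."""

    r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    modules_to_save: Tuple[str, ...] = ("classifier",)

    @property
    def scaling(self) -> float:
        # Standard LoRA: the update B @ A is multiplied by lora_alpha / r.
        return self.lora_alpha / self.r

    def adapter_params(self, d_in: int, d_out: int) -> int:
        # A has shape (r, d_in) and B has shape (d_out, r).
        return self.r * (d_in + d_out)


settings = LoraSettings()
print(settings.scaling)

# Hypothetical 480 x 480 projection, used only to show the parameter count.
print(settings.adapter_params(480, 480), "trainable vs", 480 * 480, "frozen weights")
```

Modules listed in `modules_to_save` (here the classifier head) are trained in full rather than through low-rank adapters.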
73
 
74
  ## Training config
 
75
  - **optimizer:**
76
  - **class:** AdamW
77
  - **betas:** (0.9, 0.98)