Commit b683966

ProteinGym v1.2 release
1 parent 4c930a5 commit b683966

10 files changed, +301 -286 lines changed

README.md (+12 -11)
@@ -1,6 +1,6 @@
 # ProteinGym
 
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13936340.svg)](https://doi.org/10.5281/zenodo.13936340)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14997691.svg)](https://doi.org/10.5281/zenodo.14997691)
 [![PyPI version](https://badge.fury.io/py/proteingym.svg)](https://badge.fury.io/py/proteingym)
 [![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
@@ -96,17 +96,16 @@ ESCOTT | MSA & Structure | [Mustafa Tekpinar, Laurent David, Thomas Henry, Aless
 VenusREM | MSA & Structure | [Yang Tan, Ruilin Wang, Banghao Wu, Liang Hong, Bingxin Zhou. (2024). Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model. ArXiv, abs/2410.21127.](https://arxiv.org/abs/2410.21127)
 RSALOR | MSA & Structure | [Matsvei Tsishyn, Pauline Hermans, Fabrizio Pucci, Marianne Rooman. (2025). Residue conservation and solvent accessibility are (almost) all you need for predicting mutational effects in proteins. bioRxiv.](https://www.biorxiv.org/content/10.1101/2025.02.03.636212v1)
 S3F | Single sequence & Structure | [Zuobai Zhang, Pascal Notin, Yining Huang, Aurelie C. Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang. (2024). Multi-Scale Representation Learning for Protein Fitness Prediction. NeurIPS](https://papers.nips.cc/paper_files/paper/2024/hash/b7d795e655c1463d7299688d489e8ef4-Abstract-Conference.html)
-
-Except for the WaveNet model (which only uses alignments to recover a set of homologous protein sequences to train on, but then trains on non-aligned sequences), all alignment-based methods are unable to score indels given the fixed coordinate system they are trained on. Similarly, the masked-marginals procedure to generate the masked-marginals for ESM-1v and MSA Transformer requires the position to exist in the wild-type sequence. All the other model architectures listed above (eg., Tranception, RITA, ProGen2) are included in the indel benchmark.
+SiteRM | MSA | [Sebastian Prillo, Wilson Wu, Yun Song. (2024). Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction. NeurIPS.](https://papers.nips.cc/paper_files/paper/2024/hash/eb2f4fb51ac3b8dc4aac9cf71b0e7799-Abstract-Conference.html)
 
 For clinical baselines, we used dbNSFP 4.4a as detailed in the manuscript appendix (and in `proteingym/clinical_benchmark_notebooks/clinical_subs_processing.ipynb`).
 
 ## Resources
 
-To download and unzip the data, use the following template, replacing {VERSION} with the desired version number (e.g., "v1.1") and {FILENAME} with the specific file you want to download, as listed in the table below. The latest version is v1.1.
+To download and unzip the data, use the following template, replacing {VERSION} with the desired version number (e.g., "v1.2") and {FILENAME} with the specific file you want to download, as listed in the table below. The latest version is v1.2.
 For example, you can download & unzip the zero-shot predictions for all baselines for all DMS substitution assays as follows:
 ```
-VERSION="v1.1"
+VERSION="v1.2"
 FILENAME="DMS_ProteinGym_substitutions.zip"
 curl -o ${FILENAME} https://marks.hms.harvard.edu/proteingym/ProteinGym_${VERSION}/${FILENAME}
 unzip ${FILENAME} && rm ${FILENAME}
@@ -116,10 +115,10 @@ Data | Size (unzipped) | Filename
 --- | --- | --- |
 DMS benchmark - Substitutions | 1.0GB | DMS_ProteinGym_substitutions.zip
 DMS benchmark - Indels | 200MB | DMS_ProteinGym_indels.zip
-Zero-shot DMS Model scores - Substitutions | 31GB | zero_shot_substitutions_scores.zip
-Zero-shot DMS Model scores - Indels | 5.2GB | zero_shot_indels_scores.zip
-Supervised DMS Model performance - Substitutions | 2.7MB | DMS_supervised_substitutions_scores.zip
-Supervised DMS Model performance - Indels | 0.9MB | DMS_supervised_indels_scores.zip
+Zero-shot DMS Model scores - Substitutions | 4.4GB | zero_shot_substitutions_scores.zip
+Zero-shot DMS Model scores - Indels | 313MB | zero_shot_indels_scores.zip
+Supervised DMS Model scores - Substitutions | 3.3GB | DMS_supervised_substitutions_scores.zip
+Supervised DMS Model scores - Indels | 215MB | DMS_supervised_indels_scores.zip
 Multiple Sequence Alignments (MSAs) for DMS assays | 5.2GB | DMS_msa_files.zip
 Redundancy-based sequence weights for DMS assays | 200MB | DMS_msa_weights.zip
 Predicted 3D structures from inverse-folding models | 84MB | ProteinGym_AF2_structures.zip
@@ -209,15 +208,17 @@ ESCOTT | http://gitlab.lcqb.upmc.fr/tekpinar/PRESCOTT
 VenusREM | https://github.com/tyang816/VenusREM
 RSALOR | https://github.com/3BioCompBio/RSALOR
 S3F | https://github.com/DeepGraphLearning/S3F
+SiteRM | https://github.com/songlab-cal/CherryML
 
-We would like to thank the GEMME team for providing model scores on an earlier version of the benchmark (ProteinGym v0.1), and the ProtSSN, SaProt, PoET, MULAN, VespaG, ProSST, ESCOTT, VenusREM, and RSALOR teams for integrating their model in the ProteinGym repo.
+We would like to thank the GEMME team for providing model scores on an earlier version of the benchmark (ProteinGym v0.1), and the ProtSSN, SaProt, PoET, MULAN, VespaG, ProSST, ESCOTT, VenusREM, RSALOR, and SiteRM teams for integrating their model in the ProteinGym repo.
 
 Special thanks the teams of experimentalists who developed and performed the assays that ProteinGym is built on. If you are using ProteinGym in your work, please consider citing the corresponding papers. To facilitate this, we have prepared a file (assays.bib) containing the bibtex entries for all these papers.
 
 ## Releases
 
 1. [ProteinGym_v1.0](https://zenodo.org/records/13932633): Initial release.
 2. [ProteinGym_v1.1](https://zenodo.org/records/13936340): Updates to reference file, and addition of ProtSSN and SaProt baselines.
+3. [ProteinGym_v1.2](https://zenodo.org/records/14997691): Added 8 baselines to the zero-shot DMS substitutions benchmark (eg., VenusREM, S3F, Escott). Added all mutation-level predictions for all baselines in supervised benchmarks.
 
 ## License
 This project is available under the MIT license found in the LICENSE file in this GitHub repository.
@@ -242,5 +243,5 @@ If you use ProteinGym in your work, please cite the following paper:
 - Website: https://www.proteingym.org/
 - NeurIPS proceedings: [link to abstract](https://papers.nips.cc/paper_files/paper/2023/hash/cac723e5ff29f65e3fcbb0739ae91bee-Abstract-Datasets_and_Benchmarks.html)
 - Preprint: [link to abstract](https://www.biorxiv.org/content/10.1101/2023.12.07.570727v1)
-- Zenodo: [link to zenodo](https://zenodo.org/records/13936340)
+- Zenodo: [link to zenodo](https://zenodo.org/records/14997691)
 - Pypi: [link to pypi](https://pypi.org/project/proteingym/)
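
Reviewer note: the download template changed in the README hunks above boils down to substituting a version and a filename into one URL. A minimal Python sketch of that substitution follows. The function name `proteingym_url` and the constant `BASE_URL` are hypothetical names introduced here for illustration; only the URL pattern itself comes from the README. The sketch builds the URL and does not perform the download.

```python
from urllib.parse import quote

# URL prefix taken from the README's curl example.
BASE_URL = "https://marks.hms.harvard.edu/proteingym"


def proteingym_url(version: str, filename: str) -> str:
    """Fill in the README's {VERSION}/{FILENAME} template (hypothetical helper)."""
    # quote() leaves letters, digits, '_', '.', '-' and '~' untouched and
    # escapes anything else, so ordinary archive names pass through unchanged.
    return f"{BASE_URL}/ProteinGym_{version}/{quote(filename)}"


if __name__ == "__main__":
    print(proteingym_url("v1.2", "DMS_ProteinGym_substitutions.zip"))
```

Passing `"v1.1"` instead of `"v1.2"` gives the URL the README pointed to before this commit.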
@@ -1,67 +1,67 @@
 DMS_id,ESM-1v Embeddings,MSA Transformer Embeddings,Tranception Embeddings
-B1LPA6_ECOSM_Russ_2020_indels,0.547,0.569,0.793
-BLAT_ECOLX_Gonzalez_2019_indels,0.366,0.313,0.273
-CAPSD_AAV2S_Sinai_2021_designed_indels,0.216,0.205,0.25
-CAPSD_AAV2S_Sinai_2021_library_indels,0.434,0.414,0.299
-HIS7_YEAST_Pokusaeva_2019_indels,0.161,0.198,0.158
-PTEN_HUMAN_Mighell_2018_indels,0.274,0.467,0.365
-P53_HUMAN_Kotler_2018_indels,0.315,0.383,0.348
-KCNJ2_MOUSE_Macdonald_2022_indels,0.517,0.671,0.391
-Q8EG35_SHEON_Campbell_2022_indels,0.57,0.624,0.512
-A4_HUMAN_Seuma_2022_indels,0.262,0.36,0.249
-S22A1_HUMAN_Yee_2023_abundance_indels,0.42,0.613,0.44
-S22A1_HUMAN_Yee_2023_activity_indels,0.427,0.606,0.397
-AMFR_HUMAN_Rocklin_2023_4G3O_indels,0.267,0.467,0.848
-ARGR_ECOLI_Rocklin_2023_1AOY_indels,0.138,0.259,0.204
-BBC1_YEAST_Rocklin_2023_1TG0_indels,0.224,0.328,0.391
-BCHB_CHLTE_Rocklin_2023_2KRU_indels,0.194,0.503,0.449
-CATR_CHLRE_Rocklin_2023_2AMI_indels,0.192,0.362,0.247
-CBPA2_HUMAN_Rocklin_2023_1O6X_indels,0.131,0.292,0.346
-CBX4_HUMAN_Rocklin_2023_2K28_indels,0.246,0.423,0.409
-CSN4_MOUSE_Rocklin_2023_1UFM_indels,0.087,0.292,0.288
-CUE1_YEAST_Rocklin_2023_2MYX_indels,0.216,0.361,0.56
-DN7A_SACS2_Rocklin_2023_1JIC_indels,0.282,0.547,0.549
-DNJA1_HUMAN_Rocklin_2023_2LO1_indels,0.128,0.25,0.162
-DOCK1_MOUSE_Rocklin_2023_2M0Y_indels,0.281,0.491,0.353
-EPHB2_HUMAN_Rocklin_2023_1F0M_indels,0.125,0.299,0.307
-FECA_ECOLI_Rocklin_2023_2D1U_indels,0.488,0.508,0.592
-HCP_LAMBD_Rocklin_2023_2L6Q_indels,0.091,0.267,0.238
-HECD1_HUMAN_Rocklin_2023_3DKM_indels,0.63,0.568,0.645
-ILF3_HUMAN_Rocklin_2023_2L33_indels,0.135,0.304,0.245
-MAFG_MOUSE_Rocklin_2023_1K1V_indels,0.257,0.548,0.493
-MBD11_ARATH_Rocklin_2023_6ACV_indels,0.272,0.377,0.597
-MYO3_YEAST_Rocklin_2023_2BTT_indels,0.392,0.612,0.514
-NKX31_HUMAN_Rocklin_2023_2L9R_indels,0.153,0.367,0.156
-NUSA_ECOLI_Rocklin_2023_1WCL_indels,0.154,0.284,0.527
-NUSG_MYCTU_Rocklin_2023_2MI6_indels,0.22,0.388,0.327
-OBSCN_HUMAN_Rocklin_2023_1V1C_indels,0.178,0.282,0.342
-ODP2_GEOSE_Rocklin_2023_1W4G_indels,0.46,0.682,0.543
-OTU7A_HUMAN_Rocklin_2023_2L2D_indels,0.216,0.422,0.56
-PIN1_HUMAN_Rocklin_2023_1I6C_indels,0.113,0.389,0.269
-PITX2_HUMAN_Rocklin_2023_2L7M_indels,0.495,0.773,0.605
-PKN1_HUMAN_Rocklin_2023_1URF_indels,0.082,0.178,0.156
-POLG_PESV_Rocklin_2023_2MXD_indels,0.198,0.495,0.384
-PR40A_HUMAN_Rocklin_2023_1UZC_indels,0.168,0.312,0.279
-PSAE_SYNP2_Rocklin_2023_1PSE_indels,0.24,0.259,0.311
-RAD_ANTMA_Rocklin_2023_2CJJ_indels,0.193,0.33,0.294
-RCD1_ARATH_Rocklin_2023_5OAO_indels,0.251,0.359,0.424
-RD23A_HUMAN_Rocklin_2023_1IFY_indels,0.154,0.349,0.29
-RPC1_BP434_Rocklin_2023_1R69_indels,0.484,0.511,0.431
-RS15_GEOSE_Rocklin_2023_1A32_indels,0.203,0.355,0.331
-SAV1_MOUSE_Rocklin_2023_2YSB_indels,0.531,0.708,0.648
-SDA_BACSU_Rocklin_2023_1PV0_indels,0.113,0.197,0.184
-SOX30_HUMAN_Rocklin_2023_7JJK_indels,0.195,0.336,0.566
-SPG2_STRSG_Rocklin_2023_5UBS_indels,0.16,0.256,0.262
-SPTN1_CHICK_Rocklin_2023_1TUD_indels,0.237,0.332,0.471
-SQSTM_MOUSE_Rocklin_2023_2RRU_indels,0.225,0.559,0.442
-SR43C_ARATH_Rocklin_2023_2N88_indels,0.223,0.323,0.235
-SRBS1_HUMAN_Rocklin_2023_2O2W_indels,0.153,0.366,0.201
-TCRG1_MOUSE_Rocklin_2023_1E0L_indels,0.118,0.22,0.329
-THO1_YEAST_Rocklin_2023_2WQG_indels,0.233,0.421,0.623
-TNKS2_HUMAN_Rocklin_2023_5JRT_indels,0.22,0.405,0.477
-UBE4B_HUMAN_Rocklin_2023_3L1X_indels,0.348,0.431,0.533
-UBR5_HUMAN_Rocklin_2023_1I2T_indels,0.196,0.344,0.288
-VG08_BPP22_Rocklin_2023_2GP8_indels,0.187,0.291,0.394
-VILI_CHICK_Rocklin_2023_1YU5_indels,0.133,0.215,0.184
-VRPI_BPT7_Rocklin_2023_2WNM_indels,0.235,0.496,0.474
-YNZC_BACSU_Rocklin_2023_2JVD_indels,0.1,0.343,0.303
+A4_HUMAN_Seuma_2022_indels,0.283,0.302,0.248
+AMFR_HUMAN_Tsuboyama_2023_4G3O_indels,0.323,0.345,0.713
+ARGR_ECOLI_Tsuboyama_2023_1AOY_indels,0.148,0.262,0.18
+B1LPA6_ECOSM_Russ_2020_indels,0.522,0.552,0.762
+BBC1_YEAST_Tsuboyama_2023_1TG0_indels,0.225,0.317,0.357
+BCHB_CHLTE_Tsuboyama_2023_2KRU_indels,0.224,0.361,0.319
+BLAT_ECOLX_Gonzalez_2019_indels,0.485,0.302,0.288
+CAPSD_AAV2S_Sinai_2021_designed_indels,0.441,0.205,0.217
+CAPSD_AAV2S_Sinai_2021_library_indels,0.614,0.362,0.321
+CATR_CHLRE_Tsuboyama_2023_2AMI_indels,0.163,0.303,0.233
+CBPA2_HUMAN_Tsuboyama_2023_1O6X_indels,0.157,0.217,0.302
+CBX4_HUMAN_Tsuboyama_2023_2K28_indels,0.26,0.413,0.397
+CSN4_MOUSE_Tsuboyama_2023_1UFM_indels,0.088,0.205,0.283
+CUE1_YEAST_Tsuboyama_2023_2MYX_indels,0.248,0.258,0.432
+DN7A_SACS2_Tsuboyama_2023_1JIC_indels,0.32,0.301,0.437
+DNJA1_HUMAN_Tsuboyama_2023_2LO1_indels,0.108,0.192,0.118
+DOCK1_MOUSE_Tsuboyama_2023_2M0Y_indels,0.357,0.335,0.341
+EPHB2_HUMAN_Tsuboyama_2023_1F0M_indels,0.114,0.252,0.246
+FECA_ECOLI_Tsuboyama_2023_2D1U_indels,0.471,0.449,0.598
+HCP_LAMBD_Tsuboyama_2023_2L6Q_indels,0.1,0.227,0.142
+HECD1_HUMAN_Tsuboyama_2023_3DKM_indels,0.747,0.564,0.656
+HIS7_YEAST_Pokusaeva_2019_indels,0.182,0.157,0.164
+ILF3_HUMAN_Tsuboyama_2023_2L33_indels,0.157,0.22,0.241
+KCNJ2_MOUSE_Macdonald_2022_indels,0.596,0.458,0.423
+MAFG_MOUSE_Tsuboyama_2023_1K1V_indels,0.311,0.312,0.262
+MBD11_ARATH_Tsuboyama_2023_6ACV_indels,0.37,0.363,0.458
+MYO3_YEAST_Tsuboyama_2023_2BTT_indels,0.383,0.512,0.495
+NKX31_HUMAN_Tsuboyama_2023_2L9R_indels,0.163,0.252,0.15
+NUSA_ECOLI_Tsuboyama_2023_1WCL_indels,0.179,0.26,0.472
+NUSG_MYCTU_Tsuboyama_2023_2MI6_indels,0.197,0.308,0.258
+OBSCN_HUMAN_Tsuboyama_2023_1V1C_indels,0.194,0.244,0.34
+ODP2_GEOSE_Tsuboyama_2023_1W4G_indels,0.516,0.534,0.53
+OTU7A_HUMAN_Tsuboyama_2023_2L2D_indels,0.209,0.414,0.361
+P53_HUMAN_Kotler_2018_indels,0.406,0.288,0.325
+PIN1_HUMAN_Tsuboyama_2023_1I6C_indels,0.12,0.197,0.224
+PITX2_HUMAN_Tsuboyama_2023_2L7M_indels,0.387,0.538,0.465
+PKN1_HUMAN_Tsuboyama_2023_1URF_indels,0.089,0.181,0.129
+POLG_PESV_Tsuboyama_2023_2MXD_indels,0.178,0.3,0.284
+PR40A_HUMAN_Tsuboyama_2023_1UZC_indels,0.221,0.268,0.243
+PSAE_PICP2_Tsuboyama_2023_1PSE_indels,0.219,0.234,0.313
+PTEN_HUMAN_Mighell_2018_indels,0.299,0.264,0.384
+Q8EG35_SHEON_Campbell_2022_indels,0.495,0.539,0.534
+RAD_ANTMA_Tsuboyama_2023_2CJJ_indels,0.19,0.263,0.224
+RCD1_ARATH_Tsuboyama_2023_5OAO_indels,0.306,0.265,0.34
+RD23A_HUMAN_Tsuboyama_2023_1IFY_indels,0.151,0.186,0.222
+RPC1_BP434_Tsuboyama_2023_1R69_indels,0.476,0.462,0.398
+RS15_GEOSE_Tsuboyama_2023_1A32_indels,0.212,0.335,0.236
+S22A1_HUMAN_Yee_2023_abundance_indels,0.472,0.471,0.44
+S22A1_HUMAN_Yee_2023_activity_indels,0.487,0.373,0.397
+SAV1_MOUSE_Tsuboyama_2023_2YSB_indels,0.56,0.625,0.569
+SDA_BACSU_Tsuboyama_2023_1PV0_indels,0.157,0.163,0.132
+SOX30_HUMAN_Tsuboyama_2023_7JJK_indels,0.188,0.351,0.383
+SPG2_STRSG_Tsuboyama_2023_5UBS_indels,0.161,0.255,0.229
+SPTN1_CHICK_Tsuboyama_2023_1TUD_indels,0.245,0.262,0.299
+SQSTM_MOUSE_Tsuboyama_2023_2RRU_indels,0.232,0.369,0.396
+SR43C_ARATH_Tsuboyama_2023_2N88_indels,0.219,0.25,0.271
+SRBS1_HUMAN_Tsuboyama_2023_2O2W_indels,0.161,0.288,0.194
+TCRG1_MOUSE_Tsuboyama_2023_1E0L_indels,0.109,0.174,0.216
+THO1_YEAST_Tsuboyama_2023_2WQG_indels,0.205,0.247,0.423
+TNKS2_HUMAN_Tsuboyama_2023_5JRT_indels,0.198,0.349,0.377
+UBE4B_HUMAN_Tsuboyama_2023_3L1X_indels,0.362,0.397,0.463
+UBR5_HUMAN_Tsuboyama_2023_1I2T_indels,0.236,0.302,0.258
+VG08_BPP22_Tsuboyama_2023_2GP8_indels,0.177,0.261,0.266
+VILI_CHICK_Tsuboyama_2023_1YU5_indels,0.173,0.177,0.142
+VRPI_BPT7_Tsuboyama_2023_2WNM_indels,0.289,0.449,0.393
+YNZC_BACSU_Tsuboyama_2023_2JVD_indels,0.108,0.216,0.189
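
Reviewer note: the score file above has one row per assay (`DMS_id`) and one column per embedding model. A minimal Python sketch of how such a file could be read to find the top-scoring column per assay follows. It uses three rows copied from the new side of the diff as sample data. The helper name `best_model_per_assay` is hypothetical, and the sketch assumes that a higher score is better, which the diff itself does not state.

```python
import csv
import io

# Header plus three rows copied from the added ('+') side of the diff.
SAMPLE = """DMS_id,ESM-1v Embeddings,MSA Transformer Embeddings,Tranception Embeddings
A4_HUMAN_Seuma_2022_indels,0.283,0.302,0.248
B1LPA6_ECOSM_Russ_2020_indels,0.522,0.552,0.762
BLAT_ECOLX_Gonzalez_2019_indels,0.485,0.302,0.288
"""


def best_model_per_assay(csv_text: str) -> dict:
    """Map each DMS_id to the column with the highest score (hypothetical helper).

    Assumption: higher scores are better; the metric is not named in the diff.
    """
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        assay = row.pop("DMS_id")  # the remaining keys are model columns
        scores = {model: float(value) for model, value in row.items()}
        best[assay] = max(scores, key=scores.get)
    return best


if __name__ == "__main__":
    for assay, model in best_model_per_assay(SAMPLE).items():
        print(f"{assay}: {model}")
```

Note that most Rocklin_2023 assay IDs on the removed side reappear as Tsuboyama_2023 on the added side, so comparing old and new scores row by row would first need an ID mapping.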