The datasets provided here are intended for binary (2-state) protein binding prediction. We provide the following four datasets:
- binding_metal.fasta: Binding to metal ions (0/1)
- binding_nuclear.fasta: Binding to nucleic acids (0/1)
- binding_small.fasta: Binding to small molecules (0/1)
- binding_combined.fasta: Binding to metal ions, nucleic acids OR small molecules (0/1)
The datasets were compiled from the data provided in the bindEmbed repository. The sets are defined as follows:
- Training: Data from the development set
- Validation: Stratified random 10% split of training
- Test: Data from the independent set
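As an illustration of what a stratified random 10% split means, here is a minimal sketch using only the Python standard library. It is not the procedure used to build these files: the function name `stratified_split`, the seed, and the assumption of one class label per sequence ID are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=42):
    """Split IDs into train and validation sets, preserving class proportions.

    labels: dict mapping a sequence ID to its class label (assumed: one
    label per ID). Returns a tuple (train_ids, val_ids).
    """
    # Group the IDs by class so each class can be sampled separately.
    by_class = defaultdict(list)
    for seq_id, label in labels.items():
        by_class[label].append(seq_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_ids, val_ids = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        # Take the same fraction from every class.
        n_val = round(len(ids) * val_fraction)
        val_ids.extend(ids[:n_val])
        train_ids.extend(ids[n_val:])
    return train_ids, val_ids
```

Sampling within each class separately keeps the ratio of binding to non-binding entries in the validation set close to that of the training set, which matters when one class is much rarer than the other.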
The datasets are provided in biotrainer-ready FASTA format. Each entry consists of a header and a sequence; the header gives the sequence ID, the set (train/val/test) and the target label.
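The following sketch shows one way such entries could be read without any dependencies. It assumes the header has the form `>ID KEY=value KEY=value`; the key names `SET` and `TARGET` and the IDs in the usage example are assumptions for illustration, so check them against the actual files. The helper names `parse_fasta` and `_make_record` are hypothetical.

```python
def _make_record(header, seq_lines):
    """Build one record from a header line (without '>') and its sequence lines."""
    fields = header.split()
    seq_id = fields[0] if fields else ""
    # Assumed layout: every remaining field is a KEY=value pair.
    attributes = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return {"id": seq_id, "attributes": attributes, "sequence": "".join(seq_lines)}


def parse_fasta(text):
    """Parse FASTA text into a list of dicts with id, attributes and sequence."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                records.append(_make_record(header, seq_lines))
            header, seq_lines = line[1:], []
        else:
            # Sequences may be wrapped over several lines.
            seq_lines.append(line)
    if header is not None:
        records.append(_make_record(header, seq_lines))
    return records
```

For example, with made-up entries:

```python
example = ">Seq1 SET=train TARGET=1\nMKV\nLLA\n>Seq2 SET=test TARGET=0\nGG\n"
for record in parse_fasta(example):
    print(record["id"], record["attributes"], record["sequence"])
```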
The bindEmbed paper contains benchmarks for the binding prediction tasks. The independent set used for these datasets is TestSetNew46.
@Article{Littmann2021b,
author = {Littmann, Maria and Heinzinger, Michael and Dallago, Christian and Weissenow, Konstantin and Rost, Burkhard},
journal = {Scientific Reports},
title = {Protein embeddings and deep learning predict binding residues for various ligand classes},
year = {2021},
issn = {2045-2322},
month = dec,
number = {1},
volume = {11},
doi = {10.1038/s41598-021-03431-4},
publisher = {Springer Science and Business Media LLC},
}
The raw data downloaded from the aforementioned publication is subject to the MIT license. Modified data available in this repository falls under AFL-3.