SeaMoon: Prediction of Molecular Motions Based on Language Models

SeaMoon is a deep learning framework that predicts protein motions from amino acid sequences. It leverages embeddings from protein language models, such as the sequence-only ESM-2 (Lin et al. 2022), the multimodal ESM3 (Hayes et al. 2024), or the sequence-structure bilingual ProstT5 (Heinzinger et al. 2023). Given a query protein sequence, SeaMoon outputs sets of 3D displacement vectors, one per C-alpha atom, within an invariant subspace. Each set can be interpreted as a linear motion.
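To make the "linear motion" reading concrete, here is a minimal sketch of how one set of per-residue displacement vectors could be used to deform a structure along a straight line. The function name `apply_linear_motion`, the amplitude values, and the toy coordinates are assumptions made for illustration; they are not part of SeaMoon's code or output format.

```python
# Hypothetical illustration: displace C-alpha coordinates along one motion.
def apply_linear_motion(coords, motion, amplitude):
    """Return coords + amplitude * motion, with one 3D vector per residue."""
    if len(coords) != len(motion):
        raise ValueError("coords and motion need one entry per residue")
    return [
        tuple(c + amplitude * d for c, d in zip(xyz, dxyz))
        for xyz, dxyz in zip(coords, motion)
    ]

# Toy 3-residue example (made-up numbers, not SeaMoon output).
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
motion = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]

# Sampling several amplitudes traces a straight-line path for each atom.
trajectory = [apply_linear_motion(ca_coords, motion, a) for a in (-1.0, 0.0, 1.0)]
```

An amplitude of zero returns the input coordinates, and opposite amplitudes move each atom in opposite directions along the same line.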

Quick Start

Setup Environment

Create a new conda environment and activate it:

conda create --name seamoon python=3.11.9
conda activate seamoon

Install dependencies:
```
pip install -r requirements.txt
```
If you wish to use --torque-mode (see below) during inference or evaluation, you will need a working version of the Wolfram Engine. Make sure to specify the path to your WolframKernel at line 30 of eval.py. We used Wolfram Engine v14.0.

Test Run

A small test dataset of 100 input samples is included in data_set to validate all main functions. If you wish to generate ground truth data and pre-compute embeddings (ProstT5 by default) for all of them, you can use:

python -m seamoon precompute-w-gt

If you wish to skip pre-computing, pre-computed data for 10 input samples are provided in data_set/training_data. You can launch SeaMoon inference (infer) and prediction evaluation (evaluate) directly on them.

Infer -- predict motion tensors (3 by default) from the input embeddings:
```
python -m seamoon infer
```
Evaluate -- optimally align all predictions against all ground-truth principal components and compute the normalised errors:
```
python -m seamoon evaluate
```

The full dataset from the paper can be downloaded here.

Usage

Pre-compute Embeddings

Pre-compute embeddings using either FASTA or PDB files, optionally specifying the protein language model:

From FASTA:

python -m seamoon precompute-from-fasta --input-files [path-to-fasta-or-list] --output-dir [output-directory] --emb-model [ProstT5|ESM]
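If the query sequences are not yet stored in FASTA format, a file can be written with a few lines of Python. This is only a sketch: the helper `write_fasta`, the identifier `query_1`, the sequence, and the file name are all made up for illustration, and whether SeaMoon expects one sequence per file is not stated here.

```python
import os
import tempfile

def write_fasta(records, path, width=60):
    """Write (identifier, sequence) pairs to `path` in FASTA format."""
    with open(path, "w") as handle:
        for identifier, sequence in records:
            handle.write(f">{identifier}\n")
            # Wrap long sequences at `width` characters per line.
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")

# Hypothetical query; identifier and sequence are placeholders.
records = [("query_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
fasta_path = os.path.join(tempfile.mkdtemp(), "query.fasta")
write_fasta(records, fasta_path)
```

The resulting path could then be passed to --input-files.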

From PDB:

python -m seamoon precompute-from-pdb --input-files [path-to-pdb-list] --output-dir [output-directory] --emb-model [ProstT5|ESM]

This mode allows you to specify a protein 3D structure that may then be used to orient the predicted motions (--torque-mode, see below).

From DANCE binaries and alignments (with ground truth to train the model):

python -m seamoon precompute-w-gt --prefixes [file-with-prefixes] --bin-dir [binary-dir] --aln-dir [alignment-dir] --output-dir [output-directory] --emb-model [ProstT5|ESM]

This mode allows you to generate ground-truth data from conformational collections, in addition to the pLM embeddings.

Training

python -m seamoon train --config-path [path-to-config-file]

Inference

python -m seamoon infer --model-path [path-to-model] --config-file [path-to-config] --list-path [path-to-list] --precomputed-path [path-to-precomputed-data] --output-path [output-directory] --batch-size [batch-size] --torque-mode [true|false] --device [cuda|cpu]

By default, the predicted motion tensors have arbitrary orientations. Set the --torque-mode option to true if you want to orient them with respect to a 3D structure. This orientation procedure produces four solutions that minimize the torque of the structure under the predicted motion.
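As a rough illustration of the quantity being minimized, the sketch below computes the net torque that a set of displacement vectors exerts on a structure. It assumes unit masses and takes the centroid as the reference point; both are assumptions made for this example. It does not reproduce SeaMoon's Wolfram-based search over orientations, and the names `cross` and `net_torque` are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def net_torque(coords, motion):
    """Sum of r_i x d_i, with r_i measured from the centroid (unit masses assumed)."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    total = [0.0, 0.0, 0.0]
    for p, d in zip(coords, motion):
        r = tuple(p[k] - centroid[k] for k in range(3))
        t = cross(r, d)
        for k in range(3):
            total[k] += t[k]
    return tuple(total)

# Toy two-atom structure (made-up numbers).
coords = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
rotation_like = [(0.0, -1.0, 0.0), (0.0, 1.0, 0.0)]    # spins the structure
translation_like = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]  # shifts it rigidly

torque_rotation = net_torque(coords, rotation_like)
torque_translation = net_torque(coords, translation_like)
```

In this toy case the rotation-like motion carries a non-zero net torque about the z axis, while the rigid shift carries none. An orientation with low torque is one in which the motion contains little rigid-body rotation.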

Evaluation

python -m seamoon evaluate --model-path [path-to-model] --config-file [path-to-config] --list-path [path-to-list] --precomputed-path [path-to-precomputed-data] --output-path [output-directory] --batch-size [batch-size] --torque-mode [true|false] --device [cuda|cpu]

By default, the predicted motion tensors are optimally aligned with the known ground-truth principal components before the errors are computed. Set the --torque-mode option to true if you want to compute the errors directly from the predictions oriented through torque minimisation.
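The sketch below shows one plausible form of a normalised error between a predicted motion and a ground-truth component: a sum of squared differences after scaling both to unit norm. It assumes the two motions are already expressed in the same frame, so it skips the optimal alignment step, and the exact normalisation used by SeaMoon may differ. The function names are hypothetical.

```python
import math

def unit_normalise(motion):
    """Scale a list of 3D vectors so that the whole set has unit norm."""
    norm = math.sqrt(sum(c * c for vec in motion for c in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise an all-zero motion")
    return [tuple(c / norm for c in vec) for vec in motion]

def sum_of_squares_error(prediction, ground_truth):
    """Sum of squared differences between two unit-normalised motions."""
    if len(prediction) != len(ground_truth):
        raise ValueError("motions need one vector per residue")
    p = unit_normalise(prediction)
    g = unit_normalise(ground_truth)
    return sum((a - b) ** 2 for pv, gv in zip(p, g) for a, b in zip(pv, gv))

# Toy two-residue motions (made-up numbers).
ground_truth = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
same_direction = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]  # same motion, other scale
orthogonal = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]      # zero overlap with ground truth

error_same = sum_of_squares_error(same_direction, ground_truth)
error_orthogonal = sum_of_squares_error(orthogonal, ground_truth)
```

With this definition the error ranges from 0 for identical directions to 4 for exactly opposite ones, and a motion orthogonal to the ground truth scores 2.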
