PhyloSofS-Team/seamoon


SeaMoon: Prediction of Molecular Motions Based on Language Models

SeaMoon is a deep learning framework that predicts protein motions from amino acid sequences. It leverages embeddings from protein language models (pLMs), such as the sequence-only ESM-2 (Lin et al. 2022), the multimodal ESM3 (Hayes et al. 2024), or the sequence-structure bilingual ProstT5 (Heinzinger et al. 2023). Given a query protein sequence, SeaMoon outputs sets of 3D displacement vectors, one vector per C-alpha atom, within an invariant subspace. Each set can be interpreted as a linear motion.
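To illustrate what "linear motion" means here, the following minimal sketch deforms a set of C-alpha coordinates along one set of displacement vectors. It is not part of SeaMoon: the function names, the array layout (one row per residue), and the amplitude parameter are assumptions made for illustration only.

```python
import numpy as np


def apply_linear_motion(ca_coords, displacements, amplitude):
    """Move each C-alpha atom along its displacement vector.

    ca_coords and displacements are assumed to be (L, 3) arrays,
    one row per residue; amplitude scales the whole motion.
    """
    ca_coords = np.asarray(ca_coords, dtype=float)
    displacements = np.asarray(displacements, dtype=float)
    if ca_coords.shape != displacements.shape or ca_coords.shape[-1] != 3:
        raise ValueError("expected two arrays of identical shape (L, 3)")
    return ca_coords + amplitude * displacements


def interpolate_motion(ca_coords, displacements, max_amplitude, n_frames=5):
    """Return frames sampled evenly from -max_amplitude to +max_amplitude."""
    amplitudes = np.linspace(-max_amplitude, max_amplitude, n_frames)
    return np.stack(
        [apply_linear_motion(ca_coords, displacements, a) for a in amplitudes]
    )
```

With an odd number of frames, the middle frame corresponds to the undeformed input coordinates.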

Quick Start

Setup Environment

  1. Create a new conda environment and activate it:

    conda create --name seamoon python=3.11.9
    conda activate seamoon
  2. Install dependencies:

    pip install -r requirements.txt
  3. If you wish to use --torque-mode (see below) during inference or evaluation, you will need a working version of the Wolfram Engine. Make sure to specify the path to your WolframKernel at line 30 of eval.py. We used Wolfram Engine v14.0.

Test Run

A small test dataset of 100 input samples is included in data_set to validate all main functions. If you wish to generate ground truth data and pre-compute embeddings (ProstT5 by default) for all of them, you can use:

python -m seamoon precompute-w-gt

If you wish to skip pre-computing, pre-computed data for 10 input samples are provided in data_set/training_data. You can launch SeaMoon inference (infer) and prediction evaluation (evaluate) directly on them.

  • Infer -- predict motion tensors (3 by default) from the input embeddings:

    python -m seamoon infer
  • Evaluate -- optimally align all predictions against all ground-truth principal components and compute the normalised errors:

    python -m seamoon evaluate

The full dataset from the paper can be downloaded here.

Usage

Pre-compute Embeddings

Pre-compute embeddings using either FASTA or PDB files, optionally specifying the protein language model:

  • From FASTA:

    python -m seamoon precompute-from-fasta --input-files [path-to-fasta-or-list] --output-dir [output-directory] --emb-model [ProstT5|ESM]
  • From PDB:

    python -m seamoon precompute-from-pdb --input-files [path-to-pdb-list] --output-dir [output-directory] --emb-model [ProstT5|ESM]

This mode allows you to specify a protein 3D structure, which may then be used to orient the predicted motions (--torque-mode, see below).

  • From DANCE binaries and alignments (with ground truth to train the model):

    python -m seamoon precompute-w-gt --prefixes [file-with-prefixes] --bin-dir [binary-dir] --aln-dir [alignment-dir] --output-dir [output-directory] --emb-model [ProstT5|ESM]

This mode allows you to generate ground-truth data from conformational collections, in addition to the pLM embeddings.
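If you drive the pre-computation from a Python script, the three commands above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of SeaMoon. It only uses the subcommands and options listed above, and the file names in the usage note are placeholders.

```python
import sys

# Path options accepted by each subcommand, as listed in this README.
PATH_OPTIONS = {
    "precompute-from-fasta": ("input_files",),
    "precompute-from-pdb": ("input_files",),
    "precompute-w-gt": ("prefixes", "bin_dir", "aln_dir"),
}


def build_precompute_command(subcommand, output_dir, emb_model="ProstT5", **paths):
    """Return the argument list for one of the precompute subcommands."""
    if subcommand not in PATH_OPTIONS:
        raise ValueError(f"unknown subcommand: {subcommand}")
    if emb_model not in ("ProstT5", "ESM"):
        raise ValueError(f"unknown embedding model: {emb_model}")
    expected = PATH_OPTIONS[subcommand]
    if set(paths) != set(expected):
        raise ValueError(f"{subcommand} expects exactly: {', '.join(expected)}")
    cmd = [sys.executable, "-m", "seamoon", subcommand]
    for name in expected:
        # input_files -> --input-files, bin_dir -> --bin-dir, and so on
        cmd += ["--" + name.replace("_", "-"), str(paths[name])]
    cmd += ["--output-dir", str(output_dir), "--emb-model", emb_model]
    return cmd
```

For example, `build_precompute_command("precompute-from-fasta", "out", input_files="query.fasta")` returns a list that can be passed to `subprocess.run(cmd, check=True)` from the repository root.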

Training

python -m seamoon train --config-path [path-to-config-file]

Inference

python -m seamoon infer --model-path [path-to-model] --config-file [path-to-config] --list-path [path-to-list] --precomputed-path [path-to-precomputed-data] --output-path [output-directory] --batch-size [batch-size] --torque-mode [true|false] --device [cuda|cpu]

By default, the predicted motion tensors will have arbitrary orientations. Set the --torque-mode option to True if you want to align them with respect to a 3D structure. This orientation procedure will produce four solutions that minimize the torque of the structure under the predicted motion.
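For intuition about the quantity being minimised, the sketch below computes the net torque that a set of displacement vectors exerts on a structure. It is only an illustration: it assumes unit masses and takes the centroid of the C-alpha atoms as the reference point, and it does not reproduce the orientation procedure that SeaMoon runs through the Wolfram Engine.

```python
import numpy as np


def net_torque(ca_coords, displacements):
    """Sum of r_i x v_i over all residues, with r_i measured from the centroid.

    Assumes unit masses and (L, 3) arrays, one row per residue.
    """
    ca_coords = np.asarray(ca_coords, dtype=float)
    displacements = np.asarray(displacements, dtype=float)
    if ca_coords.shape != displacements.shape or ca_coords.shape[-1] != 3:
        raise ValueError("expected two arrays of identical shape (L, 3)")
    centred = ca_coords - ca_coords.mean(axis=0)
    return np.cross(centred, displacements).sum(axis=0)
```

Under these assumptions, a uniform translation of all atoms yields zero net torque, whereas a rigid rotation about the centroid does not.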

Evaluation

python -m seamoon evaluate --model-path [path-to-model] --config-file [path-to-config] --list-path [path-to-list] --precomputed-path [path-to-precomputed-data] --output-path [output-directory] --batch-size [batch-size] --torque-mode [true|false] --device [cuda|cpu]

By default, the predicted motion tensors will be optimally aligned with the known ground-truth principal components prior to computing the errors. Set the --torque-mode option to True if you want to compute the errors directly from the predictions oriented through torque minimisation.
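The default alignment step can be pictured as finding the rotation that best superimposes a predicted set of displacement vectors onto a ground-truth one. The sketch below uses the Kabsch algorithm (SVD-based optimal rotation) and a sum-of-squares error on unit-norm inputs. Both choices are assumptions made for illustration, and the exact alignment and normalisation implemented in SeaMoon may differ.

```python
import numpy as np


def optimal_rotation(pred, truth):
    """Rotation R (3x3) minimising sum_i ||R pred_i - truth_i||^2 (Kabsch)."""
    h = pred.T @ truth                      # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def aligned_error(pred, truth):
    """Sum-of-squares error after scaling to unit norm and rotating pred.

    pred and truth are assumed to be (L, 3) arrays, one row per residue.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if pred.shape != truth.shape or pred.shape[-1] != 3:
        raise ValueError("expected two arrays of identical shape (L, 3)")
    pred_norm, truth_norm = np.linalg.norm(pred), np.linalg.norm(truth)
    if pred_norm == 0 or truth_norm == 0:
        raise ValueError("displacement sets must be non-zero")
    pred, truth = pred / pred_norm, truth / truth_norm
    rot = optimal_rotation(pred, truth)
    return float(np.sum((pred @ rot.T - truth) ** 2))
```

A prediction that differs from the ground truth only by a global rotation and a scale factor gets an error of (numerically) zero under this definition.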
