Adding multi file prediction with Array Job #1
# Protein Folding on GPU
This directory provides an example workflow for running **Boltz-based protein folding using GPU resources**. It is designed for environments with GPU access, offering reproducible and accessible protein structure prediction.
---
This workflow executes a protein structure prediction pipeline on GPU using the **Boltz** framework. It demonstrates:

- Running **ColabFold** search locally on the Kempner Cluster
- Using the generated MSA file (`.a3m` extension) as input to **Boltz** for structure prediction
---
- Python ≥ 3.10
- CUDA and cuDNN libraries
- **Boltz** library
- **ColabFold**
- **Boltz database**
- **ColabFold database**

> **Note:** All of these are pre-installed on the Kempner Cluster.
> Installation in your own space is optional.

---
## Input Format

Create an input FASTA file.
**Important:** Currently, the pipeline supports only FASTA format.

**Example:**
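A minimal input file might look like the following; the `>example_protein` header line is a hypothetical name, while the sequence is the fragment visible in the surrounding diff context:

```
>example_protein
QLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDSCVANKIKDEFFAMISISAIVKAAQKK
```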
## Running the Workflow
Open the file `boltz_single_pipeline_gpu.slrm` and define the variable with the correct input FASTA filename, along with the GPU specification:

```
INPUT_FASTA="input.fa"
export CUDA_VISIBLE_DEVICES=0
```
To submit the Slurm batch job:

```
sbatch boltz_single_pipeline_gpu.slrm
```
Update the SLURM script to adjust job resources (e.g., GPU, CPU cores, memory) as needed. You also need to add the partition name and account name.
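For instance, the relevant directives in the script might look like the fragment below; the partition and account names here are placeholders for your own allocation, not values from this repository:

```
#SBATCH --partition=<your_partition>
#SBATCH --account=<your_account>
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64GB
```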
---
### Generating Boltz Predictions on Multiple Fasta Files
To generate Boltz predictions on the Kempner cluster, run the following command from the login node:
```bash
source slrm_scripts/multi_pred.sh INPUT_DIR N OUT_DIR
```
The script will:
- Divide the files in the input directory into N sets, generating a `.txt` file (one per set) that lists the path to each `.fasta`
- Create an `out_dir/chunks_timestamp/` directory where the predictions will be stored
- Start N jobs, each launching the script `slrm_scripts/single_prediction.slrm` (you can adjust the resources of each job by modifying this script)
> **Suggested change (review):** launch `./single_prediction_array.slrm` instead of `slrm_scripts/single_prediction.slrm`.
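The splitting step described above can be sketched roughly as follows. `split_into_chunks` is a hypothetical helper, not the actual script, and it assumes a round-robin assignment of `.fasta` files to chunk lists:

```shell
# Hypothetical sketch of the chunking step: distribute the .fasta files in
# input_dir round-robin into n chunk list files under out_dir/chunks_<timestamp>/.
split_into_chunks() {
  local input_dir=$1 n=$2 out_dir=$3
  local chunk_dir="$out_dir/chunks_$(date +%s)"
  mkdir -p "$chunk_dir"
  local i=0 f
  for f in "$input_dir"/*.fasta; do
    # File i goes into list (i mod n), so the n lists stay balanced.
    echo "$f" >> "$chunk_dir/chunk_$((i % n)).txt"
    i=$((i + 1))
  done
  # Print the chunk directory so a caller could pass it on to sbatch.
  echo "$chunk_dir"
}
```

Each chunk list could then be handed to one Slurm job (or one array task) for prediction.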
New file (excerpt):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=1
#SBATCH --mem=256GB
#SBATCH --partition=kempner_requeue
#SBATCH --account=kempner_bsabatini_lab
#SBATCH --time=4:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=thomasbush52@gmail.com
# Use array-aware log names to avoid clobbering:
#SBATCH --output=/n/home06/tbush/job_logs/%x.%A_%a.out

set -euo pipefail
```
Comment on lines +11 to +17, suggested change:

```diff
-#SBATCH --mail-user=thomasbush52@gmail.com
+#SBATCH --mail-user=${MAIL_USER}
 # Use array-aware log names to avoid clobbering:
 #SBATCH --output=/n/home06/tbush/job_logs/%x.%A_%a.out
+# Set MAIL_USER environment variable to your email before submitting, e.g.:
+# export MAIL_USER=your.email@domain.com
 set -euo pipefail
+# Ensure MAIL_USER is set
+: "${MAIL_USER:?MAIL_USER environment variable not set. Please set it to your email address before submitting.}"
```
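Inside an array job, each task typically resolves its own chunk list from `SLURM_ARRAY_TASK_ID`. A minimal sketch of that pattern follows; the `CHUNK_DIR`/`CHUNK_FILE` variable names are assumptions for illustration, not taken from the actual script:

```shell
# Sketch: resolve this array task's chunk list file.
# SLURM_ARRAY_TASK_ID is set by Slurm inside an array job; we default it
# to 0 so the snippet can also be exercised outside Slurm.
SLURM_ARRAY_TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
CHUNK_DIR="${CHUNK_DIR:-/tmp/chunks}"
CHUNK_FILE="$CHUNK_DIR/chunk_${SLURM_ARRAY_TASK_ID}.txt"
echo "Task $SLURM_ARRAY_TASK_ID will read $CHUNK_FILE"
```

A driver script would then submit one task per chunk, e.g. `sbatch --array=0-$((N-1)) <script>.slrm` (hypothetical invocation).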
New file (excerpt):

```bash
#!/bin/bash
set -euo pipefail

# Usage: ./split_and_submit.sh INPUT_DIR N OUTPUT_PARENT_DIR
# Example: ./split_and_submit.sh /data/images 5 /data/jobs
```
Comment on lines +4 to +5, suggested change:

```diff
-# Usage: ./split_and_submit.sh INPUT_DIR N OUTPUT_PARENT_DIR
-# Example: ./split_and_submit.sh /data/images 5 /data/jobs
+# Usage: ./split_and_pred.sh INPUT_DIR N OUTPUT_PARENT_DIR
+# Example: ./split_and_pred.sh /data/fasta_files 5 /data/jobs
```
The documentation references 'slrm_scripts/multi_pred.sh' but the actual script is named 'split_and_pred.sh' and located in the current directory, not in a 'slrm_scripts' subdirectory.