42 changes: 16 additions & 26 deletions README.md
@@ -11,8 +11,7 @@ Kim, D., Park, S., & Steinegger, M. (2024). Unicore enables scalable and accurat
## Table of Contents
- [Unicore](#unicore)
- [Quick Start with Conda](#quick-start-with-conda)
- [GPU acceleration with CUDA](#gpu-acceleration-with-cuda)
- [GPU acceleration with Foldseek-ProstT5 (beta)](#gpu-acceleration-with-foldseek-prostt5-beta)
- [GPU acceleration with Foldseek-ProstT5](#gpu-acceleration-with-foldseek-prostt5)
- [Tutorial](#tutorial)
- [Manual](#manual)
- [Input](#input)
@@ -29,24 +28,16 @@ conda install -c bioconda unicore
unicore -v
```

### GPU acceleration with CUDA
The `createdb` module can be greatly accelerated with ProstT5-GPU.
If you have a Linux machine with CUDA-compatible GPU, please install this additional package:
```
conda install -c conda-forge pytorch-gpu
```
### GPU acceleration with Foldseek-ProstT5
Foldseek features GPU acceleration for ProstT5 prediction under the following requirements:
* Turing or newer NVIDIA GPU
* `foldseek` ≥10
* `glibc` ≥2.17
* `nvidia-driver` ≥525.60.13

### GPU acceleration with Foldseek-ProstT5 (beta)
> Note. This feature is under development and may not work in some environments. We will provide an update after the stable release of Foldseek-ProstT5.

Foldseek provides a GPU-compatible static binary for ProstT5 prediction (requires Linux with AVX2 support, `glibc` ≥2.29, and `nvidia-driver` ≥525.60.13).<br>
To use it, install it by running the following command:
```
wget https://mmseqs.com/foldseek/foldseek-linux-gpu.tar.gz; tar xvfz foldseek-linux-gpu.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
```
Then, add the `--use-foldseek` and `--gpu` options to either the `easy-core` or `createdb` module to use the Foldseek implementation of ProstT5-GPU:
```
unicore easy-core --use-foldseek --gpu <INPUT> <OUTPUT> <MODEL> <TMP>
```
Apply the `--gpu` option to either the `easy-core` or `createdb` module to use it, e.g.
```
unicore easy-core --gpu <INPUT> <OUTPUT> <MODEL> <TMP>
```
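Before enabling `--gpu`, you can quickly check the requirements above (a sketch assuming `nvidia-smi` and `ldd` are available on the machine):
```
foldseek version                                                    # should report ≥10
ldd --version | head -n1                                            # glibc ≥2.17
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader    # Turing or newer GPU, driver ≥525.60.13
```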

<hr>
@@ -61,7 +52,7 @@ unzip unicore_example.zip
If you cloned the repository, you can find the example dataset in the `example/data` folder.

### Download ProstT5 weights
You need to first download the ProstT5 weights to run the `createdb` module.
You can download the ProstT5 weights required by the `createdb` module ahead of time.
```
foldseek databases ProstT5 weights tmp
```
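After the download finishes, the `weights` directory should contain the GGUF model file that `createdb` checks for (see the weight check in `src/modules/createdb.rs` below); a quick way to verify:
```
ls weights/    # expected to include prostt5-f16.gguf
```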
@@ -142,13 +133,14 @@ This module runs much faster with GPU. Please install `cuda` for GPU acceleratio

To run the module, please use the following command:
```
# Download the ProstT5 weights as below if you haven't already
# foldseek databases ProstT5 /path/to/prostt5/weights tmp
unicore createdb data db/proteome_db /path/to/prostt5/weights
```
This will create a Foldseek database in the `db` folder.

If you have Foldseek installed with CUDA support, you can run ProstT5 within the module through Foldseek by adding the `--use-foldseek` option.
To select specific GPU devices, set the `CUDA_VISIBLE_DEVICES` environment variable:

* `CUDA_VISIBLE_DEVICES=0` to use GPU 0.
* `CUDA_VISIBLE_DEVICES=0,1` to use GPUs 0 and 1.
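For example, to run `createdb` on the first GPU only (a sketch combining the options above; adjust the paths to your setup):
```
CUDA_VISIBLE_DEVICES=0 unicore createdb --gpu data db/proteome_db /path/to/prostt5/weights
```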

#### cluster
`cluster` module takes a `createdb` output database, runs Foldseek clustering, and outputs the cluster results.
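A minimal sketch of an invocation, assuming `cluster` follows the `<INPUT_DB> <OUTPUT> <TMP>` argument pattern of the other modules (hypothetical paths; check `unicore cluster --help` for the exact interface):
```
unicore cluster db/proteome_db out/clu tmp
```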
@@ -217,11 +209,9 @@ unicore gene-tree --realign --threshold 30 --name /path/to/hashed/gene/names tre
## Build from Source
### Minimum requirements
* [Cargo](https://www.rust-lang.org/tools/install) (Rust)
* [Foldseek](https://foldseek.com) (version ≥ 9)
* [Foldseek](https://foldseek.com) (version ≥ 10)
* [Foldmason](https://foldmason.foldseek.com)
* [IQ-TREE](http://www.iqtree.org/)
* pytorch, transformers, sentencepiece, protobuf
- These are required for users who cannot build foldseek with CUDA. Please install them with `pip install torch transformers sentencepiece protobuf`.
### Optional requirements
* [MAFFT](https://mafft.cbrc.jp/alignment/software/)
* [Fasttree](http://www.microbesonline.org/fasttree/) or [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/)
@@ -240,5 +230,5 @@ With these tools installed, you can install and run `unicore` by:
git clone https://github.com/steineggerlab/unicore.git
cd unicore
cargo build --release
bin/unicore help
bin/unicore -v
```
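Optionally, make the freshly built binary available on your `PATH` (a sketch; it assumes you are still inside the cloned `unicore` directory):
```
export PATH=$(pwd)/bin/:$PATH
unicore -v
```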
9 changes: 0 additions & 9 deletions src/envs/variables.rs
@@ -74,15 +74,6 @@ pub fn locate_path_cfg() -> String {
err::error(err::ERR_GENERAL, Some("Could not locate path.cfg".to_string()));
}
}
pub fn locate_encoder_py() -> String {
if File::open(format!("{}{}etc{}predict_3Di_encoderOnly.py", parent_dir(), SEP, SEP)).is_ok() {
format!("{}{}etc{}predict_3Di_encoderOnly.py", parent_dir(), SEP, SEP)
} else if File::open(format!("{}{}src{}py{}predict_3Di_encoderOnly.py", src_parent_dir(), SEP, SEP, SEP)).is_ok() {
format!("{}{}src{}py{}predict_3Di_encoderOnly.py", src_parent_dir(), SEP, SEP, SEP)
} else {
err::error(err::ERR_GENERAL, Some("Could not locate path.cfg".to_string()));
}
}

// binary paths
pub const VALID_BINARY: [&str; 8] = [
118 changes: 26 additions & 92 deletions src/modules/createdb.rs
@@ -26,18 +26,11 @@ pub fn run(args: &Args, bin: &var::BinaryPaths) -> Result<(), Box<dyn std::error
let overwrite = args.createdb_overwrite.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - overwrite".to_string())); });
let max_len = args.createdb_max_len.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - max_len".to_string())); });
let gpu = args.createdb_gpu.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - gpu".to_string())); });
let use_python = args.createdb_use_python.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - use_python".to_string())); });
let use_foldseek = args.createdb_use_foldseek.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - use_foldseek".to_string())); });
let afdb_lookup = args.createdb_afdb_lookup.unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - afdb_lookup".to_string())); });
let afdb_local = args.createdb_afdb_local.clone().unwrap_or_else(|| { err::error(err::ERR_ARGPARSE, Some("createdb - afdb_local".to_string())); });
let threads = crate::envs::variables::threads();
let foldseek_verbosity = (match var::verbosity() { 4 => 3, 3 => 2, _ => var::verbosity() }).to_string();

// Either use_foldseek or use_python must be true
if !use_foldseek && !use_python {
err::error(err::ERR_ARGPARSE, Some("Either use_foldseek or use_python must be true".to_string()));
}

// Check afdb_lookup
let afdb_local = if afdb_lookup && !afdb_local.is_some() {
err::error(err::ERR_ARGPARSE, Some("afdb-lookup is provided but afdb-local is not given".to_string()));
@@ -135,44 +128,37 @@ pub fn run(args: &Args, bin: &var::BinaryPaths) -> Result<(), Box<dyn std::error
fasta::write_fasta(&combined_aa, &fasta_data)?;
}

if use_foldseek {
// Added use_foldseek temporarily.
// TODO: Remove use_foldseek when foldseek is ready
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
_none => { err::error(err::ERR_BINARY_NOT_FOUND, Some("foldseek".to_string())); }
};

// Check if old weights exist
if Path::new(&model).join("cnn.safetensors").exists() || Path::new(&model).join(format!("model{}cnn.safetensors", SEP)).exists() {
err::error(err::ERR_GENERAL, Some("Old weight files detected from the given path. Please provide different path for the model weights".to_string()));
}
// Check if weights exist
if !Path::new(&model).join("prostt5-f16.gguf").exists() {
// Download the model
std::fs::create_dir_all(format!("{}{}tmp", model, SEP))?;
let mut cmd = std::process::Command::new(foldseek_path);
let mut cmd = cmd
.arg("databases").arg("ProstT5").arg(&model).arg(format!("{}{}tmp", model, SEP)).arg("--threads").arg(threads.to_string());
cmd::run(&mut cmd);
}
// Use foldseek to create the database
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
_none => { err::error(err::ERR_BINARY_NOT_FOUND, Some("foldseek".to_string())); }
};

// Run foldseek createdb
// Check if old weights exist
if Path::new(&model).join("cnn.safetensors").exists() || Path::new(&model).join(format!("model{}cnn.safetensors", SEP)).exists() {
err::error(err::ERR_GENERAL, Some("Old weight files detected from the given path. Please provide different path for the model weights".to_string()));
}
// Check if weights exist
if !Path::new(&model).join("prostt5-f16.gguf").exists() {
// Download the model
std::fs::create_dir_all(format!("{}{}tmp", model, SEP))?;
let mut cmd = std::process::Command::new(foldseek_path);
let cmd = cmd
.arg("createdb").arg(&combined_aa).arg(&output)
.arg("--prostt5-model").arg(&model)
.arg("--threads").arg(threads.to_string());
let mut cmd = if gpu {
cmd.arg("--gpu").arg("1")
} else { cmd };
let mut cmd = cmd
.arg("databases").arg("ProstT5").arg(&model).arg(format!("{}{}tmp", model, SEP)).arg("--threads").arg(threads.to_string());
cmd::run(&mut cmd);
} else if use_python {
let _ = _run_python(&combined_aa, &curr_dir, &parent, &output, &model, keep, bin, threads.to_string());
} else {
err::error(err::ERR_GENERAL, Some("Either use_foldseek or use_python must be true".to_string()));
}

// Run foldseek createdb
let mut cmd = std::process::Command::new(foldseek_path);
let cmd = cmd
.arg("createdb").arg(&combined_aa).arg(&output)
.arg("--prostt5-model").arg(&model)
.arg("--threads").arg(threads.to_string());
let mut cmd = if gpu {
cmd.arg("--gpu").arg("1")
} else { cmd };
cmd::run(&mut cmd);

if afdb_lookup {
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
@@ -221,57 +207,5 @@ pub fn run(args: &Args, bin: &var::BinaryPaths) -> Result<(), Box<dyn std::error
chkpnt::write_checkpoint(&checkpoint_file, "1")?;


Ok(())
}

fn _run_python(combined_aa: &String, curr_dir: &str, parent: &str, output: &str, model: &str, keep: bool, bin: &crate::envs::variables::BinaryPaths, threads: String) -> Result<(), Box<dyn std::error::Error>> {
let input_3di = format!("{}{}{}{}combined_3di.fasta", curr_dir, SEP, parent, SEP);
let inter_prob = format!("{}{}{}{}output_probabilities.csv", curr_dir, SEP, parent, SEP);
let output_3di = format!("{}{}{}_ss", curr_dir, SEP, output);
let foldseek_verbosity = (match var::verbosity() { 4 => 3, 3 => 2, _ => var::verbosity() }).to_string();

// Run python script
let mut cmd = std::process::Command::new("python");
let mut cmd = cmd
.arg(var::locate_encoder_py())
.arg("-i").arg(&combined_aa)
.arg("-o").arg(&input_3di)
.arg("--model").arg(&model)
.arg("--half").arg("0")
.arg("--threads").arg(threads);
cmd::run(&mut cmd);

// Build foldseek db
let foldseek_path = match &bin.get("foldseek") {
Some(bin) => &bin.path,
_none => { err::error(err::ERR_BINARY_NOT_FOUND, Some("foldseek".to_string())); }
};
let mut cmd = std::process::Command::new(foldseek_path);
let mut cmd = cmd
.arg("base:createdb").arg(&combined_aa).arg(&output)
.arg("--shuffle").arg("0")
.arg("-v").arg(foldseek_verbosity.as_str());

cmd::run(&mut cmd);

// Build foldseek 3di db
let mut cmd = std::process::Command::new(foldseek_path);
let mut cmd = cmd
.arg("base:createdb").arg(&input_3di).arg(&output_3di)
.arg("--shuffle").arg("0")
.arg("-v").arg(foldseek_verbosity.as_str());
cmd::run(&mut cmd);

// Delete intermediate files
if !keep {
// std::fs::remove_file(mapping_file)?;
// std::fs::remove_file(combined_aa)?;
std::fs::remove_file(input_3di)?;
std::fs::remove_file(inter_prob)?;
}

// // Write the checkpoint file
// chkpnt::write_checkpoint(&format!("{}/createdb.chk", parent), "1")?;

Ok(())
}