Commit

blast workflow

lz100 committed Jul 31, 2023
1 parent 3f17818 commit d7759a2
Showing 17 changed files with 3,417 additions and 138 deletions.
2 changes: 2 additions & 0 deletions inst/extdata/workflows/SPblast/.batchtools.conf.R
@@ -0,0 +1,2 @@
cluster.functions <- makeClusterFunctionsSlurm(template="batchtools.slurm.tmpl")

292 changes: 292 additions & 0 deletions inst/extdata/workflows/SPblast/SPblast.Rmd
@@ -0,0 +1,292 @@
---
title: "NCBI BLAST"
author: "Author: FirstName LastName"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
BiocStyle::html_document:
toc_float: true
code_folding: show
BiocStyle::pdf_document: default
package: systemPipeR
vignette: |
%\VignetteIndexEntry{NCBI BLAST Workflow Template}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
fontsize: 14pt
bibliography: bibtex.bib
editor_options:
chunk_output_type: console
---

<!--
Config css and r style
-->

```{css, echo=FALSE}
pre code {
white-space: pre !important;
overflow-x: scroll !important;
word-break: keep-all !important;
word-wrap: initial !important;
}
```

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=60, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
tidy.opts=list(width.cutoff=60), tidy=TRUE)
```

```{r setup, echo=FALSE, message=FALSE, warning=FALSE, eval=FALSE}
suppressPackageStartupMessages({
library(systemPipeR)
})
```

# About the template
This section provides a general description of this BLAST workflow and explains how to use it. In the actual analysis report, this
section is usually **removed**.

This BLAST workflow template is based on [rBLAST](https://github.com/mhahsler/rBLAST), an R package that wraps the NCBI
[BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) software.

- The BLAST+ software can be downloaded from [NCBI](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).
Please make sure it can be run from the command line; a quick check is sketched below.
- rBLAST can be installed with `install.packages('rBLAST', repos = 'https://mhahsler.r-universe.dev')`.
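
Before building the workflow, you may want to confirm from R that the BLAST+ executables are on the `PATH`. A minimal sketch using only base R (`blastn` is the standard BLAST+ nucleotide search program):

```{r check_blast_path, eval=FALSE}
# Sys.which() returns the full path of an executable, or "" if it is not found
blastn_path <- Sys.which("blastn")
if (!nzchar(blastn_path)) {
    stop("blastn not found on PATH; install BLAST+ or load the appropriate module")
}
blastn_path
```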

This workflow does the following:
1. Validate the BLAST installation.
2. BLAST an input fasta file against a reference genome.
3. BLAST the input against a selected database.
4. BLAST sequences against general databases to find the source organism(s).

All steps are written in R (`LineWise`), but _BLAST+_ must be installed.

![](results/plotwf_spblast.jpg)

# Introduction

Users want to provide here background information about the design of their
BLAST project.

This report describes the analysis of a BLAST project studying ...

## Experimental design

Typically, users want to specify here all information relevant for the
analysis of their BLAST study. This includes detailed descriptions of
files, experimental design, reference genome, gene annotations,
etc.

# Workflow environment

_`systemPipeR`_ workflows can be designed and built from start to finish with a
single command, importing from an R Markdown file or stepwise in interactive
mode from the R console.

This tutorial will demonstrate how to build the workflow in an interactive mode,
appending each step. The workflow is constructed by connecting each step via the
`appendStep` method. Each `SYSargsList` instance contains the instructions needed
for processing a set of input files with a specific command-line or R software
and the paths to the corresponding outfiles generated by a particular tool/step.

To create a Workflow within _`systemPipeR`_, we can start by defining an empty
container and checking the directory structure:

```{r create_workflow, message=FALSE, eval=FALSE}
library(systemPipeR)
sal <- SPRproject()
sal
```
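
If a workflow project already exists in the current directory, it can be resumed instead of recreated; the same call is used in the *Tools used* section below:

```{r resume_project, eval=FALSE}
# Restore the existing workflow container from the .SPRproject/ directory
sal <- SPRproject(resume = TRUE)
```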

## Load packages

Next, the required packages are loaded. Refer to our
[website](https://systempipe.org/sp/spr/spr_run/) for how to
add more steps. If you prefer a different starting point,
[read this page](https://systempipe.org/sp/spr/templates/) for other
pre-configured templates.

```{r load_packages, eval=FALSE, spr=TRUE}
cat(crayon::blue$bold("To use this workflow, the following R packages are expected:\n"))
cat(c("'rBLAST", "readr'\n"), sep = "', '")
###pre-end
appendStep(sal) <- LineWise(
code = {
library(systemPipeR)
library(rBLAST)
},
step_name = "load_packages"
)
```

## Test BLAST install
This step verifies that _BLAST+_ is installed and can be called from the command line.
```{r test_blast, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
code = {
# If your system uses environment modules, load BLAST with the following line
moduleload("ncbi-blast")
# If not, comment out the line above; install BLAST+ and make sure it is on the PATH
blast_check <- tryCMD("blastn", silent = TRUE)
if(blast_check == "error") stop("Check your BLAST installation path.")
},
step_name = "test_blast",
dependency = "load_packages"
)
```

## Load query sequence
Load query sequence from a `fasta` file.

In this template, an example fasta file is provided, containing 10 sequences, two each from Arabidopsis, cholera, human,
mouse, and COVID-19.
```{r load_query, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
code = {
query <- readDNAStringSet('data/example.fasta')
},
step_name = "load_query",
dependency = "test_blast"
)
```
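
After loading, it can help to inspect the query set interactively. A small sketch; `readDNAStringSet` and the accessors below come from `Biostrings`, which is loaded with `rBLAST`:

```{r inspect_query, eval=FALSE}
length(query) # number of sequences, 10 in the example fasta
names(query)  # sequence identifiers
width(query)  # sequence lengths in bp
```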


## BLAST against reference genome
In this step, we BLAST the query sequences against a reference genome to see whether the genome contains all or part of each sequence.

In this example, a minimized `tair10` genome is used. In a real analysis, please replace it with a
full genome `fasta` file.

```{r build_genome_db, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
code = {
reference <- 'data/tair10.fasta'
# this command prepares a BLAST-able database from the genome
makeblastdb(reference, dbtype='nucl')
},
step_name = "build_genome_db",
dependency = "load_query"
)
```
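
`makeblastdb` writes the database index files next to the input fasta. A quick, optional sanity check (the exact extensions, e.g. `.nhr`, `.nin`, `.nsq`, depend on your BLAST+ version):

```{r check_db_files, eval=FALSE}
# List the files generated for the nucleotide database
list.files(dirname(reference), pattern = basename(reference))
```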

Next, BLAST is performed. Since there are only two Arabidopsis sequences in the example `fasta`, only these two are expected to return statistically meaningful BLAST results.

```{r blast_genome, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
code = {
bl_tair10 <- blast(db = reference, type='blastn')
cl_tair10 <- predict(bl_tair10, query)
readr::write_csv(cl_tair10, "results/blast_tair10.csv")
},
step_name = "blast_genome",
dependency = "build_genome_db"
)
```
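
The hit table returned by `predict` can be filtered before saving. A sketch; the column names `pident` and `evalue` follow the BLAST tabular format used by recent rBLAST releases, while older versions name them `Perc.Ident` and `E`:

```{r filter_hits, eval=FALSE}
# Keep only strong hits; the thresholds here are illustrative, not recommendations
strong_hits <- subset(cl_tair10, pident > 90 & evalue < 1e-10)
strong_hits[order(strong_hits$evalue), ]
```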


## BLAST existing databases
There are plenty of databases on [NCBI](https://ftp.ncbi.nlm.nih.gov/blast/db/) that one can
download and BLAST against. Once a database is downloaded, unzip all of its files into one directory and provide the path to that database.

In this example, we want to know whether COVID-19 is a beta coronavirus, so we BLAST COVID sequences against all other existing beta coronavirus sequences and examine the similarity. This resource is downloadable from NCBI. Here, all downloaded
`Betacoronavirus.XX.tar.gz` files are unzipped to `/srv/projects/db/ncbi/preformatted/20220131/`. Please change the path according to your project. Then, we can BLAST the last two sequences against
the database.


```{r blast_db, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
code = {
bl_covid <- blast(
db = '/srv/projects/db/ncbi/preformatted/20220131/Betacoronavirus',
type='blastn'
)
cl_covid <- predict(bl_covid, query[9:10])
readr::write_csv(cl_covid, "results/blast_covid.csv")
},
step_name = "blast_db",
dependency = "load_query"
)
```
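
Extra BLAST+ command-line switches can be passed through to the search. A hedged sketch, assuming your installed rBLAST version's `predict` method accepts a `BLAST_args` string (check the rBLAST documentation):

```{r blast_db_args, eval=FALSE}
# Tighten the e-value cutoff and cap the number of reported hits per query
cl_covid_strict <- predict(bl_covid, query[9:10],
                           BLAST_args = "-evalue 1e-20 -max_target_seqs 5")
```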

## BLAST to general databases
Sometimes we do not know the origin of a sequence (for example, one from a contaminated sample) and want to identify its source. In such cases, we need to BLAST the sequence against a more generic database. The most generic nucleotide BLAST database is the `nt` database.

This database is extremely large and requires substantial RAM, CPU cores, and disk space to search. Please do not run the following example unless your system administrator has provided such storage and computational resources. For the average user, a better option is the NCBI website
https://blast.ncbi.nlm.nih.gov/Blast.cgi . The search engine there is optimized and can quickly
return the species information.


```{r blast_nt, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
code = {
bl_nt <- blast(db = '/srv/projects/db/ncbi/preformatted/20220131/nt', type='blastn')
cl_nt <- predict(bl_nt, query[5])
readr::write_csv(cl_nt, "results/blast_nt.csv")
},
step_name = "blast_nt",
dependency = "load_query",
run_step = "optional"
)
```


## Workflow session

```{r wf_session, eval=FALSE, spr=TRUE}
appendStep(sal) <- LineWise(
code = {
sessionInfo()
},
step_name = "wf_session",
dependency = "blast_db")
```

# Manage the workflow

To run the workflow, use the `runWF` function. It executes all the steps stored in
the workflow container. By default, the execution runs on a single machine without
submitting to the queuing system of a computer cluster.

```{r runWF, eval=FALSE}
sal <- runWF(sal, run_step = "mandatory") # remove `run_step` to run all steps, including optional ones
```
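
This template also ships a `.batchtools.conf.R` and a `batchtools.slurm.tmpl` for cluster execution. A hedged sketch of attaching Slurm resources to a compute-heavy step with `addResources` before running; the step name and resource values are illustrative, so adjust them to your cluster:

```{r runWF_cluster, eval=FALSE}
resources <- list(conffile = ".batchtools.conf.R",    # shipped with this template
                  template = "batchtools.slurm.tmpl", # shipped with this template
                  Njobs = 2,       # number of parallel jobs
                  walltime = 120,  # in minutes, see the template header
                  ntasks = 1,
                  ncpus = 4,
                  memory = 1024,   # megabytes per cpu
                  partition = "short")
sal <- addResources(sal, c("blast_genome"), resources = resources)
sal <- runWF(sal)
```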

- To use complex workflow control options, such as parallelization, subsetting samples, selecting steps, read the [documents](https://systempipe.org/sp/spr/sp_run/step_run/) on our website.
- Explore [other details of the workflow object](https://systempipe.org/sp/spr/sp_run/sal_explore/).
- Create [logs and reports](https://systempipe.org/sp/spr/sp_run/step_reports/).
- [Visualize the workflow](https://systempipe.org/sp/spr/sp_run/step_vis/).

# About the workflow
## Tools used
To check the command-line tools used in this workflow, use `listCmdTools`; use `listCmdModules`
to check the loaded modules if your system uses environment modules.

The following code will print out the tools required by your custom SPR project in the report.
In case you are running the workflow for the first time and do not have a project yet, or you
just want to browse this workflow, the following code displays the tools required by default.
```{r list_tools}
if(file.exists(file.path(".SPRproject", "SYSargsList.yml"))) {
local({
sal <- systemPipeR::SPRproject(resume = TRUE)
systemPipeR::listCmdTools(sal)
systemPipeR::listCmdModules(sal)
})
} else {
cat(crayon::blue$bold("Tools and modules required by this workflow are:\n"))
cat(c("BLAST 2.14.0+"), sep = "\n")
}
```


## Session Info
This is the session information for rendering this report. To access the session information
of the workflow run, check the HTML report generated by `renderLogs`.
```{r report_session_info, eval=TRUE}
sessionInfo()
```
2,303 changes: 2,303 additions & 0 deletions inst/extdata/workflows/SPblast/SPblast.html

Large diffs are not rendered by default.

51 changes: 51 additions & 0 deletions inst/extdata/workflows/SPblast/batchtools.slurm.tmpl
@@ -0,0 +1,51 @@
#!/bin/bash

## Job Resource Interface Definition
##
## ntasks [integer(1)]: Number of required tasks,
## Set larger than 1 if you want to further parallelize
## with MPI within your job.
## ncpus [integer(1)]: Number of required cpus per task,
## Set larger than 1 if you want to further parallelize
## with multicore/parallel within each task.
## walltime [integer(1)]: Walltime for this job, in minutes.
## Must be at least 1 minute.
## memory [integer(1)]: Memory in megabytes for each cpu.
## Must be at least 100 (when I tried lower values my
## jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.

<%
# relative paths are not handled well by Slurm
log.file = normalizePath(log.file, winslash = "/", mustWork = FALSE)
-%>


#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 1) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (!is.null(resources$partition)) sprintf(paste0("#SBATCH --partition='", resources$partition, "'")) %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>

## specify which queue on biocluster, one of 'batch', 'highmem', 'intel', 'gpu', 'mygroup'
#SBATCH -p short
##SBATCH -p batch
##SBATCH -p intel

## Initialize work environment like
## source /etc/profile
## module add ...

## Export value of DEBUGME environment variable to the worker
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
