diff --git a/README.md b/README.md
index c67cf4e..0077ccf 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ tutorial bwa
```

## Install and Prepare BWA
-First, we need to install BWA, also called Burrows-Wheeler Aligner. To do this, we will create and navigate to a new folder in our /home directory called `software`. We will then follow the developer's instructions (https://github.com/lh3/bwa) for using `git clone` to clone the software and then build the tool using `make`.
+First, we need to install BWA, also called Burrows-Wheeler Aligner. To do this, we will create and navigate to a new folder in our `/home` directory called `software`. We will then follow the developer's instructions (https://github.com/lh3/bwa) for using `git clone` to clone the software and then build the tool using `make`.

```
cd ~/tutorial-bwa
@@ -54,7 +54,7 @@ cd ~/tutorial-bwa/software
tar -czvf bwa.tar.gz bwa
```

-Checking the size of this compressed tarball using `ls -lh bwa.tar.gz` reveals the file is approximately 4MB. The tarball should stay in /home.
+Checking the size of this compressed tarball using `ls -lh bwa.tar.gz` reveals the file is approximately 4 MB. Based on the [Overview: Data Staging and Transfer to Jobs](https://support.opensciencegrid.org/support/solutions/articles/12000002985-overview-data-staging-and-transfer-to-jobs) guide, a tarball of this size should stay in `/home`.

## Download Data to Analyze

@@ -65,19 +65,21 @@ Now that we have installed BWA, we need to download data to analyze. For this tu
cd ~/tutorial-bwa
./download_data.sh
```
+
Investigating the size of the downloaded genome by typing:

```
ls -lh data/ref_genome/
```

-reveals the file is 1.4 MB. Therefore, this file should remain in /home and does not need to be moved to /public. We should also check the trimmed fastq paired-end read files:
+reveals the file is 1.4 MB. Therefore, this file should remain in `/home` and does not need to be moved to `/public`. We should also check the trimmed fastq paired-end read files:

```
ls -lh data/trimmed_fastq_small
```

Once everything is downloaded, make sure you're still in the `tutorial-bwa` directory.
+
```
cd ~/tutorial-bwa
```
@@ -110,7 +112,8 @@ requirements = (OSGVO_OS_STRING == "RHEL 7")

queue 1
```
-You will notice that the .log, .out, and .error files will be saved to a folder called `TestJobOutput`. We need to create this folder using `mkdir TestJobOutput` before we submit our job.
+
+You will notice that the `.log`, `.out`, and `.error` files will be saved to a folder called `TestJobOutput`. We need to create this folder using `mkdir TestJobOutput` before we submit our job.

We will call the script for this analysis `bwa-test.sh` and it should contain the following information:

@@ -148,6 +151,8 @@ condor_submit bwa-test.sub
```

To check the status of the job, we can use `condor_q`.

+## Optimize Resource Requests
+
Upon the completion of the test job, we should investigate the output to ensure that it is what we expected and also review the `.log` file to help optimize future resource requests in preparation for scaling up.
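+
+One quick way to jump straight to that summary is to search the log for the header of the resource table. This is a convenience sketch, assuming GNU `grep` (the `-A 3` flag prints the three lines that follow the match):
+
+```
+grep -A 3 "Partitionable Resources" TestJobOutput/bwa_test_job.log
+```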

For example, when we investigate the `bwa_test_job.log` file created in this analysis, at the bottom of the file we see a resource table:
@@ -167,7 +172,7 @@ Here we see that we used less than half of both the disk space and memory we req
In preparation for scaling up, please review our [guide on how to scale up after a successful test job](https://support.opensciencegrid.org/support/solutions/articles/12000076552-scaling-up-after-success-with-test-jobs) and how to [easily submit multiple jobs with a single submit file](https://support.opensciencegrid.org/support/solutions/articles/12000073165-easily-submit-multiple-jobs).

-After reviewing how to submit multiple jobs with a single submit file, it is possible to determine that the most appropriate way to submit multiple jobs for this analysis is to use `queue <var> from <list>`.
+After reviewing how to submit multiple jobs with a single submit file, we see that the most appropriate way to submit multiple jobs for this analysis is to use `queue <var> from <list>` because we want HTCondor to queue an independent job to analyze each of our biological samples.

To use this option, we first need to create a file with just the sample names/IDs that we want to analyze. To do this, we want to cut all information after the "_" symbol to remove the forward/reverse read information and file extensions. For example, we want SRR2584863_1.trim.sub.fastq to become just SRR2584863.
@@ -207,7 +212,7 @@ requirements = (OSGVO_OS_STRING == "RHEL 7")

queue sample from data/trimmed_fastq_small/samples.txt
```

-In addition to restructuring our submit file to queue a new job for each sample, it is also advantageous to have our standard output, log, and error files saved to dedicated folders called "log", "output", and "error" to help keep our output files organized. Therefore, we need to make these folders in our /home directory prior to submitting our job. We will also create an additional folder to store our aligned sequencing files in a folder called `results`:
+In addition to restructuring our submit file to queue a new job for each sample, it is also advantageous to have our standard output, log, and error files saved to dedicated folders called "log", "output", and "error" to help keep our output files organized. Therefore, we need to make these folders in our `/home` directory prior to submitting our job. We will also create an additional folder called `results` to store our aligned sequencing file output:

```
mkdir log
@@ -216,7 +221,7 @@ mkdir log
mkdir error
mkdir results
```

-To store the aligned sequencing files in the `results` folder, we can add the `transfer_output_remaps` feature to our submit file. This feature allows us to specify a name and a path to save our output files in the format of "file1 = path/to/save/file2", where file1 is the origional name of the document and file2 is the name that we want to save the file using. In the example above, we do not change the name of the resulting output files. This feature also helps us keep an organized working space, rather than having all of our resulting sequencing files be saved to our /home directory.
+To store the aligned sequencing files in the `results` folder, we can add the `transfer_output_remaps` feature to our submit file. This feature allows us to specify a name and a path to save our output files in the format of `transfer_output_remaps = "file1 = path/to/save/file2"`, where file1 is the original name of the file and file2 is the name we want it saved under.
+In the example above, we do not change the name of the resulting output files. The `transfer_output_remaps` feature also helps us keep an organized working space by having our analysis output files saved to a `/results` folder within `/home`, rather than having all of our resulting sequencing files saved to our main `/home` directory.

Once our submit file has been updated, we can update our script to match and call it something like `bwa-alignment.sh`:

diff --git a/tutorial-instructions.md b/tutorial-instructions.md
new file mode 100644
index 0000000..9a55c2b
--- /dev/null
+++ b/tutorial-instructions.md
@@ -0,0 +1,316 @@
+[TOC]
+
+
+## Introduction
+
+This tutorial focuses on a subset of the [Data Carpentry Genomics workshop curriculum](https://datacarpentry.org/genomics-workshop/) - specifically, this page covers how to run a BWA workflow on OSG resources. It will use the same general flow as the BWA segment of the Data Carpentry workshop with minor adjustments. The goal of this tutorial is to learn how to convert an existing BWA workflow to run on the OSPool.
+
+## Get Tutorial Files
+
+Logged into the submit node, we will run the `tutorial` command, which will create a folder for our analysis, as well as some sample files.
+
+```
+tutorial bwa
+```
+
+## Install and Prepare BWA
+First, we need to install BWA, also called Burrows-Wheeler Aligner. To do this, we will create and navigate to a new folder in our `/home` directory called `software`. We will then follow the developer's instructions (https://github.com/lh3/bwa) for using `git clone` to clone the software and then build the tool using `make`.
+
+```
+cd ~/tutorial-bwa
+cd software
+git clone https://github.com/lh3/bwa.git
+cd bwa
+make
+```
+
+Next, BWA needs to be added to our PATH so that we can test whether the installation worked:
+
+```
+export PATH=$PATH:/home/$USER/tutorial-bwa/software/bwa/
+```
+
+To check that BWA has been installed correctly, type `bwa`. You should receive output similar to the following:
+
+```
+Program: bwa (alignment via Burrows-Wheeler transformation)
+Version: 0.7.17-r1198-dirty
+Contact: Heng Li
+
+Usage:   bwa <command> [options]
+
+Command: index         index sequences in the FASTA format
+         mem           BWA-MEM algorithm
+         fastmap       identify super-maximal exact matches
+...
+
+```
+
+Now that we have successfully installed `bwa`, we will create a portable compressed tarball of this software so that it is smaller and quicker to transport when we submit our jobs to the OSPool.
+
+```
+cd ~/tutorial-bwa/software
+tar -czvf bwa.tar.gz bwa
+```
+
+Checking the size of this compressed tarball using `ls -lh bwa.tar.gz` reveals the file is approximately 4 MB. Based on the [Overview: Data Staging and Transfer to Jobs](https://support.opensciencegrid.org/support/solutions/articles/12000002985-overview-data-staging-and-transfer-to-jobs) guide, a tarball of this size should stay in `/home`.
+
+
+## Download Data to Analyze
+
+Now that we have installed BWA, we need to download data to analyze. For this tutorial, we will be downloading data used in the Data Carpentry workshop. This data includes both the genome of Escherichia coli (E. coli) and paired-end RNA sequencing reads obtained from a study carried out by Blount et al. published in [PNAS](http://www.pnas.org/content/105/23/7899). Additional information about how the data was modified in preparation for this analysis can be found on the [Data Carpentry workshop website](https://datacarpentry.org/wrangling-genomics/aio.html).
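+
+The tutorial provides a `download_data.sh` helper script that handles the downloads for us. As a rough sketch of what such a script does (the URL variables below are placeholders, not the tutorial's actual download locations):
+
+```
+#!/bin/bash
+# Hypothetical sketch of a data-download script; $REF_GENOME_URL and
+# $FASTQ_BASE_URL are placeholders, not the tutorial's real sources.
+mkdir -p data/ref_genome data/trimmed_fastq_small
+
+# Fetch the E. coli reference genome
+wget -O data/ref_genome/ecoli_rel606.fasta.gz "$REF_GENOME_URL"
+
+# Fetch the trimmed, subsampled paired-end fastq files for each sample
+for f in SRR2584863 SRR2584866 SRR2589044; do
+    wget -O "data/trimmed_fastq_small/${f}_1.trim.sub.fastq" "$FASTQ_BASE_URL/${f}_1.trim.sub.fastq"
+    wget -O "data/trimmed_fastq_small/${f}_2.trim.sub.fastq" "$FASTQ_BASE_URL/${f}_2.trim.sub.fastq"
+done
+```
+
+To download the data, run: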
+
+```
+cd ~/tutorial-bwa
+./download_data.sh
+```
+
+The `./download_data.sh` script will create two new folders inside the `/data` directory called `/ref_genome` and `/trimmed_fastq_small`. This script will download our E. coli genome file into the `/ref_genome` directory and will download short paired-end RNA sequencing files into the `/trimmed_fastq_small` directory.
+
+Once this executable finishes running, which may take several minutes, we can investigate the size of the downloaded genome file by typing:
+
+```
+ls -lh data/ref_genome/
+```
+
+This reveals that the file is 1.4 MB. Therefore, this file should remain in `/home` and does not need to be moved to `/public`. We should also check the trimmed fastq paired-end read files:
+
+```
+ls -lh data/trimmed_fastq_small
+```
+
+We see that these files are all approximately 50 MB, which means they should remain in our `/home` directory.
+
+Once everything is downloaded, make sure you're still in the `tutorial-bwa` directory.
+
+```
+cd ~/tutorial-bwa
+```
+
+## Run a Single Test Job
+
+Now that we have all items in our analysis ready, it is time to submit a single test job to map our RNA reads to the E. coli genome. For a single test job, we will choose a single sample to analyze. In the following example, we will align both the forward and reverse reads of SRR2584863 to the E. coli genome. Using a text editor such as `nano` or `vim`, we will open our example submit file for this test job called `bwa-test.sub`, which contains the following information:
+
+```
+universe = vanilla
+executable =
+# arguments =
+
+# need to transfer bwa.tar.gz file, the reference
+# genome, and the trimmed fastq files
+transfer_input_files =
+
+log = TestJobOutput/bwa_test_job.log
+output = TestJobOutput/bwa_test_job.out
+error = TestJobOutput/bwa_test_job.error
+
++JobDurationCategory = "Medium"
+request_cpus = 1
+request_memory = 2GB
+request_disk = 1GB
+
+should_transfer_files = YES
+when_to_transfer_output = ON_EXIT
+requirements = (OSGVO_OS_STRING == "RHEL 7")
+
+queue 1
+```
+
+Let's fill in the executable for this job so that HTCondor knows what to run:
+
+```
+executable = bwa-test.sh
+```
+
+Next, let's edit the `transfer_input_files` line so that HTCondor knows which files to transfer to the job:
+
+```
+transfer_input_files = software/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/trimmed_fastq_small/SRR2584863_1.trim.sub.fastq, data/trimmed_fastq_small/SRR2584863_2.trim.sub.fastq
+```
+
+You will notice that the `.log`, `.out`, and `.error` files will be saved to a folder called `TestJobOutput`. We need to save our changes to the submit file, exit our text editor, and create this folder using `mkdir TestJobOutput` before we submit our job.
+
+Once we have prepared our submit file, we will prepare our `bwa-test.sh`. Currently, our executable looks like:
+
+```
+#!/bin/bash
+# Script name: bwa-test.sh
+
+echo "Unpacking software"
+
+echo "Setting PATH for bwa"
+
+echo "Indexing E. coli genome"
+
+echo "Starting bwa alignment for SRR2584863"
+
+echo "Done with bwa alignment for SRR2584863!"
+
+echo "Cleaning up files generated from genome indexing"
+rm ecoli_rel606.fasta.gz.amb
+rm ecoli_rel606.fasta.gz.ann
+rm ecoli_rel606.fasta.gz.bwt
+rm ecoli_rel606.fasta.gz.pac
+rm ecoli_rel606.fasta.gz.sa
+```
+
+We need to edit it to contain the command to unpack our BWA software (`tar -xzf bwa.tar.gz`) and to set the PATH to the bwa executable (`export PATH=$_CONDOR_SCRATCH_DIR/bwa/:$PATH`), which is a modified version of our earlier command to set the PATH variable. Lastly, we need to add example commands for BWA to analyze our paired-end RNA sequencing files. An example executable may look like:
+
+```
+#!/bin/bash
+# Script name: bwa-test.sh
+
+echo "Unpacking software"
+tar -xzf bwa.tar.gz
+
+echo "Setting PATH for bwa"
+export PATH=$_CONDOR_SCRATCH_DIR/bwa/:$PATH
+
+echo "Indexing E. coli genome"
+bwa index ecoli_rel606.fasta.gz
+
+echo "Starting bwa alignment for SRR2584863"
+bwa mem ecoli_rel606.fasta.gz SRR2584863_1.trim.sub.fastq SRR2584863_2.trim.sub.fastq > SRR2584863.aligned.sam
+
+echo "Done with bwa alignment for SRR2584863!"
+
+echo "Cleaning up files generated from genome indexing"
+rm ecoli_rel606.fasta.gz.amb
+rm ecoli_rel606.fasta.gz.ann
+rm ecoli_rel606.fasta.gz.bwt
+rm ecoli_rel606.fasta.gz.pac
+rm ecoli_rel606.fasta.gz.sa
+```
+
+Once done, we can submit this single test job to HTCondor by typing:
+
+```
+condor_submit bwa-test.sub
+```
+
+To check the status of the job, we can use `condor_q`.
+
+## Optimize Resource Requests
+
+Upon the completion of the test job, we should investigate the output to ensure that it is what we expected and also review the `.log` file to help optimize future resource requests in preparation for scaling up.
+
+For example, when we investigate the `bwa_test_job.log` file created in this analysis, at the bottom of the file we see a resource table:
+
+```
+ Partitionable Resources :    Usage  Request Allocated
+    Cpus                 :                 1         1
+    Disk (KB)            :   253770  1048576  27945123
+    Memory (MB)          :      144     2048      2500
+```
+
+Here we see that we used roughly a quarter of the disk space and less than a tenth of the memory we requested. In future jobs, we should request a smaller amount of each resource, such as 0.5 GB of disk space and 0.5 GB of memory. Prior to scaling up our analysis, we should run additional test jobs using these resource requests to ensure that they are sufficient to allow our job to complete successfully.
+
+
+## Scaling Up to Analyze Multiple Samples
+
+In preparation for scaling up, please review our [guide on how to scale up after a successful test job](https://support.opensciencegrid.org/support/solutions/articles/12000076552-scaling-up-after-success-with-test-jobs) and how to
+[easily submit multiple jobs with a single submit file](https://support.opensciencegrid.org/support/solutions/articles/12000073165-easily-submit-multiple-jobs).
+
+After reviewing how to submit multiple jobs with a single submit file, we see that the most appropriate way to submit multiple jobs for this analysis is to use `queue <var> from <list>` because we want HTCondor to queue an independent job to analyze each of our biological samples.
+
+To use this option, we first need to create a file with just the sample names/IDs that we want to analyze. To do this, we want to cut all information after the "_" symbol to remove the forward/reverse read information and file extensions. For example, we want SRR2584863_1.trim.sub.fastq to become just SRR2584863.
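+
+As a quick illustration of the renaming we are after, `cut` with `_` as the delimiter keeps only the first underscore-separated field:
+
+```
+echo "SRR2584863_1.trim.sub.fastq" | cut -f 1 -d '_'
+# prints: SRR2584863
+```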
+Note: this step is just an example of one way to prepare your workflow to automatically queue multiple jobs - there are several other approaches that could have been used instead.
+
+We will save the sample names in a file called `samples.txt` and view the contents of this file using `cat samples.txt`:
+
+```
+cd ~/tutorial-bwa
+cd data/trimmed_fastq_small/
+ls *.fastq | cut -f 1 -d '_' | uniq > samples.txt
+cat samples.txt
+cd ~/tutorial-bwa
+```
+
+Now, we can create a new submit file called `many-bwa-alignment.sub` to queue a new job for each sample. To make it simpler to start, you can copy the `bwa-test.sub` file (`cp bwa-test.sub many-bwa-alignment.sub`) and modify it. By editing the `queue` statement to be something like `queue sample from data/trimmed_fastq_small/samples.txt`, we can then call the `sample` variable elsewhere in our submit file by using `$(sample)`.
+
+In addition to restructuring our submit file to queue a new job for each sample, it is also advantageous to have our standard output, log, and error files saved to dedicated folders called "log", "output", and "error" to help keep our many output files organized. Therefore, we need to make these folders in our `/home` directory prior to submitting our job. We will also create an additional folder called `results` to store our aligned sequencing file output:
+
+```
+mkdir log
+mkdir output
+mkdir error
+mkdir results
+```
+
+To store the aligned sequencing files in the `results` folder, we can add the `transfer_output_remaps` feature to our submit file. This feature allows us to specify a name and a path to save our output files in the format of `transfer_output_remaps = "file1 = path/to/save/file2"`, where file1 is the original name of the file and file2 is the name we want it saved under. In the example below, we do not change the name of the resulting output files. The `transfer_output_remaps` feature also helps us keep an organized working space by having our analysis output files saved to a `/results` folder within `/home`, rather than having all of our resulting sequencing files saved to our main `/home` directory.
+
+An example submit file containing these changes, along with our decreased memory and disk space requests, is:
+
+```
+universe = vanilla
+executable = many-bwa-alignment.sh
+arguments = $(sample)
+
+transfer_input_files = software/bwa.tar.gz, data/ref_genome/ecoli_rel606.fasta.gz, data/trimmed_fastq_small/$(sample)_1.trim.sub.fastq, data/trimmed_fastq_small/$(sample)_2.trim.sub.fastq
+
+transfer_output_remaps = "$(sample).aligned.sam=results/$(sample).aligned.sam"
+
+log = log/bwa_$(sample)_job.log
+output = output/bwa_$(sample)_job.out
+error = error/bwa_$(sample)_job.error
+
++JobDurationCategory = "Medium"
+request_cpus = 1
+request_memory = 0.5GB
+request_disk = 0.5GB
+
+should_transfer_files = YES
+when_to_transfer_output = ON_EXIT
+requirements = (OSGVO_OS_STRING == "RHEL 7")
+
+queue sample from data/trimmed_fastq_small/samples.txt
+```
+
+Once our submit file has been updated, we also want to make sure to update our executable such that when we queue a job for each sample, our script will know how to analyze the different input sequencing files.
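+
+Because the submit file sets `arguments = $(sample)`, HTCondor hands each queued job its sample ID as the first positional argument of the executable. For the first sample, for example, the job effectively runs:
+
+```
+./many-bwa-alignment.sh SRR2584863
+```
+
+Inside the script, that value is then available as `$1`.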
+Let's copy our `bwa-test.sh` file and create a new executable called `many-bwa-alignment.sh` by typing `cp bwa-test.sh many-bwa-alignment.sh`, then edit the file to look like:
+
+```
+#!/bin/bash
+# Script name: many-bwa-alignment.sh
+
+echo "Unpacking software"
+tar -xzf bwa.tar.gz
+
+echo "Setting PATH for bwa"
+export PATH=$_CONDOR_SCRATCH_DIR/bwa/:$PATH
+
+# Store the first argument (the sample ID) in a named variable
+sample=$1
+
+echo "Indexing E. coli genome"
+bwa index ecoli_rel606.fasta.gz
+
+echo "Starting bwa alignment for ${sample}"
+bwa mem ecoli_rel606.fasta.gz ${sample}_1.trim.sub.fastq ${sample}_2.trim.sub.fastq > ${sample}.aligned.sam
+
+echo "Done with bwa alignment for ${sample}!"
+
+echo "Cleaning up workspace"
+rm ecoli_rel606.fasta.gz.amb
+rm ecoli_rel606.fasta.gz.ann
+rm ecoli_rel606.fasta.gz.bwt
+rm ecoli_rel606.fasta.gz.pac
+rm ecoli_rel606.fasta.gz.sa
+```
+
+Once ready, we can submit our job to HTCondor by using `condor_submit many-bwa-alignment.sub`.
+
+When we type `condor_q`, we see that three jobs have entered the queue (one for each of our three experimental samples).
+
+When our jobs are completed, we can confirm that our alignment results files were created by typing:
+
+```
+ls -lh results/*
+```
+
+We should see our bwa results files (SRR2584863.aligned.sam, SRR2584866.aligned.sam, SRR2589044.aligned.sam).
+
+We can also investigate our log, error, and output files in their respective folders to ensure that they contain the output we expected.
+
+
+
+_For more information about running bioinformatics workflows on the OSG, we recommend our [BLAST tutorial](https://support.opensciencegrid.org/support/solutions/articles/12000062020-running-a-blast-workflow) as well as our [Samtools](https://support.opensciencegrid.org/support/solutions/articles/12000074984-example-software-compilation) installation guide._