This is an RNA-Seq pipeline based on STAR and Salmon, designed to run on SLURM clusters.
- Python 3
- SLURM cluster
- STAR >= 2.7.0 (https://github.com/alexdobin/STAR)
- Salmon 0.13.1 (https://anaconda.org/bioconda/salmon/files)
- trim_galore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
- htseq (https://htseq.readthedocs.io/en/master/)
These scripts are designed to be run from the directory that contains them. From that directory, execute:
$ ./submit_dir.sh <genomedata> <data-root> <resultdir>
For each analysis project, we need a genome directory that has been indexed with STAR. The directory should therefore contain:
- FASTA file
- SA, SAindex, Genome and chromosome files generated by STAR
- GFF file (optional, for htseq-count)
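The list above can be checked before submitting any jobs. The following is a minimal sketch, not part of the pipeline itself. It assumes the file names STAR writes into its genome directory (`SA`, `SAindex`, `Genome`, `chrName.txt`), and the accepted FASTA extensions are a guess; the function name `missing_genome_files` is hypothetical.

```python
from pathlib import Path

# Assumption: these are the names STAR gives its index files.
STAR_INDEX_FILES = ("SA", "SAindex", "Genome", "chrName.txt")
# Assumption: the FASTA file uses one of these extensions.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna")


def missing_genome_files(genome_dir):
    """Return the required items that are absent from genome_dir."""
    genome_dir = Path(genome_dir)
    # Collect every expected STAR index file that is not present.
    missing = [name for name in STAR_INDEX_FILES
               if not (genome_dir / name).is_file()]
    # Look for at least one file with a FASTA-like extension.
    has_fasta = genome_dir.is_dir() and any(
        p.is_file() and p.suffix in FASTA_SUFFIXES
        for p in genome_dir.iterdir()
    )
    if not has_fasta:
        missing.append("FASTA file")
    return missing
```

An empty list means the directory looks complete. The optional GFF file is deliberately not checked.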
$ ./submit_dir.sh <genomedata> <data-root> <resultdir>
The submit_dir.sh command takes a genome data directory, a data root directory
and a result directory as parameters.
The data root directory contains the subdirectories that hold the FASTQ
files. Typically these follow the naming scheme "R[something]", e.g. "R123",
and each contains a pair of files such as "R123_1.fq.gz" and
"R123_2.fq.gz".
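To illustrate the naming scheme, here is a small sketch that locates the two FASTQ files of one run directory. It assumes the files are named exactly `<dirname>_1.fq.gz` and `<dirname>_2.fq.gz`, as in the example above; the helper `find_read_pair` is hypothetical and not one of the pipeline scripts.

```python
from pathlib import Path


def find_read_pair(run_dir):
    """Return (mate1, mate2) for run_dir, or None if either file is missing.

    Assumption: files are named <dirname>_1.fq.gz and <dirname>_2.fq.gz.
    """
    run_dir = Path(run_dir)
    mate1 = run_dir / f"{run_dir.name}_1.fq.gz"
    mate2 = run_dir / f"{run_dir.name}_2.fq.gz"
    if mate1.is_file() and mate2.is_file():
        return mate1, mate2
    return None
```

Calling `find_read_pair("data/R123")` would return the paths to `R123_1.fq.gz` and `R123_2.fq.gz` if both exist in that directory.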
The submit_dir.sh file scans the data root directory
for every directory that starts with a capital "R", creates a SLURM job for it and
submits it to the cluster.
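The scan described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual content of submit_dir.sh: the per-run job script name `run_sample.sh` and its argument order are hypothetical, and the functions only build the `sbatch` command lines instead of submitting anything.

```python
from pathlib import Path


def find_run_dirs(data_root):
    """Return all subdirectories of data_root whose name starts with 'R'."""
    return sorted(
        p for p in Path(data_root).iterdir()
        if p.is_dir() and p.name.startswith("R")
    )


def build_sbatch_command(genome_dir, run_dir, result_root,
                         job_script="run_sample.sh"):
    """Build an sbatch command line (as a list) for one run directory.

    Assumption: job_script is a hypothetical per-run script that takes
    the genome directory, the run directory and an output directory.
    """
    run_dir = Path(run_dir)
    # Results go into a subdirectory named after the run, e.g. results/R123.
    out_dir = Path(result_root) / run_dir.name
    return [
        "sbatch",
        f"--job-name={run_dir.name}",
        job_script,
        str(genome_dir),
        str(run_dir),
        str(out_dir),
    ]
```

Printing the result of `build_sbatch_command` for every entry of `find_run_dirs(data_root)` gives a dry run of what would be submitted. Passing each list to `subprocess.run` would perform the actual submission.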
After all runs have finished, the results can be found in the result directory, each in its corresponding "R..." subdirectory.
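A quick way to spot runs that produced no output is to compare the two directory trees. The sketch below assumes only what is stated above, namely that each "R..." directory in the data root has a directory of the same name in the result directory. The function name `runs_without_results` is hypothetical, and the check only tests that the directory exists, not that the job succeeded.

```python
from pathlib import Path


def runs_without_results(data_root, result_root):
    """Return names of 'R...' run directories lacking a result directory."""
    result_root = Path(result_root)
    return sorted(
        p.name for p in Path(data_root).iterdir()
        if p.is_dir()
        and p.name.startswith("R")
        and not (result_root / p.name).is_dir()
    )
```

An empty list means every run directory has a matching result directory.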
You can run MultiQC to obtain an overview of the results:
$ multiqc <resultdir>