abhirichster

RNA-seq Data Analysis Pipeline

This repository provides a comprehensive guide to RNA sequencing (RNA-seq) data analysis steps, from raw sequencing reads to biological insights.

Overview

RNA-seq is a powerful technique for studying transcriptomes, enabling quantification of gene expression, discovery of novel transcripts, and identification of alternative splicing events. This guide covers the complete analysis workflow.

Analysis Pipeline Steps

1. Experimental Design and Sequencing

Purpose: Design the experiment and generate sequencing data

Key Considerations:

Sample size: Minimum 3 biological replicates per condition (more is better)
Sequencing depth: 20-30 million reads for human samples
Read type: Single-end (SE) or Paired-end (PE)
Read length: Typically 50-150 bp
Library preparation: Poly-A selection (mRNA) or rRNA depletion (total RNA)

Output: FASTQ files containing raw sequencing reads

2. Quality Control

Purpose: Assess the quality of raw sequencing reads

Tools:

FastQC
MultiQC

Key Metrics:

Per base sequence quality
Per sequence quality scores
Sequence length distribution
GC content
Adapter contamination
Overrepresented sequences
Duplication levels

Commands:

# Run FastQC on all samples
fastqc *.fastq.gz -o fastqc_results/

# Aggregate results with MultiQC
multiqc fastqc_results/ -o multiqc_report/

Expected Output: Quality reports identifying any issues with sequencing data

3. Read Trimming and Filtering

Purpose: Remove low-quality bases, adapters, and contaminating sequences

Tools:

Trimmomatic
Cutadapt
fastp
TrimGalore

Operations:

Remove adapter sequences
Trim low-quality bases (typically Q < 20)
Remove reads below minimum length
Filter out rRNA contamination (if applicable)

Example (Trimmomatic):

# For paired-end reads
trimmomatic PE -threads 8 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1_trimmed.fastq.gz sample_R1_unpaired.fastq.gz \
  sample_R2_trimmed.fastq.gz sample_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36

Output: Cleaned, high-quality FASTQ files

4. Read Alignment

Purpose: Map reads to a reference genome or transcriptome

Tools:

Splice-aware aligners (for genome alignment):
- STAR (fast, accurate)
- HISAT2 (memory efficient)
- TopHat2 (older, less common now)
Transcriptome aligners:
- Salmon (pseudo-alignment)
- Kallisto (pseudo-alignment)
- Bowtie2 (for transcriptome)

STAR Example:

# Build genome index (one-time step)
STAR --runMode genomeGenerate \
  --genomeDir genome_index/ \
  --genomeFastaFiles genome.fa \
  --sjdbGTFfile annotations.gtf \
  --runThreadN 16

# Align reads
STAR --genomeDir genome_index/ \
  --readFilesIn sample_R1_trimmed.fastq.gz sample_R2_trimmed.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix sample_ \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN 16 \
  --quantMode GeneCounts

Output: BAM files (aligned reads) and alignment statistics

5. Quality Assessment of Alignment

Purpose: Evaluate alignment quality and library characteristics

Tools:

RSeQC
Qualimap
Picard CollectRNASeqMetrics
SAMtools flagstat

Key Metrics:

Mapping rate (typically >70%)
Uniquely mapped reads
Gene body coverage (5' to 3' bias)
Insert size distribution (for PE reads)
Strand specificity
Exon/intron/intergenic distribution

Example:

# Get mapping statistics
samtools flagstat sample_Aligned.sortedByCoord.out.bam

# Gene body coverage
geneBody_coverage.py -i sample_Aligned.sortedByCoord.out.bam \
  -r reference.bed -o sample_geneBodyCoverage

Expected: >70% alignment rate, uniform gene body coverage

6. Quantification

Purpose: Count the number of reads mapping to each gene/transcript

Approaches:

A. Count-based (from BAM files):

featureCounts (fast, accurate)
HTSeq-count (older, slower)

featureCounts -T 8 -p -t exon -g gene_id \
  -a annotations.gtf \
  -o counts.txt \
  *.bam

B. Pseudo-alignment (from FASTQ files):

Salmon (recommended)
Kallisto

# Build transcriptome index (one-time)
salmon index -t transcriptome.fa -i salmon_index

# Quantify
salmon quant -i salmon_index \
  -l A \
  -1 sample_R1_trimmed.fastq.gz \
  -2 sample_R2_trimmed.fastq.gz \
  -o sample_quant \
  --validateMappings

Output: Count matrix (genes × samples)

7. Differential Expression Analysis

Purpose: Identify genes that are significantly different between conditions

Tools (R-based):

DESeq2 (recommended for count data)
edgeR
limma-voom

Analysis Steps:

Import count data
Filter low-expressed genes
Normalize for library size and composition
Estimate dispersion
Fit statistical model
Test for differential expression
Adjust p-values for multiple testing (FDR)

DESeq2 Example:

library(DESeq2)

# Load count matrix
countData <- read.table("counts.txt", header=TRUE, row.names=1)
colData <- read.table("sample_info.txt", header=TRUE, row.names=1)

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = countData,
                               colData = colData,
                               design = ~ condition)

# Filter low counts
dds <- dds[rowSums(counts(dds)) >= 10, ]

# Run differential expression analysis
dds <- DESeq(dds)

# Get results
res <- results(dds, contrast=c("condition","treated","control"))

# Filter significant genes (padj < 0.05, |log2FC| > 1)
sig_genes <- res[which(res$padj < 0.05 & abs(res$log2FoldChange) > 1), ]

Output: List of differentially expressed genes with statistics (log2 fold change, p-value, adjusted p-value)

8. Functional Enrichment Analysis

Purpose: Understand biological meaning of differentially expressed genes

Analyses:

Gene Ontology (GO) enrichment
KEGG pathway analysis
Gene Set Enrichment Analysis (GSEA)
Reactome pathway analysis

Tools:

clusterProfiler (R)
DAVID
Enrichr
g:Profiler
Metascape

Example (clusterProfiler):

library(clusterProfiler)
library(org.Hs.eg.db)

# Get gene list
gene_list <- rownames(sig_genes)

# GO enrichment
ego <- enrichGO(gene = gene_list,
                OrgDb = org.Hs.eg.db,
                keyType = 'ENSEMBL',
                ont = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.05)

# KEGG pathway
kegg <- enrichKEGG(gene = gene_list,
                   organism = 'hsa',
                   pvalueCutoff = 0.05)

Output: Enriched biological pathways and processes

9. Visualization

Purpose: Create informative plots for data interpretation and publication

Common Visualizations:

Quality Control:
- FastQC plots
- Mapping statistics bar plots
Exploratory Analysis:
- PCA plot (sample clustering)
- Sample correlation heatmap
- Sample distance heatmap
Differential Expression:
- MA plot (log2FC vs mean expression)
- Volcano plot (log2FC vs -log10(p-value))
- Heatmap of top DEGs
Gene Expression:
- Normalized counts boxplots
- Expression profiles across conditions
Functional Analysis:
- GO/pathway enrichment dot plots
- Network diagrams

Example Visualizations (R):

# PCA plot
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd, intgroup="condition")

# Volcano plot
library(EnhancedVolcano)
EnhancedVolcano(res,
    lab = rownames(res),
    x = 'log2FoldChange',
    y = 'pvalue',
    pCutoff = 0.05,
    FCcutoff = 1)

# Heatmap of top genes
library(pheatmap)
top_genes <- head(order(res$padj), 50)
pheatmap(assay(vsd)[top_genes, ],
         cluster_rows=TRUE,
         show_rownames=TRUE,
         cluster_cols=TRUE,
         annotation_col=as.data.frame(colData(dds)[,"condition"]))

Complete Workflow Summary

Raw FASTQ files
    ↓
Quality Control (FastQC)
    ↓
Read Trimming (Trimmomatic/fastp)
    ↓
Alignment (STAR/HISAT2) or Pseudo-alignment (Salmon/Kallisto)
    ↓
Alignment QC (RSeQC/Qualimap)
    ↓
Quantification (featureCounts/Salmon)
    ↓
Count Matrix
    ↓
Differential Expression Analysis (DESeq2/edgeR)
    ↓
DEG List
    ↓
Functional Enrichment (clusterProfiler/DAVID)
    ↓
Visualization & Interpretation
    ↓
Biological Insights

Additional Considerations

Data Management

Keep organized directory structure
Document all software versions
Save intermediate files
Use version control (Git)
Implement workflow managers (Snakemake, Nextflow)

Statistical Considerations

Batch effects correction (if needed)
Normalization methods (TPM, FPKM, TMM, etc.)
Multiple testing correction
Biological vs. technical replicates

Advanced Analyses

Alternative splicing detection
Novel transcript discovery
Fusion gene detection
Non-coding RNA analysis
Single-cell RNA-seq

Resources

Reference Genomes

GENCODE: Human and mouse annotations
Ensembl: Multi-species genomes
UCSC Genome Browser: Genome assemblies
NCBI RefSeq: Reference sequences

Useful Links

Bioconductor: https://bioconductor.org/
Galaxy Project: https://galaxyproject.org/
RNA-seqlopedia: https://rnaseq.uoregon.edu/

Software Requirements

Essential tools:

# Quality control
fastqc
multiqc

# Trimming
trimmomatic
fastp

# Alignment
STAR
HISAT2
salmon
kallisto

# Quantification
subread (featureCounts)

# Utilities
samtools
bedtools

# Statistical analysis (R packages)
DESeq2
edgeR
limma
clusterProfiler

Citation

If you use this workflow, please cite the appropriate tools used in your analysis. Each tool has its own publication that should be referenced.

Last Updated: 2025-11-01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

abhirichster

Achievements

Achievements

Block or report abhirichster

RNA-seq Data Analysis Pipeline

Overview

Table of Contents

Analysis Pipeline Steps

1. Experimental Design and Sequencing

2. Quality Control

3. Read Trimming and Filtering

4. Read Alignment

5. Quality Assessment of Alignment

6. Quantification

A. Count-based (from BAM files):

B. Pseudo-alignment (from FASTQ files):

7. Differential Expression Analysis

8. Functional Enrichment Analysis

9. Visualization

Complete Workflow Summary

Additional Considerations

Data Management

Statistical Considerations

Advanced Analyses

Resources

Reference Genomes

Useful Links

Software Requirements

Citation

Popular repositories Loading

Uh oh!