This repository provides a comprehensive guide to RNA sequencing (RNA-seq) data analysis steps, from raw sequencing reads to biological insights.
RNA-seq is a powerful technique for studying transcriptomes, enabling quantification of gene expression, discovery of novel transcripts, and identification of alternative splicing events. This guide covers the complete analysis workflow.
- Experimental Design and Sequencing
- Quality Control
- Read Trimming and Filtering
- Read Alignment
- Quality Assessment of Alignment
- Quantification
- Differential Expression Analysis
- Functional Enrichment Analysis
- Visualization
Purpose: Design the experiment and generate sequencing data
Key Considerations:
- Sample size: Minimum 3 biological replicates per condition (more is better)
- Sequencing depth: 20-30 million reads for human samples
- Read type: Single-end (SE) or Paired-end (PE)
- Read length: Typically 50-150 bp
- Library preparation: Poly-A selection (mRNA) or rRNA depletion (total RNA)
Output: FASTQ files containing raw sequencing reads
Purpose: Assess the quality of raw sequencing reads
Tools:
- FastQC
- MultiQC
Key Metrics:
- Per base sequence quality
- Per sequence quality scores
- Sequence length distribution
- GC content
- Adapter contamination
- Overrepresented sequences
- Duplication levels
Commands:
# Run FastQC on all samples
fastqc *.fastq.gz -o fastqc_results/
# Aggregate results with MultiQC
multiqc fastqc_results/ -o multiqc_report/Expected Output: Quality reports identifying any issues with sequencing data
Purpose: Remove low-quality bases, adapters, and contaminating sequences
Tools:
- Trimmomatic
- Cutadapt
- fastp
- TrimGalore
Operations:
- Remove adapter sequences
- Trim low-quality bases (typically Q < 20)
- Remove reads below minimum length
- Filter out rRNA contamination (if applicable)
Example (Trimmomatic):
# For paired-end reads
trimmomatic PE -threads 8 \
sample_R1.fastq.gz sample_R2.fastq.gz \
sample_R1_trimmed.fastq.gz sample_R1_unpaired.fastq.gz \
sample_R2_trimmed.fastq.gz sample_R2_unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36Output: Cleaned, high-quality FASTQ files
Purpose: Map reads to a reference genome or transcriptome
Tools:
- Splice-aware aligners (for genome alignment):
- STAR (fast, accurate)
- HISAT2 (memory efficient)
- TopHat2 (older, less common now)
- Transcriptome aligners:
- Salmon (pseudo-alignment)
- Kallisto (pseudo-alignment)
- Bowtie2 (for transcriptome)
STAR Example:
# Build genome index (one-time step)
STAR --runMode genomeGenerate \
--genomeDir genome_index/ \
--genomeFastaFiles genome.fa \
--sjdbGTFfile annotations.gtf \
--runThreadN 16
# Align reads
STAR --genomeDir genome_index/ \
--readFilesIn sample_R1_trimmed.fastq.gz sample_R2_trimmed.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--runThreadN 16 \
--quantMode GeneCountsOutput: BAM files (aligned reads) and alignment statistics
Purpose: Evaluate alignment quality and library characteristics
Tools:
- RSeQC
- Qualimap
- Picard CollectRNASeqMetrics
- SAMtools flagstat
Key Metrics:
- Mapping rate (typically >70%)
- Uniquely mapped reads
- Gene body coverage (5' to 3' bias)
- Insert size distribution (for PE reads)
- Strand specificity
- Exon/intron/intergenic distribution
Example:
# Get mapping statistics
samtools flagstat sample_Aligned.sortedByCoord.out.bam
# Gene body coverage
geneBody_coverage.py -i sample_Aligned.sortedByCoord.out.bam \
-r reference.bed -o sample_geneBodyCoverageExpected: >70% alignment rate, uniform gene body coverage
Purpose: Count the number of reads mapping to each gene/transcript
Approaches:
- featureCounts (fast, accurate)
- HTSeq-count (older, slower)
featureCounts -T 8 -p -t exon -g gene_id \
-a annotations.gtf \
-o counts.txt \
*.bam- Salmon (recommended)
- Kallisto
# Build transcriptome index (one-time)
salmon index -t transcriptome.fa -i salmon_index
# Quantify
salmon quant -i salmon_index \
-l A \
-1 sample_R1_trimmed.fastq.gz \
-2 sample_R2_trimmed.fastq.gz \
-o sample_quant \
--validateMappingsOutput: Count matrix (genes × samples)
Purpose: Identify genes that are significantly different between conditions
Tools (R-based):
- DESeq2 (recommended for count data)
- edgeR
- limma-voom
Analysis Steps:
- Import count data
- Filter low-expressed genes
- Normalize for library size and composition
- Estimate dispersion
- Fit statistical model
- Test for differential expression
- Adjust p-values for multiple testing (FDR)
DESeq2 Example:
library(DESeq2)
# Load count matrix
countData <- read.table("counts.txt", header=TRUE, row.names=1)
colData <- read.table("sample_info.txt", header=TRUE, row.names=1)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ condition)
# Filter low counts
dds <- dds[rowSums(counts(dds)) >= 10, ]
# Run differential expression analysis
dds <- DESeq(dds)
# Get results
res <- results(dds, contrast=c("condition","treated","control"))
# Filter significant genes (padj < 0.05, |log2FC| > 1)
sig_genes <- res[which(res$padj < 0.05 & abs(res$log2FoldChange) > 1), ]Output: List of differentially expressed genes with statistics (log2 fold change, p-value, adjusted p-value)
Purpose: Understand biological meaning of differentially expressed genes
Analyses:
- Gene Ontology (GO) enrichment
- KEGG pathway analysis
- Gene Set Enrichment Analysis (GSEA)
- Reactome pathway analysis
Tools:
- clusterProfiler (R)
- DAVID
- Enrichr
- g:Profiler
- Metascape
Example (clusterProfiler):
library(clusterProfiler)
library(org.Hs.eg.db)
# Get gene list
gene_list <- rownames(sig_genes)
# GO enrichment
ego <- enrichGO(gene = gene_list,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05)
# KEGG pathway
kegg <- enrichKEGG(gene = gene_list,
organism = 'hsa',
pvalueCutoff = 0.05)Output: Enriched biological pathways and processes
Purpose: Create informative plots for data interpretation and publication
Common Visualizations:
-
Quality Control:
- FastQC plots
- Mapping statistics bar plots
-
Exploratory Analysis:
- PCA plot (sample clustering)
- Sample correlation heatmap
- Sample distance heatmap
-
Differential Expression:
- MA plot (log2FC vs mean expression)
- Volcano plot (log2FC vs -log10(p-value))
- Heatmap of top DEGs
-
Gene Expression:
- Normalized counts boxplots
- Expression profiles across conditions
-
Functional Analysis:
- GO/pathway enrichment dot plots
- Network diagrams
Example Visualizations (R):
# PCA plot
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd, intgroup="condition")
# Volcano plot
library(EnhancedVolcano)
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'pvalue',
pCutoff = 0.05,
FCcutoff = 1)
# Heatmap of top genes
library(pheatmap)
top_genes <- head(order(res$padj), 50)
pheatmap(assay(vsd)[top_genes, ],
cluster_rows=TRUE,
show_rownames=TRUE,
cluster_cols=TRUE,
annotation_col=as.data.frame(colData(dds)[,"condition"]))Raw FASTQ files
↓
Quality Control (FastQC)
↓
Read Trimming (Trimmomatic/fastp)
↓
Alignment (STAR/HISAT2) or Pseudo-alignment (Salmon/Kallisto)
↓
Alignment QC (RSeQC/Qualimap)
↓
Quantification (featureCounts/Salmon)
↓
Count Matrix
↓
Differential Expression Analysis (DESeq2/edgeR)
↓
DEG List
↓
Functional Enrichment (clusterProfiler/DAVID)
↓
Visualization & Interpretation
↓
Biological Insights
- Keep organized directory structure
- Document all software versions
- Save intermediate files
- Use version control (Git)
- Implement workflow managers (Snakemake, Nextflow)
- Batch effects correction (if needed)
- Normalization methods (TPM, FPKM, TMM, etc.)
- Multiple testing correction
- Biological vs. technical replicates
- Alternative splicing detection
- Novel transcript discovery
- Fusion gene detection
- Non-coding RNA analysis
- Single-cell RNA-seq
- GENCODE: Human and mouse annotations
- Ensembl: Multi-species genomes
- UCSC Genome Browser: Genome assemblies
- NCBI RefSeq: Reference sequences
- Bioconductor: https://bioconductor.org/
- Galaxy Project: https://galaxyproject.org/
- RNA-seqlopedia: https://rnaseq.uoregon.edu/
Essential tools:
# Quality control
fastqc
multiqc
# Trimming
trimmomatic
fastp
# Alignment
STAR
HISAT2
salmon
kallisto
# Quantification
subread (featureCounts)
# Utilities
samtools
bedtools
# Statistical analysis (R packages)
DESeq2
edgeR
limma
clusterProfilerIf you use this workflow, please cite the appropriate tools used in your analysis. Each tool has its own publication that should be referenced.
Last Updated: 2025-11-01
