abhirichster/README.md

RNA-seq Data Analysis Pipeline

This repository provides a comprehensive guide to RNA sequencing (RNA-seq) data analysis steps, from raw sequencing reads to biological insights.

Overview

RNA-seq is a powerful technique for studying transcriptomes, enabling quantification of gene expression, discovery of novel transcripts, and identification of alternative splicing events. This guide covers the complete analysis workflow.

Table of Contents

  1. Experimental Design and Sequencing
  2. Quality Control
  3. Read Trimming and Filtering
  4. Read Alignment
  5. Quality Assessment of Alignment
  6. Quantification
  7. Differential Expression Analysis
  8. Functional Enrichment Analysis
  9. Visualization

Analysis Pipeline Steps

1. Experimental Design and Sequencing

Purpose: Design the experiment and generate sequencing data

Key Considerations:

  • Sample size: Minimum 3 biological replicates per condition (more is better)
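  • Sequencing depth: 20-30 million reads per sample is a common target for differential gene expression in human samples; isoform-level or low-abundance analyses generally need more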
  • Read type: Single-end (SE) or Paired-end (PE)
  • Read length: Typically 50-150 bp
  • Library preparation: Poly-A selection (mRNA) or rRNA depletion (total RNA)

Output: FASTQ files containing raw sequencing reads
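Before moving on, it can help to confirm that each FASTQ file actually reaches the planned depth and that paired-end mates are consistent. The sketch below is one way to do that with standard Unix tools; the function names `count_reads` and `check_pair` are illustrative, and it assumes standard 4-line FASTQ records.

```bash
# Count reads in a FASTQ file (plain or gzip-compressed).
# Assumes standard 4-line records (no wrapped sequence lines).
count_reads() {
  case "$1" in
    *.gz) lines=$(gzip -dc "$1" | wc -l) ;;
    *)    lines=$(wc -l < "$1") ;;
  esac
  echo $(( $lines / 4 ))
}

# Succeed only if both mates of a paired-end sample contain the same number of reads.
check_pair() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  echo "R1=$n1 R2=$n2"
  [ "$n1" -eq "$n2" ]
}
```

Example use: `check_pair sample_R1.fastq.gz sample_R2.fastq.gz || echo "mate files differ in length"`.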


2. Quality Control

Purpose: Assess the quality of raw sequencing reads

Tools:

  • FastQC
  • MultiQC

Key Metrics:

  • Per base sequence quality
  • Per sequence quality scores
  • Sequence length distribution
  • GC content
  • Adapter contamination
  • Overrepresented sequences
  • Duplication levels

Commands:

# Run FastQC on all samples (FastQC does not create the output directory itself)
mkdir -p fastqc_results
fastqc *.fastq.gz -o fastqc_results/

# Aggregate results with MultiQC
multiqc fastqc_results/ -o multiqc_report/

Expected Output: Quality reports identifying any issues with sequencing data
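To make the "per sequence quality scores" metric concrete, the following sketch decodes quality strings by hand. It assumes Phred+33 encoding (Illumina 1.8 and later) and 4-line records; `mean_read_quality` is an illustrative name, not part of FastQC. The arithmetic mean of Phred scores is only a rough screen and is not what FastQC reports in every module.

```bash
# Read uncompressed FASTQ on stdin and print "<read name><TAB><mean Phred score>" per read.
# Assumes Phred+33 encoding and 4-line records.
mean_read_quality() {
  awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
    NR % 4 == 1 { name = substr($1, 2) }     # header line: drop the leading "@"
    NR % 4 == 0 {                            # quality line
      n = length($0)
      sum = 0
      for (j = 1; j <= n; j++) sum += ord[substr($0, j, 1)] - 33
      printf "%s\t%.1f\n", name, (n > 0 ? sum / n : 0)
    }
  '
}
```

Example use: `gzip -dc sample_R1.fastq.gz | mean_read_quality | awk '$2 < 20'` lists reads whose mean score is below Q20.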


3. Read Trimming and Filtering

Purpose: Remove low-quality bases, adapters, and contaminating sequences

Tools:

  • Trimmomatic
  • Cutadapt
  • fastp
  • TrimGalore

Operations:

  • Remove adapter sequences
  • Trim low-quality bases (typically Q < 20)
  • Remove reads below minimum length
  • Filter out rRNA contamination (if applicable)

Example (Trimmomatic):

# For paired-end reads (TruSeq3-PE.fa must be in the working directory or given as a full path)
trimmomatic PE -threads 8 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1_trimmed.fastq.gz sample_R1_unpaired.fastq.gz \
  sample_R2_trimmed.fastq.gz sample_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36

Output: Cleaned, high-quality FASTQ files
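With many samples, the same command is usually generated per sample rather than typed by hand. The sketch below only prints the Trimmomatic commands so they can be reviewed first; it assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` as in the example above, and `print_trim_commands` is a hypothetical helper name.

```bash
# Print (do not run) one Trimmomatic command per paired-end sample in a directory.
# $1 = directory with raw FASTQ files, $2 = directory for trimmed output.
print_trim_commands() {
  indir=$1
  outdir=$2
  for r1 in "$indir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue                      # glob matched nothing
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    sample=$(basename "$r1" _R1.fastq.gz)
    if [ ! -e "$r2" ]; then
      echo "WARNING: no R2 mate for $sample, skipping" >&2
      continue
    fi
    echo "trimmomatic PE -threads 8" \
      "$r1 $r2" \
      "$outdir/${sample}_R1_trimmed.fastq.gz $outdir/${sample}_R1_unpaired.fastq.gz" \
      "$outdir/${sample}_R2_trimmed.fastq.gz $outdir/${sample}_R2_unpaired.fastq.gz" \
      "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36"
  done
}
```

Example use: `print_trim_commands raw trimmed > trim_commands.sh`, then inspect the file before running it with `bash trim_commands.sh`.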


4. Read Alignment

Purpose: Map reads to a reference genome or transcriptome

Tools:

  • Splice-aware aligners (for genome alignment):
    • STAR (fast, accurate)
    • HISAT2 (memory efficient)
    • TopHat2 (older, less common now)
  • Transcriptome-based tools:
    • Salmon (quasi-mapping / selective alignment)
    • Kallisto (pseudo-alignment)
    • Bowtie2 (for transcriptome)

STAR Example:

# Build genome index (one-time step)
STAR --runMode genomeGenerate \
  --genomeDir genome_index/ \
  --genomeFastaFiles genome.fa \
  --sjdbGTFfile annotations.gtf \
  --runThreadN 16

# Align reads
STAR --genomeDir genome_index/ \
  --readFilesIn sample_R1_trimmed.fastq.gz sample_R2_trimmed.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix sample_ \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN 16 \
  --quantMode GeneCounts

Output: BAM files (aligned reads) and alignment statistics
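Because the alignment command uses `--quantMode GeneCounts`, STAR also writes a `<prefix>ReadsPerGene.out.tab` file per sample. Its columns are gene ID, unstranded counts, counts for forward-stranded libraries, and counts for reverse-stranded libraries, preceded by four `N_*` summary rows. A possible way to combine these files into one matrix is sketched below; `merge_star_counts` is an illustrative name, and the right column depends on your library prep.

```bash
# Merge STAR ReadsPerGene.out.tab files into a tab-separated count matrix on stdout.
# $1 = column to extract: 2 (unstranded), 3 (forward-stranded), 4 (reverse-stranded).
# Remaining arguments = one ReadsPerGene.out.tab file per sample.
merge_star_counts() {
  col=$1
  shift
  awk -v col="$col" -v OFS='\t' '
    FNR == 1 {
      nfile++
      name = FILENAME
      sub(/.*\//, "", name)                        # drop directory part
      sub(/_?ReadsPerGene\.out\.tab$/, "", name)   # drop STAR suffix
      sample[nfile] = name
    }
    $1 ~ /^N_(unmapped|multimapping|noFeature|ambiguous)$/ { next }  # summary rows
    {
      if (!($1 in seen)) { seen[$1] = 1; genes[++ngene] = $1 }
      count[$1, nfile] = $col
    }
    END {
      line = "gene_id"
      for (f = 1; f <= nfile; f++) line = line OFS sample[f]
      print line
      for (g = 1; g <= ngene; g++) {
        line = genes[g]
        for (f = 1; f <= nfile; f++)
          line = line OFS (((genes[g], f) in count) ? count[genes[g], f] : 0)
        print line
      }
    }
  ' "$@"
}
```

Example use: `merge_star_counts 2 *_ReadsPerGene.out.tab > star_counts.tsv`.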


5. Quality Assessment of Alignment

Purpose: Evaluate alignment quality and library characteristics

Tools:

  • RSeQC
  • Qualimap
  • Picard CollectRNASeqMetrics
  • SAMtools flagstat

Key Metrics:

  • Mapping rate (typically >70%)
  • Uniquely mapped reads
  • Gene body coverage (5' to 3' bias)
  • Insert size distribution (for PE reads)
  • Strand specificity
  • Exon/intron/intergenic distribution

Example:

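# Get mapping statistics
samtools flagstat sample_Aligned.sortedByCoord.out.bam

# Gene body coverage (RSeQC needs an indexed BAM and a BED12 gene model)
samtools index sample_Aligned.sortedByCoord.out.bam
geneBody_coverage.py -i sample_Aligned.sortedByCoord.out.bam \
  -r reference.bed -o sample_geneBodyCoverage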

Expected: >70% alignment rate, uniform gene body coverage
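The 70% guideline can be checked automatically from `samtools flagstat` output. The sketch below parses the "in total" and "mapped" lines of that report; `mapping_rate` is an illustrative name, and the parsing assumes the plain-text flagstat layout, so treat it as a starting point rather than a robust parser.

```bash
# Read `samtools flagstat` text on stdin and print the overall mapping rate (percent).
# Exit status: 0 if rate >= threshold ($1, default 70), 1 if below, 2 if no reads found.
mapping_rate() {
  awk -v min="${1:-70}" '
    / in total /               { total = $1 }
    / mapped \(/ && !/primary/ { mapped = $1 }   # skip the "primary mapped" line
    END {
      if (total == 0) { print "no reads"; exit 2 }
      rate = 100 * mapped / total
      printf "%.2f\n", rate
      exit ((rate >= min) ? 0 : 1)
    }
  '
}
```

Example use: `samtools flagstat sample_Aligned.sortedByCoord.out.bam | mapping_rate 70 || echo "low mapping rate"`.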


6. Quantification

Purpose: Count the number of reads mapping to each gene/transcript

Approaches:

A. Count-based (from BAM files):

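  • featureCounts (fast, accurate)
  • HTSeq-count (older, slower)

# Paired-end data: -p marks the input as paired-end; Subread >= 2.0.2 also needs
# --countReadPairs to count fragments instead of reads (omit it on older versions)
featureCounts -T 8 -p --countReadPairs -t exon -g gene_id \
  -a annotations.gtf \
  -o counts.txt \
  *.bam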

B. Pseudo-alignment (from FASTQ files):

  • Salmon (recommended)
  • Kallisto

# Build transcriptome index (one-time)
salmon index -t transcriptome.fa -i salmon_index

# Quantify
salmon quant -i salmon_index \
  -l A \
  -1 sample_R1_trimmed.fastq.gz \
  -2 sample_R2_trimmed.fastq.gz \
  -o sample_quant \
  --validateMappings

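Output: Count matrix (genes × samples). featureCounts writes gene-level counts directly; Salmon and Kallisto produce transcript-level estimates that are typically summarized to gene level (for example with tximport) before differential expression analysis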


7. Differential Expression Analysis

Purpose: Identify genes that are significantly different between conditions

Tools (R-based):

  • DESeq2 (recommended for count data)
  • edgeR
  • limma-voom

Analysis Steps:

  1. Import count data
  2. Filter low-expressed genes
  3. Normalize for library size and composition
  4. Estimate dispersion
  5. Fit statistical model
  6. Test for differential expression
  7. Adjust p-values for multiple testing (FDR)
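Step 7 is often the least intuitive, so here is the Benjamini-Hochberg adjustment written out by hand. It is only meant to show the arithmetic: p-values are ranked, each is scaled by n/rank, and a running minimum from the largest p-value downward keeps the adjusted values monotonic. DESeq2 does this for you (and additionally applies independent filtering); `bh_adjust` is an illustrative name, and the input is assumed to be a header-less, tab-separated list of IDs and p-values.

```bash
# Benjamini-Hochberg adjustment.
# stdin:  id<TAB>pvalue (no header)
# stdout: id<TAB>pvalue<TAB>adjusted pvalue, sorted by ascending p-value
bh_adjust() {
  sort -t "$(printf '\t')" -k2,2g | awk -F'\t' -v OFS='\t' '
    { id[NR] = $1; p[NR] = $2 }
    END {
      n = NR
      prev = 1
      for (i = n; i >= 1; i--) {          # walk from largest to smallest p-value
        adj = p[i] * n / i                # scale by n / rank
        if (adj > prev) adj = prev        # enforce monotonicity, cap at 1
        prev = adj
        padj[i] = adj
      }
      for (i = 1; i <= n; i++) print id[i], p[i], padj[i]
    }
  '
}
```

For example, three genes with p-values 0.001, 0.01 and 0.5 get adjusted values 0.003, 0.015 and 0.5.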

DESeq2 Example:

library(DESeq2)

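# Load count matrix (featureCounts output: Geneid, five annotation columns, then one column per BAM)
countData <- read.table("counts.txt", header=TRUE, row.names=1)
countData <- countData[, -(1:5)]  # drop Chr, Start, End, Strand, Length
colData <- read.table("sample_info.txt", header=TRUE, row.names=1)
colData$condition <- factor(colData$condition)
# Column names of countData must match row names of colData, in the same order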

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = countData,
                               colData = colData,
                               design = ~ condition)

# Filter low counts
dds <- dds[rowSums(counts(dds)) >= 10, ]

# Run differential expression analysis
dds <- DESeq(dds)

# Get results
res <- results(dds, contrast=c("condition","treated","control"))

# Filter significant genes (padj < 0.05, |log2FC| > 1)
sig_genes <- res[which(res$padj < 0.05 & abs(res$log2FoldChange) > 1), ]

Output: List of differentially expressed genes with statistics (log2 fold change, p-value, adjusted p-value)


8. Functional Enrichment Analysis

Purpose: Understand biological meaning of differentially expressed genes

Analyses:

  • Gene Ontology (GO) enrichment
  • KEGG pathway analysis
  • Gene Set Enrichment Analysis (GSEA)
  • Reactome pathway analysis

Tools:

  • clusterProfiler (R)
  • DAVID
  • Enrichr
  • g:Profiler
  • Metascape

Example (clusterProfiler):

library(clusterProfiler)
library(org.Hs.eg.db)

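# Get gene list (strip Ensembl version suffixes such as ".12", if present)
gene_list <- sub("\\.[0-9]+$", "", rownames(sig_genes))

# GO enrichment
ego <- enrichGO(gene = gene_list,
                OrgDb = org.Hs.eg.db,
                keyType = 'ENSEMBL',
                ont = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.05)

# KEGG pathway (enrichKEGG expects Entrez gene IDs by default, so convert first)
entrez <- bitr(gene_list, fromType = "ENSEMBL", toType = "ENTREZID",
               OrgDb = org.Hs.eg.db)
kegg <- enrichKEGG(gene = entrez$ENTREZID,
                   organism = 'hsa',
                   pvalueCutoff = 0.05)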

Output: Enriched biological pathways and processes


9. Visualization

Purpose: Create informative plots for data interpretation and publication

Common Visualizations:

  1. Quality Control:

    • FastQC plots
    • Mapping statistics bar plots
  2. Exploratory Analysis:

    • PCA plot (sample clustering)
    • Sample correlation heatmap
    • Sample distance heatmap
  3. Differential Expression:

    • MA plot (log2FC vs mean expression)
    • Volcano plot (log2FC vs -log10(p-value))
    • Heatmap of top DEGs
  4. Gene Expression:

    • Normalized counts boxplots
    • Expression profiles across conditions
  5. Functional Analysis:

    • GO/pathway enrichment dot plots
    • Network diagrams

Example Visualizations (R):

# PCA plot
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd, intgroup="condition")

# Volcano plot
library(EnhancedVolcano)
EnhancedVolcano(res,
    lab = rownames(res),
    x = 'log2FoldChange',
    y = 'pvalue',
    pCutoff = 0.05,
    FCcutoff = 1)

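# Heatmap of top genes (50 smallest adjusted p-values)
library(pheatmap)
top_genes <- head(order(res$padj), 50)
# drop=FALSE keeps the sample names as row names so pheatmap can match them to columns
pheatmap(assay(vsd)[top_genes, ],
         cluster_rows=TRUE,
         show_rownames=TRUE,
         cluster_cols=TRUE,
         annotation_col=as.data.frame(colData(dds)[, "condition", drop=FALSE]))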

Complete Workflow Summary

Raw FASTQ files
    ↓
Quality Control (FastQC)
    ↓
Read Trimming (Trimmomatic/fastp)
    ↓
Alignment (STAR/HISAT2) or Pseudo-alignment (Salmon/Kallisto)
    ↓
Alignment QC (RSeQC/Qualimap)
    ↓
Quantification (featureCounts/Salmon)
    ↓
Count Matrix
    ↓
Differential Expression Analysis (DESeq2/edgeR)
    ↓
DEG List
    ↓
Functional Enrichment (clusterProfiler/DAVID)
    ↓
Visualization & Interpretation
    ↓
Biological Insights

Additional Considerations

Data Management

  • Keep organized directory structure
  • Document all software versions
  • Save intermediate files
  • Use version control (Git)
  • Implement workflow managers (Snakemake, Nextflow)
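Documenting software versions is easy to automate. The sketch below appends one line per tool to a log file; `record_versions` is an illustrative name, and it assumes each tool prints its version with `--version`, which holds for many but not all tools (featureCounts uses `-v`, for instance), so adjust as needed.

```bash
# Append "<tool><TAB><first line of version output>" to a log file for each tool given.
# $1 = log file, remaining arguments = executable names.
# Assumes the tools accept --version; tools not on PATH are recorded as NOT FOUND.
record_versions() {
  log=$1
  shift
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      ver=$("$tool" --version 2>&1 | head -n 1)
    else
      ver="NOT FOUND"
    fi
    printf '%s\t%s\n' "$tool" "$ver" >> "$log"
  done
}
```

Example use: `record_versions versions.tsv fastqc multiqc STAR salmon samtools`.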

Statistical Considerations

  • Batch effects correction (if needed)
  • Normalization methods (TPM, FPKM, TMM, etc.)
  • Multiple testing correction
  • Biological vs. technical replicates
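As an illustration of one of the normalization methods listed above, TPM divides each gene's count by its length and then rescales so the values sum to one million within a sample. The sketch below computes it for a single sample; `tpm` is an illustrative name, and using annotated gene length (rather than effective length, as Salmon does) is a simplifying assumption. TPM is useful for comparing expression within a sample, while DESeq2 and edgeR expect raw counts.

```bash
# Compute TPM for one sample.
# stdin:  gene<TAB>length_bp<TAB>count (no header)
# stdout: gene<TAB>TPM
tpm() {
  awk -F'\t' '
    $2 > 0 { gene[++n] = $1; rate[n] = $3 / $2; total += rate[n] }   # reads per base
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%.2f\n", gene[i], (total > 0 ? rate[i] / total * 1e6 : 0)
    }
  '
}
```

Example use with featureCounts output, where column 6 is Length and column 7 is the first sample: `tail -n +3 counts.txt | cut -f1,6,7 | tpm`.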

Advanced Analyses

  • Alternative splicing detection
  • Novel transcript discovery
  • Fusion gene detection
  • Non-coding RNA analysis
  • Single-cell RNA-seq

Resources

Reference Genomes

  • GENCODE: Human and mouse annotations
  • Ensembl: Multi-species genomes
  • UCSC Genome Browser: Genome assemblies
  • NCBI RefSeq: Reference sequences

Useful Links

Software Requirements

Essential tools:

# Quality control
fastqc
multiqc

# Trimming
trimmomatic
fastp

# Alignment
STAR
HISAT2
salmon
kallisto

# Quantification
subread (featureCounts)

# Utilities
samtools
bedtools

# Statistical analysis (R packages)
DESeq2
edgeR
limma
clusterProfiler

Citation

If you use this workflow, please cite the appropriate tools used in your analysis. Each tool has its own publication that should be referenced.


Last Updated: 2025-11-01
