Skip to content

sdegeorgia/ONCORE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ONCORE

ONCOgenic Regulatory Element discovery

Overview

ONCORE is a generalized cancer bioinformatics pipeline that adapts the enhancer analysis workflow from our recent study on SOX10 regulatory elements in melanoma (PMID: 38699319) to identify regulatory elements and potential therapeutic targets across diverse cancer types. The pipeline integrates chromatin accessibility, evolutionary conservation, and transcription factor binding data to systematically prioritize candidate noncoding regulatory elements relevant to oncogenic processes.

Pipeline Architecture

Input Parameters

  • Target Gene/Locus: Gene of interest (e.g., SOX10, TP53, MYC, EGFR)
  • Cancer Type: Melanoma, breast, lung, colorectal, etc.
  • Analysis Window: Genomic region size (default: ±50kb from TSS)
  • Cell Line Panel: Cancer-relevant cell lines from databases

Stage 1: Data Collection and Preprocessing

1.1 Cell Line Selection

Input: Cancer type, gene expression requirements
Process:
- Query DepMap CERES Gene Effect dataset
- Filter cell lines by:
  - Cancer type/subtype
  - Target gene expression levels
  - Data availability (ChIP-seq, RNA-seq)
- Select 5-10 representative lines
Output: Curated cell line list with metadata

1.2 Genomic Region Definition

Input: Target gene, genome build (hg38/hg19)
Process:
- Retrieve gene coordinates from Ensembl/UCSC
- Define analysis window (default ±50kb from TSS)
- Create GRanges object for downstream analysis
Output: Genomic coordinates for analysis

Stage 2: Chromatin Landscape Analysis

2.1 ChIP-seq Data Integration

Input: Cell line list, genomic coordinates
Data Sources:
- ENCODE Project
- Roadmap Epigenomics
- Cancer-specific databases (TCGA, ICGC)
Histone Marks Analyzed:
- H3K27ac (active enhancers)
- H3K4me1 (enhancer priming)
- H3K4me3 (active promoters)
- H3K27me3 (repressive marks)
Process:
- Download relevant ChIP-seq tracks
- Normalize signal across cell lines
- Identify consensus peaks (present in ≥60% of lines)
Output: Candidate regulatory regions

2.2 Chromatin Accessibility Analysis

Input: Cell line list, genomic coordinates
Data Sources:
- ATAC-seq datasets
- DNase-seq from ENCODE
Process:
- Integrate accessibility data across cell lines
- Identify open chromatin regions
- Correlate with histone modifications
Output: Accessible regulatory elements

Stage 3: Evolutionary Conservation Analysis

3.1 Conservation Scoring

Input: Candidate regulatory regions
Conservation Databases:
- PhastCons (vertebrate conservation)
- PhyloP (evolutionary pressure)
- GERP++ (constraint scores)
Process:
- Retrieve conservation scores for all regions
- Calculate average scores per 1kb tiles
- Define high-conservation threshold (>0.75 PhastCons)
Filter: Retain regions with high evolutionary constraint
Output: Evolutionarily conserved regulatory elements

3.2 Cross-Species Validation

Input: High-conservation regions
Process:
- Map orthologous regions in model organisms
- Check conservation across multiple species
- Validate regulatory potential using comparative data
Output: Cross-species validated elements

Stage 4: Transcription Factor Binding Analysis

4.1 Cancer-Specific TF Identification

Input: Cancer type, target gene
Process:
- Query cancer-specific TF databases:
  - TF2Cancer
  - NetworkAnalyst
  - ChEA3
- Identify key oncogenes/tumor suppressors
- Retrieve binding motifs from JASPAR/HOCOMOCO
Output: Priority TF list with PWMs

4.2 Motif Scanning and Binding Prediction

Input: Conserved regions, TF motifs
Tools: motifmatchr, FIMO, TFBSTools
Process:
- Scan sequences for TF binding sites
- Apply stringent scoring thresholds
- Predict cooperative binding events
- Integrate with ChIP-seq peaks when available
Output: Predicted TF binding sites

Stage 5: Functional Validation and Prioritization

5.1 Expression Correlation Analysis

Input: Regulatory elements, expression data
Data Sources:
- TCGA RNA-seq
- GTEx normal tissues
- Cell line expression panels
Process:
- Correlate enhancer activity with target gene expression
- Perform tissue-specific analysis
- Identify cancer-specific regulatory relationships
Output: Functionally relevant elements

5.2 3D Chromatin Interaction Analysis

Input: Regulatory elements, target gene
Data Sources:
- Hi-C datasets
- ChIA-PET data
- 4C-seq experiments
Process:
- Map chromatin interactions
- Validate enhancer-promoter contacts
- Identify tissue-specific interactions
Output: Validated enhancer-gene pairs

Stage 6: Clinical Relevance Assessment

6.1 Mutation Analysis

Input: Regulatory elements
Data Sources:
- COSMIC database
- TCGA mutation data
- ClinVar annotations
Process:
- Identify mutations in regulatory regions
- Assess mutation frequency across cancer types
- Predict functional impact
Output: Clinically relevant regulatory variants

6.2 Therapeutic Target Identification

Input: Key regulatory elements, bound TFs
Process:
- Identify druggable transcription factors
- Query drug databases (DrugBank, ChEMBL)
- Assess therapeutic potential
- Prioritize based on cancer specificity
Output: Ranked therapeutic targets

Implementation Framework

Required Software Stack

R/Bioconductor Packages:
- GenomicRanges, rtracklayer
- ChIPseeker, DiffBind
- motifmatchr, TFBSTools
- BSgenome packages
- InteractionSet (for 3D data)

Python Libraries:
- pybedtools, pysam
- scikit-learn (for ML models)
- matplotlib, seaborn (visualization)

External Tools:
- bedtools, samtools
- MEME Suite
- deepTools
- HOMER

Data Management

Directory Structure:
/cancer_pipeline/
├── data/
│   ├── chip_seq/
│   ├── expression/
│   ├── conservation/
│   └── mutations/
├── scripts/
├── results/
│   ├── [cancer_type]/
│   │   ├── regulatory_elements/
│   │   ├── tf_binding/
│   │   └── clinical_relevance/
└── reports/

Output Deliverables

1. Regulatory Element Catalog

  • Genomic coordinates and classifications
  • Conservation scores and cross-species mapping
  • Chromatin state annotations
  • Cell line-specific activity profiles

2. Transcription Factor Network

  • Cancer-specific TF binding predictions
  • Cooperative binding analysis
  • Network topology and key regulators

3. Clinical Annotation Report

  • Mutation landscape in regulatory regions
  • Expression correlation analysis
  • Therapeutic target prioritization
  • Biomarker potential assessment

4. Interactive Visualizations

  • Genome browser tracks
  • Network diagrams
  • Correlation heatmaps
  • Clinical relevance plots

Cancer-Specific Adaptations

Breast Cancer Example

  • Key TFs: ESR1, FOXA1, GATA3, AP1
  • Cell Lines: MCF7, T47D, MDA-MB-231, SK-BR-3
  • Specific Features: Hormone receptor status, HER2 amplification

Lung Cancer Example

  • Key TFs: EGFR, KRAS, TP53, MYC
  • Cell Lines: A549, H1299, H460, PC9
  • Specific Features: Smoking signatures, driver mutations

Colorectal Cancer Example

  • Key TFs: APC, CTNNB1, TP53, KRAS
  • Cell Lines: HCT116, SW480, Caco2, HT29
  • Specific Features: Wnt pathway alterations, microsatellite status

Quality Control and Validation

1. Technical Validation

  • Replicate correlation analysis
  • Cross-platform validation
  • Positive/negative control regions

2. Biological Validation

  • Literature mining for known elements
  • Functional validation suggestions
  • Model organism comparisons

3. Statistical Rigor

  • Multiple testing correction
  • Power analysis
  • Confidence interval reporting

Scalability Considerations

Computational Requirements

  • Memory: 32-64 GB RAM for large datasets
  • Storage: 1-5 TB depending on scope
  • Processing: Multi-core CPU, optional GPU acceleration

Parallelization Strategy

  • Cell line-specific processing
  • Chromosome-wise analysis
  • Distributed computing for large cohorts

This pipeline provides a systematic framework for discovering and characterizing regulatory elements across different cancer types, enabling both basic research insights and clinical applications.