ONCOgenic Regulatory Element discovery
ONCORE is a generalized cancer bioinformatics pipeline that adapts the enhancer analysis workflow from our recent study on SOX10 regulatory elements in melanoma (PMID: 38699319) to identify regulatory elements and potential therapeutic targets across diverse cancer types. The pipeline integrates chromatin accessibility, evolutionary conservation, and transcription factor binding data to systematically prioritize candidate noncoding regulatory elements relevant to oncogenic processes.
- Target Gene/Locus: Gene of interest (e.g., SOX10, TP53, MYC, EGFR)
- Cancer Type: Melanoma, breast, lung, colorectal, etc.
- Analysis Window: Genomic region size (default: ±50kb from TSS)
- Cell Line Panel: Cancer-relevant cell lines drawn from public databases (e.g., DepMap)
Input: Cancer type, gene expression requirements
Process:
- Query the DepMap CERES Gene Effect dataset (CRISPR gene-dependency scores)
- Filter cell lines by:
  - Cancer type/subtype
  - Target gene expression levels
  - Data availability (ChIP-seq, RNA-seq)
- Select 5-10 representative lines
Output: Curated cell line list with metadata
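The filtering step above can be sketched in plain Python. This is a minimal illustration, not the pipeline's implementation: the `CellLine` record, its field names, the TPM cutoff, and the `LINE_*` entries are all hypothetical, and real metadata would come from DepMap or a similar source.

```python
from dataclasses import dataclass, field


@dataclass
class CellLine:
    # Hypothetical record layout; field names are assumptions for this sketch.
    name: str
    cancer_type: str
    target_tpm: float  # expression of the target gene in TPM
    assays: set = field(default_factory=set)


def select_cell_lines(lines, cancer_type, min_tpm, required_assays, max_lines=10):
    """Keep lines that match the cancer type, pass the expression cutoff,
    and have every required assay; return at most max_lines of them."""
    required = set(required_assays)
    kept = [
        cl for cl in lines
        if cl.cancer_type == cancer_type
        and cl.target_tpm >= min_tpm
        and required <= cl.assays
    ]
    # Rank by target gene expression, highest first, then cap the panel size.
    kept.sort(key=lambda cl: cl.target_tpm, reverse=True)
    return kept[:max_lines]


# Made-up panel for illustration only.
panel = [
    CellLine("LINE_A", "melanoma", 120.0, {"ChIP-seq", "RNA-seq"}),
    CellLine("LINE_B", "melanoma", 3.0, {"ChIP-seq", "RNA-seq"}),
    CellLine("LINE_C", "melanoma", 85.0, {"RNA-seq"}),
    CellLine("LINE_D", "breast", 200.0, {"ChIP-seq", "RNA-seq"}),
    CellLine("LINE_E", "melanoma", 60.0, {"ChIP-seq", "RNA-seq", "ATAC-seq"}),
]
selected = select_cell_lines(
    panel, "melanoma", min_tpm=10.0, required_assays=["ChIP-seq", "RNA-seq"]
)
print([cl.name for cl in selected])
```

Ranking by expression is only one possible way to pick "representative" lines; subtype balance or data quality could be used instead.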
Input: Target gene, genome build (hg38/hg19)
Process:
- Retrieve gene coordinates from Ensembl/UCSC
- Define analysis window (default ±50kb from TSS)
- Create GRanges object for downstream analysis
Output: Genomic coordinates for analysis
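As a rough illustration of the window definition, the sketch below computes a strand-aware ±50kb window around the TSS. It is a plain-Python stand-in for the GRanges step, and the `Window` type, the `tss_window` helper, and the `chrN` coordinates are hypothetical. Coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class Window(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str


def tss_window(chrom, gene_start, gene_end, strand, flank=50_000, chrom_length=None):
    """Return the window of +/- flank bp around the TSS.

    The TSS is the gene start on the + strand and the gene end on the - strand.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    tss = gene_start if strand == "+" else gene_end
    start = max(1, tss - flank)  # do not run off the start of the chromosome
    end = tss + flank
    if chrom_length is not None:
        end = min(end, chrom_length)  # clamp only when the length is known
    return Window(chrom, start, end, strand)


# Placeholder coordinates, not a real gene.
print(tss_window("chrN", 1_000_000, 1_020_000, "-"))
print(tss_window("chrN", 20_000, 40_000, "+"))
```

A gene with several annotated transcripts has several TSSs; this sketch assumes a single representative one has already been chosen.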
Input: Cell line list, genomic coordinates
Data Sources:
- ENCODE Project
- Roadmap Epigenomics
- Cancer-specific databases (TCGA, ICGC)
Histone Marks Analyzed:
- H3K27ac (active enhancers)
- H3K4me1 (enhancer priming)
- H3K4me3 (active promoters)
- H3K27me3 (repressive marks)
Process:
- Download relevant ChIP-seq tracks
- Normalize signal across cell lines
- Identify consensus peaks (present in ≥60% of lines)
Output: Candidate regulatory regions
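One way to read the "present in ≥60% of lines" rule is as a coverage threshold: keep every stretch of the locus covered by peaks from at least 60% of the cell lines. The sketch below implements that reading with a sweep over peak boundaries. It assumes half-open `(start, end)` intervals on a single chromosome and non-overlapping peaks within each line; the function name and the example data are hypothetical.

```python
import math


def consensus_regions(peaks_by_line, min_fraction=0.6):
    """Return regions covered by peaks from at least min_fraction of the lines.

    peaks_by_line maps a cell line name to a list of half-open (start, end)
    peaks on one chromosome. Peaks within a line must not overlap each other,
    otherwise a single line would be counted more than once.
    """
    n_required = math.ceil(min_fraction * len(peaks_by_line))
    events = []
    for peaks in peaks_by_line.values():
        for start, end in peaks:
            events.append((start, 1))   # a peak opens
            events.append((end, -1))    # a peak closes
    # At equal positions, closes (-1) sort before opens (+1), so peaks that
    # merely touch do not count as overlapping.
    events.sort()
    regions, depth, region_start = [], 0, None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < n_required <= depth:
            region_start = pos
        elif depth < n_required <= previous:
            if regions and regions[-1][1] == region_start:
                # Merge with the previous region when the two abut.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((region_start, pos))
    return regions


# Made-up peaks for five hypothetical lines; 60% of 5 means at least 3 lines.
peaks = {
    "LINE_A": [(100, 200), (500, 600)],
    "LINE_B": [(150, 250)],
    "LINE_C": [(180, 300), (550, 650)],
    "LINE_D": [(400, 450)],
    "LINE_E": [(190, 220), (560, 580)],
}
print(consensus_regions(peaks))
```

Other definitions are possible, for example merging all peaks first and then counting how many lines overlap each merged region; the two approaches can give different boundaries.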
Input: Cell line list, genomic coordinates
Data Sources:
- ATAC-seq datasets
- DNase-seq from ENCODE
Process:
- Integrate accessibility data across cell lines
- Identify open chromatin regions
- Correlate with histone modifications
Output: Accessible regulatory elements
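Relating open chromatin to histone marks can start with a simple interval intersection, for example between ATAC-seq regions and H3K27ac regions. The sketch below is one possible approach, assuming both inputs are sorted, non-overlapping, half-open intervals on the same chromosome. The function name and coordinates are illustrative.

```python
def intersect(a_regions, b_regions):
    """Intersect two sorted, non-overlapping lists of half-open (start, end)
    intervals from the same chromosome using a two-pointer walk."""
    out, i, j = [], 0, 0
    while i < len(a_regions) and j < len(b_regions):
        start = max(a_regions[i][0], b_regions[j][0])
        end = min(a_regions[i][1], b_regions[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a_regions[i][1] < b_regions[j][1]:
            i += 1
        else:
            j += 1
    return out


# Made-up regions for illustration.
atac = [(100, 200), (400, 500), (700, 800)]
h3k27ac = [(150, 450), (750, 900)]
print(intersect(atac, h3k27ac))
```

In practice the same operation is available through bedtools or GenomicRanges, both of which the pipeline already lists as dependencies.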
Input: Candidate regulatory regions
Conservation Databases:
- PhastCons (vertebrate conservation)
- PhyloP (per-base conservation or acceleration)
- GERP++ (constraint scores)
Process:
- Retrieve conservation scores for all regions
- Calculate the average score per 1kb tile
- Define high-conservation threshold (>0.75 PhastCons)
Filter: Retain regions with high evolutionary constraint
Output: Evolutionarily conserved regulatory elements
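The tiling and thresholding described above might look like the following sketch. It assumes per-base PhastCons scores are already loaded into a list whose index 0 corresponds to the first base of the analysis window; how missing scores are handled is left out. The helper names and the score vector are hypothetical.

```python
def tile_means(scores, tile_size=1000):
    """Split per-base scores into consecutive tiles and return
    (tile_start, tile_end, mean_score) with half-open offsets."""
    tiles = []
    for start in range(0, len(scores), tile_size):
        chunk = scores[start:start + tile_size]
        # The last tile may be shorter than tile_size.
        tiles.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    return tiles


def conserved_tiles(scores, threshold=0.75, tile_size=1000):
    """Keep tiles whose mean score is strictly above the threshold."""
    return [t for t in tile_means(scores, tile_size) if t[2] > threshold]


# Made-up score vector: a conserved tile, a poorly conserved tile, a partial tile.
scores = [0.9] * 1000 + [0.2] * 1000 + [0.8] * 500
for start, end, mean in conserved_tiles(scores):
    print(start, end, round(mean, 3))
```

Averaging over a full tile can hide a short, highly conserved core inside an otherwise unconserved kilobase, so a smaller tile size or a sliding window may be worth testing.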
Input: High-conservation regions
Process:
- Map orthologous regions in model organisms
- Check conservation across multiple species
- Validate regulatory potential using comparative data
Output: Cross-species validated elements
Input: Cancer type, target gene
Process:
- Query cancer-specific TF databases:
  - TF2Cancer
  - NetworkAnalyst
  - ChEA3
- Identify key oncogenes/tumor suppressors
- Retrieve binding motifs from JASPAR/HOCOMOCO
Output: Priority TF list with PWMs
Input: Conserved regions, TF motifs
Tools: motifmatchr, FIMO, TFBSTools
Process:
- Scan sequences for TF binding sites
- Apply stringent scoring thresholds
- Predict cooperative binding events
- Integrate with ChIP-seq peaks when available
Output: Predicted TF binding sites
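To make the scanning step concrete, the sketch below scores every window of a sequence against a log-odds matrix on both strands. It illustrates the general technique only and is not how motifmatchr, FIMO, or TFBSTools are implemented; the toy count matrix, the uniform background, the pseudocount, and the score cutoff of 8.0 are all assumptions.

```python
import math


def pfm_to_log_odds(pfm, pseudocount=0.5, background=0.25):
    """Convert a position frequency matrix (one dict of base counts per
    position) into log2 odds against a uniform background."""
    pwm = []
    for column in pfm:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(seq, pwm, min_score):
    """Return (start, strand, score) for windows scoring at least min_score.

    start is the 0-based offset of the site on the forward strand.
    """
    width, hits = len(pwm), []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            if any(base not in "ACGT" for base in window):
                continue  # skip windows containing N or other symbols
            score = sum(pwm[j][base] for j, base in enumerate(window))
            if score >= min_score:
                start = i if strand == "+" else len(seq) - i - width
                hits.append((start, strand, round(score, 2)))
    return hits


# Toy matrix for the 6-mer ACAAAG; not taken from JASPAR or HOCOMOCO.
toy_pfm = [{"A": 10}, {"C": 10}, {"A": 10}, {"A": 10}, {"A": 10}, {"G": 10}]
pwm = pfm_to_log_odds(toy_pfm)
hits = scan("TTACAAAGGCCTTTGTAA", pwm, min_score=8.0)
print(hits)
```

Dedicated tools convert scores to p-values against a background model, which makes thresholds comparable across motifs of different lengths; a fixed raw-score cutoff like the one here does not.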
Input: Regulatory elements, expression data
Data Sources:
- TCGA RNA-seq
- GTEx normal tissues
- Cell line expression panels
Process:
- Correlate enhancer activity with target gene expression
- Perform tissue-specific analysis
- Identify cancer-specific regulatory relationships
Output: Functionally relevant elements
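The correlation step can be illustrated with a Pearson coefficient between an enhancer's activity signal and target gene expression across samples. This is a minimal sketch using made-up numbers; the choice of Pearson over a rank-based measure such as Spearman is an assumption, and the signal units are arbitrary.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)


# Made-up values: H3K27ac signal at one enhancer and target gene expression
# across five hypothetical samples.
h3k27ac_signal = [1.0, 2.0, 3.0, 4.0, 5.0]
target_expression = [2.1, 3.9, 6.2, 8.0, 9.8]
r = pearson(h3k27ac_signal, target_expression)
print(round(r, 3))
```

With only a handful of cell lines, a high coefficient is weak evidence on its own, which is one reason the pipeline pairs this step with 3D contact data.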
Input: Regulatory elements, target gene
Data Sources:
- Hi-C datasets
- ChIA-PET data
- 4C-seq experiments
Process:
- Map chromatin interactions
- Validate enhancer-promoter contacts
- Identify tissue-specific interactions
Output: Validated enhancer-gene pairs
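A simple way to check enhancer-promoter contacts against loop calls is to require one loop anchor to overlap the promoter and the other to overlap the candidate element. The sketch below assumes loops are given as pairs of `(chrom, start, end)` anchors with half-open coordinates; the function names, the element names, and all coordinates are hypothetical.

```python
def overlaps(a, b):
    """True when two half-open (chrom, start, end) regions overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def enhancer_promoter_pairs(loops, enhancers, promoter):
    """Return names of enhancers linked to the promoter by at least one loop.

    loops is a list of (anchor_a, anchor_b) pairs; enhancers maps a name to
    a region. Either anchor may be the promoter side.
    """
    supported = []
    for anchor_a, anchor_b in loops:
        for near, far in ((anchor_a, anchor_b), (anchor_b, anchor_a)):
            if not overlaps(near, promoter):
                continue
            for name, region in enhancers.items():
                if overlaps(far, region) and name not in supported:
                    supported.append(name)
    return supported


# Placeholder coordinates for illustration.
promoter = ("chrN", 1_019_000, 1_021_000)
enhancers = {
    "E1": ("chrN", 990_000, 991_000),
    "E2": ("chrN", 1_060_000, 1_061_000),
}
loops = [
    (("chrN", 985_000, 995_000), ("chrN", 1_015_000, 1_025_000)),
    (("chrN", 1_055_000, 1_065_000), ("chrN", 1_100_000, 1_110_000)),
]
print(enhancer_promoter_pairs(loops, enhancers, promoter))
```

Loop anchors are often several kilobases wide, so an overlap supports a contact with the neighborhood of an element rather than the element itself.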
Input: Regulatory elements
Data Sources:
- COSMIC database
- TCGA mutation data
- ClinVar annotations
Process:
- Identify mutations in regulatory regions
- Assess mutation frequency across cancer types
- Predict functional impact
Output: Clinically relevant regulatory variants
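Mutation frequency per regulatory element can be estimated as the fraction of samples carrying at least one mutation inside the element. The sketch below assumes mutations are given as `(sample_id, chrom, pos)` with 0-based positions and elements as half-open intervals; the names and data are made up, and no real COSMIC or TCGA records are used.

```python
from collections import defaultdict


def mutation_frequency(elements, mutations, n_samples):
    """Fraction of samples with at least one mutation in each element.

    elements maps a name to (chrom, start, end), half-open and 0-based.
    mutations is an iterable of (sample_id, chrom, pos).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    mutated_samples = defaultdict(set)
    for sample, chrom, pos in mutations:
        for name, (e_chrom, start, end) in elements.items():
            if chrom == e_chrom and start <= pos < end:
                # A set counts each sample once, however many hits it has.
                mutated_samples[name].add(sample)
    return {name: len(mutated_samples[name]) / n_samples for name in elements}


# Made-up elements and mutation calls across four hypothetical samples.
elements = {"E1": ("chrN", 100, 200), "E2": ("chrN", 500, 600)}
mutations = [
    ("S1", "chrN", 150),
    ("S1", "chrN", 160),
    ("S2", "chrN", 199),
    ("S3", "chrN", 200),  # just outside E1 (end is exclusive)
    ("S4", "chrM", 150),  # different chromosome
]
freq = mutation_frequency(elements, mutations, n_samples=4)
print(freq)
```

The nested loop is fine for a single locus; for genome-wide use an interval index (for example via pybedtools, which the pipeline lists) would be more appropriate.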
Input: Key regulatory elements, bound TFs
Process:
- Identify druggable transcription factors
- Query drug databases (DrugBank, ChEMBL)
- Assess therapeutic potential
- Prioritize based on cancer specificity
Output: Ranked therapeutic targets
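The prioritization could be expressed as a weighted score over a few normalized criteria. The sketch below is purely illustrative: the criteria names, the 0 to 1 scaling, the weights, and the `TF_*` candidates are assumptions, not values defined by the pipeline.

```python
# Hypothetical weights; they should sum to 1 so scores stay in the 0-1 range.
DEFAULT_WEIGHTS = {"druggability": 0.4, "specificity": 0.4, "element_support": 0.2}


def target_score(candidate, weights=DEFAULT_WEIGHTS):
    """Weighted sum of criteria, each expected to be scaled to 0-1."""
    return sum(weight * candidate[key] for key, weight in weights.items())


def rank_targets(candidates, weights=DEFAULT_WEIGHTS):
    """Sort candidate TFs from highest to lowest weighted score."""
    return sorted(candidates, key=lambda c: target_score(c, weights), reverse=True)


# Made-up candidates for illustration.
candidates = [
    {"name": "TF_C", "druggability": 0.0, "specificity": 0.3, "element_support": 1.0},
    {"name": "TF_A", "druggability": 1.0, "specificity": 0.5, "element_support": 0.5},
    {"name": "TF_B", "druggability": 0.2, "specificity": 0.9, "element_support": 0.9},
]
ranked = rank_targets(candidates)
print([c["name"] for c in ranked])
```

Any such ranking is sensitive to the weights, so it is worth reporting how the order changes under a few alternative weightings.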
R/Bioconductor Packages:
- GenomicRanges, rtracklayer
- ChIPseeker, DiffBind
- motifmatchr, TFBSTools
- BSgenome packages
- InteractionSet (for 3D data)
Python Libraries:
- pybedtools, pysam
- scikit-learn (for ML models)
- matplotlib, seaborn (visualization)
External Tools:
- bedtools, samtools
- MEME Suite
- deepTools
- HOMER
Directory Structure:
/cancer_pipeline/
├── data/
│   ├── chip_seq/
│   ├── expression/
│   ├── conservation/
│   └── mutations/
├── scripts/
├── results/
│   ├── [cancer_type]/
│   │   ├── regulatory_elements/
│   │   ├── tf_binding/
│   │   └── clinical_relevance/
└── reports/
- Genomic coordinates and classifications
- Conservation scores and cross-species mapping
- Chromatin state annotations
- Cell line-specific activity profiles
- Cancer-specific TF binding predictions
- Cooperative binding analysis
- Network topology and key regulators
- Mutation landscape in regulatory regions
- Expression correlation analysis
- Therapeutic target prioritization
- Biomarker potential assessment
- Genome browser tracks
- Network diagrams
- Correlation heatmaps
- Clinical relevance plots
Breast Cancer:
- Key TFs: ESR1, FOXA1, GATA3, AP-1
- Cell Lines: MCF7, T47D, MDA-MB-231, SK-BR-3
- Specific Features: Hormone receptor status, HER2 amplification
Lung Cancer:
- Key Drivers and TFs: EGFR, KRAS (signaling drivers); TP53, MYC (TFs)
- Cell Lines: A549, H1299, H460, PC9
- Specific Features: Smoking signatures, driver mutations
Colorectal Cancer:
- Key Drivers and TFs: APC, KRAS (pathway drivers); CTNNB1, TP53 (transcriptional regulators)
- Cell Lines: HCT116, SW480, Caco2, HT29
- Specific Features: Wnt pathway alterations, microsatellite status
- Replicate correlation analysis
- Cross-platform validation
- Positive/negative control regions
- Literature mining for known elements
- Functional validation suggestions
- Model organism comparisons
- Multiple testing correction
- Power analysis
- Confidence interval reporting
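For the multiple testing correction, one common choice is the Benjamini-Hochberg procedure, which controls the false discovery rate. The text does not say which method the pipeline uses, so treating it as Benjamini-Hochberg is an assumption; the p-values below are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values, e.g. one per tested enhancer-gene correlation.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
print([round(q, 4) for q in qvals])
```

Elements with an adjusted value below the chosen FDR level (0.05 is a frequent default) would be carried forward.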
- Memory: 32-64 GB RAM for large datasets
- Storage: 1-5 TB depending on scope
- Processing: Multi-core CPU, optional GPU acceleration
- Cell line-specific processing
- Chromosome-wise analysis
- Distributed computing for large cohorts
This pipeline provides a systematic framework for discovering and characterizing regulatory elements across cancer types, supporting both basic research and the prioritization of candidate elements and targets for clinical follow-up.