ONCORE

ONCOgenic Regulatory Element discovery

Overview

ONCORE is a generalized cancer bioinformatics pipeline that adapts the enhancer analysis workflow from our recent study on SOX10 regulatory elements in melanoma (PMID: 38699319) to identify regulatory elements and potential therapeutic targets across diverse cancer types. The pipeline integrates chromatin accessibility, evolutionary conservation, and transcription factor binding data to systematically prioritize candidate noncoding regulatory elements relevant to oncogenic processes.

Pipeline Architecture

Input Parameters

Target Gene/Locus: Gene of interest (e.g., SOX10, TP53, MYC, EGFR)
Cancer Type: Melanoma, breast, lung, colorectal, etc.
Analysis Window: Genomic region size (default: ±50kb from TSS)
Cell Line Panel: Cancer-relevant cell lines from databases

Stage 1: Data Collection and Preprocessing

1.1 Cell Line Selection

Input: Cancer type, gene expression requirements
Process:
- Query DepMap CERES Gene Effect dataset
- Filter cell lines by:
  - Cancer type/subtype
  - Target gene expression levels
  - Data availability (ChIP-seq, RNA-seq)
- Select 5-10 representative lines
Output: Curated cell line list with metadata

1.2 Genomic Region Definition

Input: Target gene, genome build (hg38/hg19)
Process:
- Retrieve gene coordinates from Ensembl/UCSC
- Define analysis window (default ±50kb from TSS)
- Create GRanges object for downstream analysis
Output: Genomic coordinates for analysis

Stage 2: Chromatin Landscape Analysis

2.1 ChIP-seq Data Integration

Input: Cell line list, genomic coordinates
Data Sources:
- ENCODE Project
- Roadmap Epigenomics
- Cancer-specific databases (TCGA, ICGC)
Histone Marks Analyzed:
- H3K27ac (active enhancers)
- H3K4me1 (enhancer priming)
- H3K4me3 (active promoters)
- H3K27me3 (repressive marks)
Process:
- Download relevant ChIP-seq tracks
- Normalize signal across cell lines
- Identify consensus peaks (present in ≥60% of lines)
Output: Candidate regulatory regions

2.2 Chromatin Accessibility Analysis

Input: Cell line list, genomic coordinates
Data Sources:
- ATAC-seq datasets
- DNase-seq from ENCODE
Process:
- Integrate accessibility data across cell lines
- Identify open chromatin regions
- Correlate with histone modifications
Output: Accessible regulatory elements

Stage 3: Evolutionary Conservation Analysis

3.1 Conservation Scoring

Input: Candidate regulatory regions
Conservation Databases:
- PhastCons (vertebrate conservation)
- PhyloP (evolutionary pressure)
- GERP++ (constraint scores)
Process:
- Retrieve conservation scores for all regions
- Calculate average scores per 1kb tiles
- Define high-conservation threshold (>0.75 PhastCons)
Filter: Retain regions with high evolutionary constraint
Output: Evolutionarily conserved regulatory elements

3.2 Cross-Species Validation

Input: High-conservation regions
Process:
- Map orthologous regions in model organisms
- Check conservation across multiple species
- Validate regulatory potential using comparative data
Output: Cross-species validated elements

Stage 4: Transcription Factor Binding Analysis

4.1 Cancer-Specific TF Identification

Input: Cancer type, target gene
Process:
- Query cancer-specific TF databases:
  - TF2Cancer
  - NetworkAnalyst
  - ChEA3
- Identify key oncogenes/tumor suppressors
- Retrieve binding motifs from JASPAR/HOCOMOCO
Output: Priority TF list with PWMs

4.2 Motif Scanning and Binding Prediction

Input: Conserved regions, TF motifs
Tools: motifmatchr, FIMO, TFBSTools
Process:
- Scan sequences for TF binding sites
- Apply stringent scoring thresholds
- Predict cooperative binding events
- Integrate with ChIP-seq peaks when available
Output: Predicted TF binding sites

Stage 5: Functional Validation and Prioritization

5.1 Expression Correlation Analysis

Input: Regulatory elements, expression data
Data Sources:
- TCGA RNA-seq
- GTEx normal tissues
- Cell line expression panels
Process:
- Correlate enhancer activity with target gene expression
- Perform tissue-specific analysis
- Identify cancer-specific regulatory relationships
Output: Functionally relevant elements

5.2 3D Chromatin Interaction Analysis

Input: Regulatory elements, target gene
Data Sources:
- Hi-C datasets
- ChIA-PET data
- 4C-seq experiments
Process:
- Map chromatin interactions
- Validate enhancer-promoter contacts
- Identify tissue-specific interactions
Output: Validated enhancer-gene pairs

Stage 6: Clinical Relevance Assessment

6.1 Mutation Analysis

Input: Regulatory elements
Data Sources:
- COSMIC database
- TCGA mutation data
- ClinVar annotations
Process:
- Identify mutations in regulatory regions
- Assess mutation frequency across cancer types
- Predict functional impact
Output: Clinically relevant regulatory variants

6.2 Therapeutic Target Identification

Input: Key regulatory elements, bound TFs
Process:
- Identify druggable transcription factors
- Query drug databases (DrugBank, ChEMBL)
- Assess therapeutic potential
- Prioritize based on cancer specificity
Output: Ranked therapeutic targets

Implementation Framework

Required Software Stack

R/Bioconductor Packages:
- GenomicRanges, rtracklayer
- ChIPseeker, DiffBind
- motifmatchr, TFBSTools
- BSgenome packages
- InteractionSet (for 3D data)

Python Libraries:
- pybedtools, pysam
- scikit-learn (for ML models)
- matplotlib, seaborn (visualization)

External Tools:
- bedtools, samtools
- MEME Suite
- deepTools
- HOMER

Data Management

Directory Structure:
/cancer_pipeline/
├── data/
│   ├── chip_seq/
│   ├── expression/
│   ├── conservation/
│   └── mutations/
├── scripts/
├── results/
│   ├── [cancer_type]/
│   │   ├── regulatory_elements/
│   │   ├── tf_binding/
│   │   └── clinical_relevance/
└── reports/

Output Deliverables

1. Regulatory Element Catalog

Genomic coordinates and classifications
Conservation scores and cross-species mapping
Chromatin state annotations
Cell line-specific activity profiles

2. Transcription Factor Network

Cancer-specific TF binding predictions
Cooperative binding analysis
Network topology and key regulators

3. Clinical Annotation Report

Mutation landscape in regulatory regions
Expression correlation analysis
Therapeutic target prioritization
Biomarker potential assessment

4. Interactive Visualizations

Genome browser tracks
Network diagrams
Correlation heatmaps
Clinical relevance plots

Cancer-Specific Adaptations

Breast Cancer Example

Key TFs: ESR1, FOXA1, GATA3, AP1
Cell Lines: MCF7, T47D, MDA-MB-231, SK-BR-3
Specific Features: Hormone receptor status, HER2 amplification

Lung Cancer Example

Key TFs: EGFR, KRAS, TP53, MYC
Cell Lines: A549, H1299, H460, PC9
Specific Features: Smoking signatures, driver mutations

Colorectal Cancer Example

Key TFs: APC, CTNNB1, TP53, KRAS
Cell Lines: HCT116, SW480, Caco2, HT29
Specific Features: Wnt pathway alterations, microsatellite status

Quality Control and Validation

1. Technical Validation

Replicate correlation analysis
Cross-platform validation
Positive/negative control regions

2. Biological Validation

Literature mining for known elements
Functional validation suggestions
Model organism comparisons

3. Statistical Rigor

Multiple testing correction
Power analysis
Confidence interval reporting

Scalability Considerations

Computational Requirements

Memory: 32-64 GB RAM for large datasets
Storage: 1-5 TB depending on scope
Processing: Multi-core CPU, optional GPU acceleration

Parallelization Strategy

Cell line-specific processing
Chromosome-wise analysis
Distributed computing for large cohorts

This pipeline provides a systematic framework for discovering and characterizing regulatory elements across different cancer types, enabling both basic research insights and clinical applications.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
README.md		README.md
SnakeFile		SnakeFile
config.yaml		config.yaml
environment_files.txt		environment_files.txt
key_scripts.r		key_scripts.r
py_scripts.py		py_scripts.py
setup.md		setup.md

Folders and files

Latest commit

History

Repository files navigation

ONCORE

Overview

Pipeline Architecture

Input Parameters

Stage 1: Data Collection and Preprocessing

1.1 Cell Line Selection

1.2 Genomic Region Definition

Stage 2: Chromatin Landscape Analysis

2.1 ChIP-seq Data Integration

2.2 Chromatin Accessibility Analysis

Stage 3: Evolutionary Conservation Analysis

3.1 Conservation Scoring

3.2 Cross-Species Validation

Stage 4: Transcription Factor Binding Analysis

4.1 Cancer-Specific TF Identification

4.2 Motif Scanning and Binding Prediction

Stage 5: Functional Validation and Prioritization

5.1 Expression Correlation Analysis

5.2 3D Chromatin Interaction Analysis

Stage 6: Clinical Relevance Assessment

6.1 Mutation Analysis

6.2 Therapeutic Target Identification

Implementation Framework

Required Software Stack

Data Management

Output Deliverables

1. Regulatory Element Catalog

2. Transcription Factor Network

3. Clinical Annotation Report

4. Interactive Visualizations

Cancer-Specific Adaptations

Breast Cancer Example

Lung Cancer Example

Colorectal Cancer Example

Quality Control and Validation

1. Technical Validation

2. Biological Validation

3. Statistical Rigor

Scalability Considerations

Computational Requirements

Parallelization Strategy

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages