diff --git a/.gitignore b/.gitignore new file mode 100644 index 000000000..306abaf05 --- /dev/null +++ b/.gitignore @@ -0,0 +1,8 @@ +# IntelliJ IDEA files and directories +.idea/* + +# Output directories +/out/* + +# macOS system files +.DS_Store diff --git a/AUTHORS.rst b/AUTHORS.rst deleted file mode 100644 index 4b4e592e6..000000000 --- a/AUTHORS.rst +++ /dev/null @@ -1,7 +0,0 @@ -Authors -------- - -* Eric (New contributor) -* Anthony - - diff --git a/docs/clever.rst b/docs/clever.rst index a368bfc53..308a52d46 100644 --- a/docs/clever.rst +++ b/docs/clever.rst @@ -1,10 +1,11 @@ -======= +========= ExPectoSC -======= +========= Introduction ------------ + ExPectoSC is a framework for `ab initio` sequence-based prediction of mutation gene expression effects for primary human cell types. With this web interface, we provide an explorer of cell type-specific expression effect predictions. The current release contains all ClinVar variants within +/- 20kb of the representative TSS of a gene. We use 1000 Genomes variant effects predictions for z-score normalization. No effect threshold was employed for the current release of the data. The code for predicting expression effects for human genome variants and training new expression models is available at this `github repository `_. @@ -14,7 +15,7 @@ The ExPectoSC framework is described in the following manuscript: Ksenia Sokolova, Chandra L. Theesfeld, Aaron K. Wong, Zijun Zhang, Kara Dolinski and Olga G. Troyanskaya, Atlas of primary cell-type specific sequence models of gene expression and variant effects, Submitted, 2023 Website overview ------------- +---------------- After user enters a gene name, the Primary View is returned showing the predictions for the pre-computed variants for the region (includes 1000G and ClinVar variants). The variants are oriented so that the lowest chromosomal coordinate for the gene region is on the left side of the screen. 
The heatmap colors represent the max effect cell type prediction within the organ system. Rows are grouped organ systems, and columns are variant locations: .. image:: img/expectosc_img1.png @@ -26,22 +27,22 @@ To see details about the top cell type and effect per variant the user can hover .. image:: img/expectosc_img2.png :width: 800 :alt: Primary view with hover information - + To see all the cell type predictions for an organ system, the user can click on the organ name. For example, here are the PTEN results for brain: - + .. image:: img/expectosc_img3.png :width: 800 :alt: Details page - + As previously, hovering over the heat map shows additional information about the variant and effect: - + .. image:: img/expectosc_img4.png :width: 800 :alt: Details page with hover - - + + Drop-down menu in the upper left corner allows users to select multiple organ cell types at the same time for a side-by-side comparison: - + .. image:: img/expectosc_img5.jpg :width: 800 :alt: Drop-down menu @@ -50,9 +51,9 @@ Drop-down menu in the upper left corner allows users to select multiple organ ce Download -------- -`ClinVar scaled non-coding predictions `_ +`ClinVar scaled non-coding predictions `_ -`sLDSC annotations `_ +`sLDSC annotations `_ `DeepSEA weights `_ @@ -61,5 +62,5 @@ Method Details -------------- ExPectoSC is a modular framework, that uses regularized linear module upon deep convolutional network model of chromatin profifiling effects to predict cell type specific expression. The framework is capable of predicting expression levels directly from sequence and is sensitive to the sequence variations. -The chromatin predictions were computed using a DeepSEA "Beluga" model, using sliding window approach of 2000bp width with 200bp step, for the 40kb region surrounding the TSS. Exponential condense function is then used to reduce the dimensionality of the data before using it in the module 2. 
To analyze effect of the variants we get predictions for the reference and alternative sequences and compare the difference. +The chromatin predictions were computed using a DeepSEA "Beluga" model, using sliding window approach of 2000bp width with 200bp step, for the 40kb region surrounding the TSS. Exponential condense function is then used to reduce the dimensionality of the data before using it in the module 2. To analyze effect of the variants we get predictions for the reference and alternative sequences and compare the difference. diff --git a/docs/conf.py b/docs/conf.py index a1d19adbe..4dc29b285 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -101,6 +101,11 @@ import sphinx_rtd_theme html_theme = 'sphinx_rtd_theme' +html_context = { + "READTHEDOCS_VERSION": "stable", + "current_version": "stable", +} + # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. @@ -114,9 +119,6 @@ # 'navigation_depth': 4, # Depth of the headers shown in the navigation bar } -# Add any paths that contain custom themes here, relative to this directory. -html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] - # on_rtd is whether we are on readthedocs.org, this line of code grabbed from docs.readthedocs.org on_rtd = os.environ.get('READTHEDOCS', None) == 'True' if on_rtd: @@ -141,12 +143,12 @@ # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". -html_static_path = ['_static'] +html_static_path = ['img'] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. 
-#html_extra_path = [] +html_extra_path = ['img'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. diff --git a/docs/deepsea.rst b/docs/deepsea.rst deleted file mode 100644 index 8c7c4464c..000000000 --- a/docs/deepsea.rst +++ /dev/null @@ -1,99 +0,0 @@ -======= -Sei / DeepSEA -======= - -Introduction ------------- - -Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository `here `_ or read about our manuscript `here `_. - -Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class. - -For older DeepSEA models see: -:doc:`beluga` (2019) - - -Input ------ - -File formats -~~~~~~~~~~~~ -We support three types of input: vcf, fasta, bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction: - -**VCF format** is used for specifying a genomic variant. 
A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. Currently, the genome position needs to be in GRCh37/hg19 - -**Fasta format** input should include sequences of 4096bp length each. If a sequence is longer than 4096bp, only the center 4096bp will be used. - -**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 4096bp-length regions. A minimal example is ``chr1 109817091 109821186``. The three columns are chromosome, start position, and end position. - -Genome coordinates -~~~~~~~~~~~~~~~~~~ -We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version. - -Large submissions -~~~~~~~~~~~~~~~~~ -We recommend using the web server if you have <10,000 variants or sequences. You will experience degraded performance when submitting a larger set of sequences. In those instances, we suggest that you split the set into multiple <10,000 submissions, or run the standalone version on your local machine, or contact our group directly. - - -Output ------- - -Sequence classes -~~~~~~~~~~~~~~~~~~~~~~~~~ - -The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence. - -To help interpretation, we grouped sequence classes into groups including P (Promoter), E (Enhancer), CTCF (CTCF-cohesin binding), TF (TF binding), PC (Polycomb-repressed), HET (Heterochromatin), TN (Transcription), and L (Low Signal) sequence classes. Please refer to our manuscript for a more detailed description of the sequence classes. - -Note: sequence class predictions are only available for vcf inputs. 
- -:: - - | Sequence class label | Sequence class name | Rank by size | Group | - |---------------------:|----------------------------------:|-------------:|------:| - | PC1 | Polycomb / Heterochromatin | 0 | PC | - | L1 | Low signal | 1 | L | - | TN1 | Transcription | 2 | TN | - | TN2 | Transcription | 3 | TN | - | L2 | Low signal | 4 | L | - | E1 | Stem cell | 5 | E | - | E2 | Multi-tissue | 6 | E | - | E3 | Brain / Melanocyte | 7 | E | - | L3 | Low signal | 8 | L | - | E4 | Multi-tissue | 9 | E | - | TF1 | NANOG / FOXA1 | 10 | TF | - | HET1 | Heterochromatin | 11 | HET | - | E5 | B-cell-like | 12 | E | - | E6 | Weak epithelial | 13 | E | - | TF2 | CEBPB | 14 | TF | - | PC2 | Weak Polycomb | 15 | PC | - | E7 | Monocyte / Macrophage | 16 | E | - | E8 | Weak multi-tissue | 17 | E | - | L4 | Low signal | 18 | L | - | TF3 | FOXA1 / AR / ESR1 | 19 | TF | - | PC3 | Polycomb | 20 | PC | - | TN3 | Transcription | 21 | TN | - | L5 | Low signal | 22 | L | - | HET2 | Heterochromatin | 23 | HET | - | L6 | Low signal | 24 | L | - | P | Promoter | 25 | P | - | E9 | Liver / Intestine | 26 | E | - | CTCF | CTCF-Cohesin | 27 | CTCF | - | TN4 | Transcription | 28 | TN | - | HET3 | Heterochromatin | 29 | HET | - | E10 | Brain | 30 | E | - | TF4 | OTX2 | 31 | TF | - | HET4 | Heterochromatin | 32 | HET | - | L7 | Low signal | 33 | L | - | PC4 | Polycomb / Bivalent stem cell Enh | 34 | PC | - | HET5 | Centromere | 35 | HET | - | E11 | T-cell | 36 | E | - | TF5 | AR | 37 | TF | - | E12 | Erythroblast-like | 38 | E | - | HET6 | Centromere | 39 | HET | - - - -Regulatory feature scores -~~~~~~~~~~~~~~~~~~~~~~~~~ -* **diffs**: The difference between the the predicted probability of the reference allele and the alternative allele for a regulatory feature (:math:`p_{alt} -p_{ref}`). 
diff --git a/docs/index.rst b/docs/index.rst index dc2cd6884..35195510f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -40,7 +40,6 @@ Help topics :maxdepth: 2 :glob: - usage functional-networks tissue-networks modules @@ -48,4 +47,5 @@ Help topics sei beluga expecto + clever citations diff --git a/docs/modules.rst b/docs/modules.rst index e9cf110f1..e810e8fc6 100644 --- a/docs/modules.rst +++ b/docs/modules.rst @@ -6,20 +6,19 @@ HumanBase applies community detection to find cohesive gene clusters from a prov Method ------ -The approach\ :sup:`1` is based on shared k-nearest-neighbors (SKNN) and the Louvain community-finding algorithm to cluster the user-selected tissue network into distinct modules of tightly connected genes. The SKNN-based strategy has the advantages of alleviating the effect of high-degree genes and accentuating local network structure by connecting genes that are likely to be functionally clustered together. - +The approach\ :sup:`1` is based on shared k-nearest-neighbors (SKNN) and the Louvain community-finding algorithm to cluster the user-selected tissue network into distinct modules of tightly connected genes. The SKNN-based strategy has the advantages of alleviating the effect of high-degree genes and accentuating local network structure by connecting genes that are likely to be functionally clustered together. + This technique proceeds as follows: - (i) First, we create a subset of the user-selected network containing only the user-provided genes and all the edges between them. Given the resulting graph G with V nodes (user-provided genes) and E edges, with each edge between genes i and j associated with a weight p\ :sub:`ij`, + (i) First, we create a subset of the user-selected network containing only the user-provided genes and all the edges between them. 
Given the resulting graph G with V nodes (user-provided genes) and E edges, with each edge between genes i and j associated with a weight p\ :sub:`ij`, (ii) Calculate a new weight for the edge between each pair of nodes i and j that is equal to the number of k nearest neighbors (based on the original weights p\ :sub:`ij`) shared by i and j; (iii) Choose the top 5% of the edges based on the new edge weights, and apply a graph clustering algorithm. -This approach has two key desirable characteristics: - (i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes; - (ii) Emphasizing local network-structure by connecting nodes that share a number of local neighbors automatically links genes that are highly likely to be part of the same cluster. - -We use a dynamic :code:`k = min(50, 0.2 * |V|)` to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate cluster comembership scores for each pair of genes that was equal to the fraction of times (out of 100) the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score ≥ 0.9. +This approach has two key desirable characteristics: + (i) Choosing the highest k values instead of all edges deemphasizes high-degree 'hub' nodes and brings equal attention to highly specific edges between low-degree nodes; + (ii) Emphasizing local network-structure by connecting nodes that share a number of local neighbors automatically links genes that are highly likely to be part of the same cluster. -Resulting modules are then tested for functional enrichment using genes annotated to Gene Ontology biological process terms. 
Representative processes and pathways enriched within each cluster are presented alongside of the cluster with their resulting Q value. The Q value of each term associated to the modules is calculated using one-sided Fisher's exact tests and Benjamini–Hochberg corrections to correct for multiple tests. +We use a dynamic :code:`k = min(50, 0.2 * |V|)` to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate cluster comembership scores for each pair of genes that was equal to the fraction of times (out of 100) the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score ≥ 0.9. +Resulting modules are then tested for functional enrichment using genes annotated to terms from selected databases, including Gene Ontology Biological Process, Disease Ontology, MSigDB Hallmark (H), and MSigDB Canonical Pathways (C2-CP). Representative processes, pathways, and disease associations enriched within each cluster are presented alongside the cluster with their resulting Q values. The Q value of each term associated with the modules is calculated using one-sided Fisher’s exact tests and Benjamini–Hochberg corrections to correct for multiple tests. 1. Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG.(2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience.
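
Reviewer note on the docs/modules.rst hunk: the module-detection steps it describes (shared-k-nearest-neighbor reweighting, keeping the top 5% of edges, and co-membership scoring across repeated clustering runs) can be sketched roughly as below. This is a minimal illustration and not the HumanBase implementation. The function names `sknn_edges` and `comembership`, the dictionary-based edge representation, the flooring of `k`, and the tie-breaking order are all assumptions made for the example. The Louvain step itself is left out, and the sketch expects the per-run cluster labels to be supplied by whichever clustering routine is used.

```python
import math
from itertools import combinations


def sknn_edges(weights, top_fraction=0.05, k_cap=50, k_ratio=0.2):
    """Reweight edges by shared k nearest neighbours and keep the top fraction.

    `weights` maps frozenset({i, j}) -> p_ij for the subnetwork induced by
    the user-provided genes (hypothetical input format).
    """
    nodes = sorted({n for edge in weights for n in edge})
    # Dynamic k = min(50, 0.2 * |V|); flooring and the lower bound of 1
    # are assumptions, the text does not say how k is rounded.
    k = max(1, min(k_cap, int(k_ratio * len(nodes))))

    # k nearest neighbours of each node, ranked by the original weight p_ij.
    neighbours = {}
    for i in nodes:
        ranked = sorted(
            (j for j in nodes if j != i and frozenset((i, j)) in weights),
            key=lambda j: weights[frozenset((i, j))],
            reverse=True,
        )
        neighbours[i] = set(ranked[:k])

    # New weight for each pair = number of k nearest neighbours they share.
    shared = {}
    for i, j in combinations(nodes, 2):
        count = len(neighbours[i] & neighbours[j])
        if count > 0:
            shared[(i, j)] = count

    # Keep the top 5% of edges by new weight (ties broken by node names).
    keep = max(1, math.ceil(top_fraction * len(shared)))
    ranked_edges = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return k, dict(ranked_edges[:keep])


def comembership(runs):
    """Fraction of clustering runs in which each pair shares a cluster.

    `runs` is a list of dicts mapping node -> cluster label, one per run.
    """
    nodes = sorted(runs[0])
    return {
        (i, j): sum(r[i] == r[j] for r in runs) / len(runs)
        for i, j in combinations(nodes, 2)
    }


# Toy network with five genes; the weights are made up for illustration.
toy = {
    frozenset(("a", "c")): 0.9,
    frozenset(("b", "c")): 0.8,
    frozenset(("a", "b")): 0.1,
    frozenset(("d", "e")): 0.7,
    frozenset(("c", "d")): 0.2,
}
k, kept = sknn_edges(toy)

# Two made-up clustering runs standing in for repeated Louvain runs.
scores = comembership([
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
])
```

In the documented procedure the retained edges would be passed to a Louvain implementation 100 times (for example a library routine such as NetworkX's `louvain_communities`), and pairs with a co-membership score of at least 0.9 would be assigned to the same module.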