diff --git a/docs/_static/config/beginner.yaml b/docs/_static/config/beginner.yaml index 202879534..951dd2834 100644 --- a/docs/_static/config/beginner.yaml +++ b/docs/_static/config/beginner.yaml @@ -26,8 +26,9 @@ algorithms: include: true run1: k: 1 - # run2: # uncomment for step 3.2 - # k: [10, 100] # uncomment for step 3.2 + + # run2: # uncomment for step 3.2 + # k: [10, 100] # uncomment for step 3.2 # Here we specify which pathways to run and other file location information. # Assume that if a dataset label does not change, the lists of associated input files do not change @@ -45,7 +46,7 @@ reconstruction_settings: # Set where everything is saved locations: - reconstruction_dir: "output/basic" + reconstruction_dir: "output/beginner" analysis: # Create one summary per pathway file and a single summary table for all pathways for each dataset diff --git a/docs/_static/images/egf-interactome.png b/docs/_static/images/egf-interactome.png new file mode 100644 index 000000000..465fdf7c1 Binary files /dev/null and b/docs/_static/images/egf-interactome.png differ diff --git a/docs/_static/images/erbb-signaling-pathway.png b/docs/_static/images/erbb-signaling-pathway.png new file mode 100644 index 000000000..503b962ff Binary files /dev/null and b/docs/_static/images/erbb-signaling-pathway.png differ diff --git a/docs/_static/images/pca-kde.png b/docs/_static/images/pca-kde.png new file mode 100644 index 000000000..771b6ee23 Binary files /dev/null and b/docs/_static/images/pca-kde.png differ diff --git a/docs/_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png b/docs/_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png new file mode 100644 index 000000000..7e8dd20b0 Binary files /dev/null and b/docs/_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png differ diff --git a/docs/_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png b/docs/_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png new file mode 100644 index 
000000000..e5820cc12 Binary files /dev/null and b/docs/_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png differ diff --git a/docs/_static/images/pr-per-pathway-nodes.png b/docs/_static/images/pr-per-pathway-nodes.png new file mode 100644 index 000000000..ae9239f31 Binary files /dev/null and b/docs/_static/images/pr-per-pathway-nodes.png differ diff --git a/docs/tutorial/advanced.rst b/docs/tutorial/advanced.rst index a306543dd..7a20cc68f 100644 --- a/docs/tutorial/advanced.rst +++ b/docs/tutorial/advanced.rst @@ -1,31 +1,182 @@ +################################### Advanced Capabilities and Features -====================================== +################################### -More like these are all the things we can do with this, but will not be showing +Parameter tuning +================ +Parameter tuning is the process of determining which parameter combinations should be explored for each algorithm for a given dataset. +Parameter tuning focuses on defining and refining the parameter search space. -- mention parameter tuning -- say that parameters are not preset and need to be tuned for each dataset +Each dataset has unique characteristics so there are no preset parameters combinations to use. +Instead, we recommend tuning parameters individually for each new dataset. +SPRAS provides a flexible framework for getting parameter grids for any algorithms for a given dataset. -CHTC integration +Grid Search +------------ -Anything not included in the config file +A grid search systematically checks different combinations of parameter values to see how each affects network reconstruction results. -1. Global Workflow Control +In SPRAS, users can define parameter grids for each algorithm directly in the configuration file. +When executed, SPRAS automatically runs each algorithm across all parameter combinations and collects the resulting subnetworks. -Sets options that apply to the entire workflow. 
+SPRAS will also support parameter refinement using graph topological heuristics. +These topological metrics help identify parameter regions that produce biologically plausible output networks. +Based on these heuristics, SPRAS will generate new configuration files with refined parameter grids for each algorithm per dataset. -- Examples: the container framework (docker, singularity, dsub) and where to pull container images from +Users can further refine these grids by rerunning the updated configuration and adjusting the parameter ranges around the newly identified regions to find and fine-tune the most promising algorithm-specific outputs for a given dataset. -running spras with multiple parameter combinations with multiple algorithms on multiple Datasets -- for the tutorial we are only doing one dataset +.. note:: -4. Gold Standards + Some grid search features are still under development and will be added in future SPRAS releases. -Defines the input files SPRAS will use to evaluate output subnetworks +Parameter selection +------------------- -A gold standard dataset is comprised of: +Parameter selection refers to the process of determining which parameter combinations should be used for evaluation on a gold standard dataset. -- a label: defines the name of the gold standard dataset -- node_file or edge_file: a list of either node files or edge files. Only one or the other can exist in a single dataset. At the moment only one edge or one node file can exist in one dataset -- data_dir: the path to where the input gold standard files live -- dataset_labels: a list of dataset labels that link each gold standard links to one or more datasets via the dataset labels +Parameter selection is handled in the evaluation code, which supports multiple parameter selection strategies. +Once the grid search is complete for each dataset, the user can enable evaluation (by setting ``include: true`` under ``evaluation``) and SPRAS will run all of the parameter selection code.
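The Cartesian-product expansion that a grid search performs over a run block can be sketched in a few lines of Python. This is an illustrative sketch, not SPRAS's internal code, and the parameter names (``k``, ``w``, ``seed``) are hypothetical:

```python
from itertools import product

# A hypothetical run block: one fixed parameter and two list-valued parameters.
run_block = {"k": [10, 100], "w": [0.5, 1.0], "seed": 42}

# Normalize every value to a list, then take the Cartesian product.
keys = list(run_block)
values = [v if isinstance(v, list) else [v] for v in run_block.values()]
combinations = [dict(zip(keys, combo)) for combo in product(*values)]

print(len(combinations))  # 2 x 2 x 1 = 4 combinations, each run once per algorithm
```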
+ +PCA-based parameter selection +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The PCA-based approach identifies a representative parameter setting for each pathway reconstruction algorithm on a given dataset. +It selects the single parameter combination that best captures the central trend of an algorithm's reconstruction behavior. + +.. image:: ../_static/images/pca-kde.png + :alt: Principal component analysis visualization across pathway outputs with a kernel density estimate computed on top + :width: 600 + :align: center + +.. raw:: html + +
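The density-peak selection step can be sketched with a toy example. This is illustrative only, not SPRAS's implementation: it assumes the 2D PCA coordinates of each output have already been computed, and it uses a simple fixed-bandwidth Gaussian KDE to score each projected point:

```python
import math

# Hypothetical 2D PCA coordinates, one point per parameter combination.
# Three outputs cluster together; one is an outlier.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (2.0, 2.0)]

def kde(p, pts, bandwidth=0.5):
    """Sum of Gaussian kernels centered on every projected output."""
    return sum(
        math.exp(-((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / (2 * bandwidth ** 2))
        for q in pts
    )

# Choose the output whose projection sits in the densest region.
densities = [kde(p, points) for p in points]
representative = max(range(len(points)), key=densities.__getitem__)
print(representative)  # prints 2: the point at the center of the cluster, not the outlier
```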
+ +For each algorithm, all reconstructed subnetworks are projected into an algorithm-specific 2D PCA space based on the set of edges produced by the respective parameter combinations for that algorithm. +This projection summarizes how the algorithm's outputs vary across different parameter combinations, allowing patterns in the outputs to be visualized in a lower-dimensional space. + +Within each PCA space, a kernel density estimate (KDE) is computed over the projected points to identify regions of high density. +The output closest to the highest KDE peak is selected as the most representative parameter setting, as it corresponds to the region where the algorithm most consistently produces similar subnetworks. + +Ensemble network-based parameter selection +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The ensemble-based approach combines results from all parameter settings for each pathway reconstruction algorithm on a given dataset. +Instead of focusing on a single "best" parameter combination, it summarizes the algorithm's overall reconstruction behavior across parameters. + +All reconstructed subnetworks are merged into algorithm-specific ensemble networks, where each edge weight reflects how frequently that interaction appears across the outputs. +Edges that occur more often are assigned higher weights, highlighting interactions that are most consistently recovered by the algorithm. + +These consensus networks help identify the core patterns and overall stability of an algorithm's outputs without needing to choose a single parameter setting, which is useful when no single parameter combination is clearly optimal. + + +Ground truth-based evaluation without parameter selection +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The no-parameter-selection approach evaluates all parameter combinations for each pathway reconstruction algorithm on a given dataset.
+This approach can be useful for identifying patterns in algorithm performance without favoring any specific parameter setting. + +Evaluation +============ + +In some cases, users may have a gold standard file that allows them to evaluate the quality of the reconstructed subnetworks generated by pathway reconstruction algorithms. + +However, gold standards may not exist for certain types of experimental data where validated ground truth interactions or molecules are unavailable or incomplete. +For example, in emerging research areas or poorly characterized biological systems, interactions may not yet be experimentally verified or fully known, making it difficult to define a reliable reference network for evaluation. + +Adding gold standard datasets and evaluation to a configuration +-------------------------------------------------------------------------- + +In the configuration file, users can specify one or more gold standard datasets to evaluate the subnetworks reconstructed from each dataset. +When gold standards are provided and evaluation is enabled (``include: true``), SPRAS will automatically compare the reconstructed subnetworks for a specific dataset against the corresponding gold standards. + +.. code-block:: yaml + + gold_standards: + - + label: gs1 + node_files: ["gs_nodes0.txt", "gs_nodes1.txt"] + data_dir: "input" + dataset_labels: ["data0"] + - + label: gs2 + edge_files: ["gs_edges0.txt"] + data_dir: "input" + dataset_labels: ["data0", "data1"] + + analysis: + evaluation: + include: true + +A gold standard dataset must include the following types of keys and files: + +- ``label``: a name that uniquely identifies a gold standard dataset throughout the SPRAS workflow and outputs. +- ``node_files`` or ``edge_files``: A list of node or edge files. Only one of these can be defined per gold standard dataset. +- ``data_dir``: The file path of the directory where the input gold standard dataset files are located.
+- ``dataset_labels``: a list of dataset labels indicating which datasets this gold standard dataset should be evaluated against. + +When evaluation is enabled, SPRAS will automatically run its built-in evaluation analysis on each defined dataset-gold standard pair. +This evaluation computes metrics such as precision, recall, and precision-recall curves, depending on the parameter selection method used. + +For each pathway, evaluation can be run independently of any parameter selection method (the ground truth-based evaluation without parameter selection idea) to directly inspect precision and recall for each reconstructed network from a given dataset. + +.. image:: ../_static/images/pr-per-pathway-nodes.png + :alt: Precision and recall computed for each pathway and visualized on a scatter plot + :width: 600 + :align: center + +.. raw:: html + +
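These per-pathway precision and recall values reduce to set overlap between a reconstructed network and the gold standard. A minimal sketch with hypothetical node sets (not SPRAS's actual evaluation code):

```python
# Hypothetical node sets; SPRAS derives these from pathway and gold standard
# files, but the metrics themselves are simple set arithmetic.
predicted = {"EGF", "EGFR", "GRB2", "SOS1"}
gold_standard = {"EGF", "EGFR", "SHC1"}

true_positives = predicted & gold_standard
precision = len(true_positives) / len(predicted)   # 2/4 = 0.5
recall = len(true_positives) / len(gold_standard)  # 2/3 ~= 0.667

print(precision, round(recall, 3))
```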
+ +Ensemble-based parameter selection generates precision-recall curves by thresholding on the frequency of edges across an ensemble of reconstructed networks for an algorithm on a given dataset. + +.. image:: ../_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png + :alt: Precision-recall curve computed for a single ensemble file / pathway and visualized as a curve + :width: 600 + :align: center + +.. raw:: html +
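The frequency-thresholding idea can be sketched with a toy ensemble (hypothetical edges, not SPRAS's implementation): sweeping the threshold trades precision against recall, and each threshold contributes one point on the curve.

```python
from collections import Counter

# Hypothetical ensemble: each reconstructed network is a set of edges.
networks = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("B", "C"), ("D", "E")},
]
gold_edges = {("A", "B"), ("B", "C")}

# Edge frequency across the ensemble.
counts = Counter(e for net in networks for e in net)
freq = {e: c / len(networks) for e, c in counts.items()}

# Sweep thresholds; each threshold yields one precision/recall point.
for t in (1.0, 0.5, 0.0):
    kept = {e for e, f in freq.items() if f >= t}
    tp = kept & gold_edges
    precision = len(tp) / len(kept) if kept else 1.0
    recall = len(tp) / len(gold_edges)
    print(t, precision, recall)
```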
+ +PCA-based parameter selection computes precision and recall for a single reconstructed network selected using PCA from all reconstructed networks for an algorithm on a given dataset. + +.. image:: ../_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png + :alt: Precision and recall computed for each pathway chosen by the PCA-selection method and visualized on a scatter plot + :width: 600 + :align: center + +.. raw:: html +
+ +.. note:: + Evaluation will only execute if the ``ml`` analysis has ``include: true``, because the PCA parameter selection step depends on the PCA ML analysis. + +.. note:: + To see evaluation in action, run SPRAS using the ``config.yaml`` or ``egfr.yaml`` configuration files. + +HTCondor integration +===================== + +Running SPRAS locally can become slow and resource-intensive, especially when running many algorithms, parameter combinations, or datasets simultaneously. + +To address this, SPRAS supports an integration with `HTCondor `__ (a high-throughput computing system), allowing Snakemake jobs to be distributed in parallel and executed across available compute. + +See :doc:`Running with HTCondor <../htcondor>` for more information on SPRAS's integration with HTCondor. + + +Ability to run with different container frameworks +--------------------------------------------------- + +CHTC uses Apptainer to run containerized software in secure, high-performance environments. + +SPRAS accommodates this by allowing users to specify which container framework to use globally within their workflow configuration. + +The global workflow control section in the configuration file allows a user to set which SPRAS-supported container framework to use: + +..
code-block:: yaml + + container_framework: docker + +The supported frameworks are Docker, Apptainer/Singularity, and dsub. diff --git a/docs/tutorial/beginner.rst b/docs/tutorial/beginner.rst index f7ab0d802..9c8f7f236 100644 --- a/docs/tutorial/beginner.rst +++ b/docs/tutorial/beginner.rst @@ -9,23 +9,18 @@ You will learn how to: - Set up the SPRAS software environment - Explore the folder structure and understand how inputs, configurations, and outputs are organized - Configure and run a pathway reconstruction algorithm on a provided dataset -- Enable post-analysis steps to generate post analysis information (summary statistics and Cytoscape visualizations) +- Enable post-analysis steps to generate post-analysis information Step 0: Clone the SPRAS repository, set up the environment, and run Docker ========================================================================== -0.1 Start Docker ---------------- - -Launch Docker Desktop and wait until it says "Docker is running". - -0.2 Clone the SPRAS repository +0.1 Clone the SPRAS repository ------------------------------- Visit the `SPRAS GitHub repository `__ and clone it locally -0.3 Set up the SPRAS environment +0.2 Set up the SPRAS environment ------------------------------------- From the root directory of the SPRAS repository, create and activate the Conda environment and install the SPRAS python package: @@ -36,7 +31,14 @@ From the root directory of the SPRAS repository, create and activate the Conda e conda activate spras python -m pip install . -0.4 Test the installation +.. note:: + The first command performs a one-time installation of the SPRAS dependencies by creating a Conda environment (an isolated space that keeps all required packages and versions separate from your system). + + The second command activates the newly created environment so you can use these dependencies when running SPRAS; this step must be done each time you open a new terminal session.
+ + The last command is a one-time installation of the SPRAS package into the environment. + +0.3 Test the installation ------------------------- Run the following command to confirm that SPRAS has been set up successfully from the command line: @@ -45,39 +47,76 @@ Run the following command to confirm that SPRAS has been set up successfully fro python -c "import spras; print('SPRAS import successful')" -Step 1: Explanation of configuration file -========================================= +0.4 Start Docker +---------------- + +Before running SPRAS, make sure Docker Desktop is running. + +Launch Docker Desktop and wait until it says "Docker is running". + +.. note:: + SPRAS itself does not run inside a Docker container. + However, Docker is required because SPRAS uses it to execute individual pathway reconstruction algorithms and certain post-analysis steps within isolated containers. + These containers include all the necessary dependencies to run each algorithm or post-analysis step. + +Step 1: Configuration files +============================ A configuration file specifies how a SPRAS workflow should run; think of it as the control center for the workflow. + It defines which algorithms to run, the parameters to use, the datasets and gold standards to include, the analyses to perform after reconstruction, and the container settings for execution. -SPRAS uses Snakemake (a workflow manager) and containerized software (like Docker and Apptainer), to read the configuration file and execute a SPRAS workflow. +The configuration files used are written in YAML, a human-readable format that uses simple indentation and key-value pairs for data serialization. + +SPRAS uses Snakemake to read the YAML configuration file and execute a SPRAS workflow accordingly. + +.. Snakemake considers a task from the configuration file complete once the expected output files are present in the output directory. +..
As a result, rerunning the same configuration file may do nothing if those files already exist. +.. To continue or rerun SPRAS with the same configuration file, delete the output directory (or its contents) or modify the configuration file so Snakemake regenerates new results. -Snakemake considers a task from the configuration file complete once the expected output files are present in the output directory. -As a result, rerunning the same configuration file may do nothing if those files already exist. -To continue or rerun SPRAS with the same configuration file, delete the output directory (or its contents) or modify the configuration file so Snakemake regenerates new results. +1.1 Save config for this tutorial +---------------------------------- For this part of the tutorial, we'll use a pre-defined configuration file. Download it here: :download:`Beginner Config File <../_static/config/beginner.yaml>` -Save the file into the config/ folder of your SPRAS installation. -After adding this file, SPRAS will use the configuration to set up and reference your directory structure, which will look like this: +Save the file into the ``config/`` folder of your SPRAS installation. +After adding this file, your directory structure will look like this (ignoring the rest of the folders): .. code-block:: text spras/ ├── config/ - │ └── beginner.yaml + │ ├── beginner.yaml + │ └── ... other configs ... ├── inputs/ - │ ├── phosphosite-irefindex13.0-uniprot.txt # pre-defined in SPRAS already - │ └── tps-egfr-prizes.txt # pre-defined in SPRAS already + │ ├── phosphosite-irefindex13.0-uniprot.txt # pre-defined in SPRAS already, used by the beginner.yaml file + │ ├── tps-egfr-prizes.txt # pre-defined in SPRAS already, used by the beginner.yaml file + │ └── ... other input data ... -Here's an overview of the major sections when looking at a configuration file: +config/ +^^^^^^^^^ -Algorithms ------------ +The ``config/`` folder stores configuration files for SPRAS. + +.. 
note:: + You can store configuration files anywhere, as long as you provide the correct path when running SPRAS (explained later in this tutorial). +inputs/ +^^^^^^^^ + +The ``inputs/`` folder contains input data files. +You can use the provided example datasets or add your own for custom experiments to this folder. + +.. note:: + Input files can be stored anywhere as long as their paths are correctly referenced in the configuration file (explained later in this tutorial). + +1.2 Overview of the major sections of a configuration file: +------------------------------------------------------------ + +Algorithms +^^^^^^^^^^^ .. code-block:: yaml @@ -95,17 +134,22 @@ Algorithms g: 1e-3 -When defining an algorithm in the configuration file, its name must match one of the supported SPRAS algorithms (introduced in the intermediate tutorial / more information on the algorithms can be found under the Supported Algorithms section). +When defining an algorithm in the configuration file, its name must match one of the supported SPRAS algorithms. Each algorithm includes an include flag, which you set to true to have Snakemake run it, or false to disable it. Algorithm parameters can be organized into one or more run blocks (e.g., run1, run2, …), with each block containing key-value pairs. When defining a parameter, it can be passed as a single value or passed by listing parameters within a list. If multiple parameters are defined as lists within a run block, SPRAS generates all possible combinations (Cartesian product) of those list values together with any fixed single-value parameters in the same run block. Each unique combination runs once per algorithm. -Invalid or missing parameter keys will cause SPRAS to fail. + +Each algorithm exposes its own set of parameters that control its optimization strategy. +Some algorithms have no adjustable parameters, while others include multiple tunable settings that influence how subnetworks are created. 
+These parameters vary widely between algorithms and reflect the unique optimization techniques each method employs under the hood. + +(See :doc:`Pathway Reconstruction Methods <../prms/prms>` for information about algorithms and their parameters). Datasets --------- +^^^^^^^^^^^ .. code-block:: yaml @@ -120,16 +164,21 @@ Datasets In the configuration file, datasets are defined under the datasets section. Each dataset you define will be run against all of the algorithms enabled in the configuration file. -The dataset must include the following types of keys and files: +A dataset must include the following types of keys and files: + +- ``label``: a name that uniquely identifies a dataset throughout the SPRAS workflow and outputs +- ``node_files``: Input files listing nodes of interest +- ``edge_files``: Input interactome file that defines the relationships between nodes +- ``other_files``: This placeholder is not used +- ``data_dir``: The file path of the directory where the input dataset files are located + +.. note:: + A node represents a molecule, and an edge represents an interaction connecting two molecules. + An interactome is a large network of possible interactions that defines many edges connecting molecules. -- label: a name that uniquely identifies a dataset throughout the SPRAS workflow and outputs. -- node_files: Input files listing the “prizes” or important starting nodes ("sources" or "targets") for the algorithm -- edge_files: Input interactome or network file that defines the relationships between nodes -- other_files: This placefolder is not used -- data_dir: The file path of the directory where the input dataset files are located -Reconstruction Settings ------------------------ +Reconstruction settings +^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml @@ -137,13 +186,11 @@ Reconstruction Settings locations: reconstruction_dir: "output" - The reconstruction_settings section controls where outputs are stored. 
Set reconstruction_dir to the directory path where you want results saved. SPRAS will automatically create this folder if it doesn't exist. -If you are running multiple configuration files, you can set unique paths to keep outputs organized and separate. Analysis --------- +^^^^^^^^^ .. code-block:: yaml @@ -156,63 +203,77 @@ Analysis include: true - SPRAS includes multiple downstream analyses that can be toggled on or off directly in the configuration file. When enabled, these analyses are performed per dataset and produce summaries or visualizations of the results from all enabled algorithms for that dataset. +.. note:: + The configuration file and sections shown here do not represent the full set of options available in SPRAS. + + The SPRAS documentation is still under construction, and the examples provided (like ``beginner.yaml``) only show the basic configuration needed for this tutorial. + + To see a more complete set of configurable options and parameters, refer to the full examples in ``config/config.yaml`` and ``config/egfr.yaml`` within the SPRAS repository. + Step 2: Running SPRAS on a provided example dataset ==================================================== -2.1 Running SPRAS with the Beginner Configuration +2.1 Running SPRAS with the beginner configuration ------------------------------------------------- -In the beginner.yaml configuration file, it is set up have SPRAS run a single algorithm with one parameter setting on one dataset. +The ``beginner.yaml`` configuration file is set up to have SPRAS run a single algorithm with one parameter setting on one dataset. -From the root directory spras/, run the command below from the command line: +From the root directory, run the command below from the command line: .. code:: bash snakemake --cores 1 --configfile config/beginner.yaml -What Happens When You Run This Command +This command starts the workflow manager that automates all steps defined by SPRAS.
+It tells Snakemake to use one CPU core and to load settings from the ``config/beginner.yaml`` file. + +What happens when you run this command ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -SPRAS will executes quickly from your perspective; however, several automated steps (handled by Snakemake and Docker) occur behind the scenes. -1. Snakemake starts the workflow +SPRAS will execute quickly from your perspective; however, several automated steps (handled by Snakemake and Docker) occur behind the scenes. -Snakemake reads the options set in the beginner.yaml configuration file and determines which datasets, algorithms, and parameter combinations need to run and if any post-analysis steps were requested. +.. note:: + On Apple computers (M1/M2/M3 chips), the first run may take slightly longer because the SPRAS Docker images are built for the x86-64 (AMD64) architecture, not ARM, so Docker must emulate the images before execution. -2. Preparing the dataset +1. Snakemake starts the workflow + +Snakemake reads the options set in the ``beginner.yaml`` configuration file and determines which datasets, algorithms, and parameter combinations need to run and if any post-analysis steps were requested. -SPRAS takes the interactome and node prize files specified in the configuration and bundles them into a Dataset object to be used for processing algorithm specific inputs. -This object is stored as a .pickle file (e.g. dataset-egfr-merged.pickle) so it can be reused for other algorithms without re-processing it. +2. Creating algorithm-specific inputs -3. Creating algorithm specific inputs +For each algorithm marked as ``include: true`` in the configuration, SPRAS generates input files tailored to those algorithms using the dataset specified in the config file. -For each algorithm marked as include: true in the configuration, SPRAS generates input files tailored to that algorithm using the input standardized egfr dataset. In this case, only PathLinker is enabled.
-SPRAS creates the network.txt and nodetypes.txt files required by PathLinker in the prepared/egfr-pathlinker-inputs/. +SPRAS creates the files required by PathLinker and places them in the ``prepared/egfr-pathlinker-inputs/`` directory. 4. Organizing results with parameter hashes -Each dataset-algorithm-parameter combination is placed in its own folder named like egfr-pathlinker-params-D4TUKMX/. -D4TUKMX is a hash that uniquely identifies the specific parameter combination (k = 10 here). -A matching log file in logs/parameters-pathlinker-params-D4TUKMX.yaml records the exact parameter values. +Each new ``<dataset>-<algorithm>-params-<hash>`` combination gets its own folder created in ``output/beginner/``. + +For this configuration file, only ``egfr-pathlinker-params-D4TUKMX/`` is created in ``output/beginner/``. +D4TUKMX is a hash that uniquely identifies a specific parameter combination (k = 10). + +A matching log file is placed in ``logs/parameters-pathlinker-params-D4TUKMX.yaml``, which records the exact parameter values used. 5. Running the algorithm -SPRAS launches the PathLinker Docker image that it downloads from DockerHub, sending it the prepared files and parameter settings. -PathLinker runs and produces a raw pathway output file (raw-pathway.txt) that holds the subnetwork it found in its own native format. +SPRAS downloads the PathLinker Docker image from `DockerHub `__ and launches it in a container, sending the prepared input files and specific parameter settings needed for execution. + +PathLinker runs and generates an output file named ``raw-pathway.txt``, which contains the reconstructed subnetwork in PathLinker's algorithm-specific format. + +SPRAS then saves this file in its corresponding folder. 6. Standardizing the results -SPRAS parses the raw PathLinker output into a standardized SPRAS format (pathway.txt). -This ensures all algorithms output are put into a standardized output, because their native formats differ.
+SPRAS parses the raw PathLinker output into a standardized SPRAS format (``pathway.txt``) and saves this file in its corresponding folder. 7. Logging the Snakemake run -Snakemake creates a dated log in .snakemake/log/. This log shows what rules ran and any errors that occurred during the SPRAS run. +Snakemake creates a dated log in ``.snakemake/log/``. This log shows what jobs ran and any errors that occurred during the SPRAS run. -What Your Directory Structure Should Like After This Run: +What your directory structure should look like after this run: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: text @@ -226,7 +287,7 @@ What Your Directory Structure Should Like After This Run: │ ├── phosphosite-irefindex13.0-uniprot.txt │ └── tps-egfr-prizes.txt ├── outputs/ - │ └── basic/ + │ └── beginner/ │ └── egfr-pathlinker-params-D4TUKMX/ │ └── pathway.txt │ └── raw-pathway.txt @@ -240,59 +301,36 @@ What Your Directory Structure Should Like After This Run: │ └── dataset-egfr-merged.pickle -Step 2.2: Overview of the SPRAS Folder Structure ================================================= -After running the SPRAS command, you'll see that the folder structure includes four main directories that organize everything needed to run workflows and store their results. -.. code-block:: text - - spras/ - ├── .snakemake/ - │ └── log/ - │ └── ... snakemake log files ... - ├── config/ - │ └── ... other configs ... - ├── inputs/ - │ └── ... input files ... - ├── outputs/ - │ └── ... output files ... +After running the SPRAS command, two more folders are added to the SPRAS directory: .snakemake/log/ ---------------- - -The .snakemake/log/ directory contains records of all Snakemake jobs that were executed for the SPRAS run, including any errors encountered during those runs. - -config/ ------- -Holds configuration files (YAML) that define which algorithms to run, what datasets to use, and which analyses to perform.
- -input/ ------ +^^^^^^^^^^^^^^^ -Contains the input data files, such as interactome edge files and input nodes. This is where you can place your own datasets when running custom experiments. +The ``.snakemake/log/`` folder contains records of all Snakemake jobs that were executed for the SPRAS run. output/ ------- -Stores all results generated by SPRAS. Subfolders are created automatically for each run, and their structure can be controlled through the configuration file. +The ``output/`` folder stores the results generated during a SPRAS workflow. -By default, the directories are named to be config/, input/, and output/. The config/, input/, and output/ folders can be placed anywhere and named anything within the SPRAS repository. Their input/ and output/ locations can be updated in the configuration file, and the configuration file itself can be set by providing its path when running the SPRAS command. -SPRAS has additional files and directories to use during runs. However, for most users, and for the purposes of this tutorial, it isn't necessary to fully understand them. +.. note:: + Output folders and files can be stored anywhere, as long as the ``reconstruction_dir`` parameter in the configuration file is set to the directory path where you want the results to be saved. +.. note:: + SPRAS has additional files and directories to use during runs. However, for most users, and for the purposes of this tutorial, it isn't necessary to fully understand them. + +2.4 Running SPRAS with more parameter combinations --------------------------------------------------- -In the beginner.yaml configuration file, uncomment the run2 section under pathlinker so it looks like: +In the ``beginner.yaml`` configuration file, uncomment the ``run2`` section under ``pathlinker`` so it looks like: ..
code-block:: yaml run2: k: [10, 100] -With this update, the beginner.yaml configuration file is set up have SPRAS run a single algorithm with multiple parameter settings on one dataset. +With this update, the ``beginner.yaml`` configuration file is set up to have SPRAS run a single algorithm with multiple parameter settings on one dataset. After saving the changes, rerun with: @@ -300,31 +338,38 @@ After saving the changes, rerun with: snakemake --cores 1 --configfile config/beginner.yaml -What Happens When You Run This Command +What happens when you run this command ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1. Snakemake loads the configuration file -Snakemake reads beginner.yaml to determine which datasets, algorithms, parameters, and post-analyses to run. +Snakemake again reads ``beginner.yaml`` to determine which datasets, algorithms, parameters, and post-analyses to run. + It reuses cached results to skip completed steps, rerunning only those that are new or outdated. -Here, the dataset pickle, PathLinker inputs, and D4TUKMX parameter set are reused instead of rerun. +Here, the PathLinker prepared inputs are reused. 2. Organizing outputs per parameter combination -Each new dataset-algorithm-parameter combination gets its own folder (e.g egfr-pathlinker-params-7S4SLU6/ and egfr-pathlinker-params-VQL7BDZ/) -The hashes 7S4SLU6 and VQL7BDZ uniquely identifies the specific set of parameters used. +Each new dataset-algorithm-parameter combination gets its own folder created in ``output/beginner/``. + +A matching log file is placed in the ``logs/`` folder, recording the exact parameter values used. 3. Reusing prepared inputs with additional parameter combinations -Since PathLinker has already been run once, SPRAS uses the cached prepared inputs (network.txt, nodetypes.txt) rather than regenerating them. -For each new parameter combination, SPRAS executes the PathLinker by launching its corresponding Docker image multiple times (once for each parameter configuration).
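The short identifiers in these folder names are derived by hashing the parameter combination, so the same parameters always map to the same folder. A simplified sketch of the idea (an illustration only, not SPRAS's actual hashing code, so the labels it prints will not match hashes like 7S4SLU6):

```python
import base64
import hashlib
import json

def params_label(params: dict, length: int = 7) -> str:
    """Map a parameter combination to a short, stable folder label."""
    canonical = json.dumps(params, sort_keys=True)  # stable key ordering
    digest = hashlib.sha1(canonical.encode()).digest()
    return base64.b32encode(digest).decode()[:length]

# The same parameters always give the same label;
# different parameters give a different folder name.
print(params_label({"k": 10}))
print(params_label({"k": 100}))
```

Because the label depends only on the parameter values, rerunning a workflow with an already-seen combination lands in the existing folder and can be skipped.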
-PathLinker then runs and produces a raw-pathway.txt file specific to each parameter hash. +For each new parameter combination and its corresponding cached prepared inputs, SPRAS executes PathLinker by launching multiple Docker containers (one for each parameter configuration). + +PathLinker then runs and produces a ``raw-pathway.txt`` file specific to each parameter combination and places it in its corresponding folder. 4. Parsing into standardized results -SPRAS parses each new raw-pathway.txt file into a standardized SPRAS format (pathway.txt). +SPRAS parses each new ``raw-pathway.txt`` file into a standardized SPRAS format (``pathway.txt``) and places it in its corresponding folder. + +5. Logging the Snakemake run -What Your Directory Structure Should Like After This Run: +Snakemake creates a dated log in ``.snakemake/log/``. + + +What your directory structure should look like after this run: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: text @@ -338,7 +383,7 @@ What Your Directory Structure Should Like After This Run: │ ├── phosphosite-irefindex13.0-uniprot.txt │ └── tps-egfr-prizes.txt ├── outputs/ - │ └── basic/ + │ └── beginner/ │ └── egfr-pathlinker-params-7S4SLU6/ │ └── pathway.txt │ └── raw-pathway.txt @@ -361,20 +406,19 @@ What Your Directory Structure Should Like After This Run: 2.5 Reviewing the pathway.txt Files -------------------------------------------- +------------------------------------ -Each algorithm and parameter combination produces a corresponding pathway.txt file. -These files contain the reconstructed subnetworks and can be used at face value, or for further post analysis. +Each ``pathway.txt`` file contains the standardized reconstructed subnetworks and can be used at face value, or for further post-analysis. 1. Locate the files -Navigate to the output directory spras/output/beginner/. Inside, you will find subfolders corresponding to each dataset-algorithm-parameter combination.
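Because every ``pathway.txt`` uses the same tab-separated layout (two interacting nodes, a rank, and a direction flag), the files are easy to load programmatically. A minimal standard-library sketch; the sample rows are taken from the output shown in this tutorial, and real files may also carry a header row:

```python
import csv
from io import StringIO

# Two sample rows in the standardized pathway.txt layout.
sample = "K7PPA8_HUMAN\tMDM4_HUMAN\t9\tD\nMDM4_HUMAN\tMDM2_HUMAN\t9\tD\n"

def read_pathway(handle):
    """Yield (node1, node2, rank, direction) tuples from a pathway file."""
    for node1, node2, rank, direction in csv.reader(handle, delimiter="\t"):
        yield node1, node2, int(rank), direction

edges = list(read_pathway(StringIO(sample)))
print(edges)
```

The same function works on a real file by passing `open("pathway.txt")` instead of the in-memory sample.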
+Navigate to the output directory ``spras/output/beginner/``. Inside, you will find subfolders corresponding to each dataset-algorithm-parameter combination. -2. Open a pathway.txt file +2. Open a ``pathway.txt`` file Each file lists the network edges that were reconstructed for that specific run. The format includes columns for the two interacting nodes, the rank, and the edge direction. -For example, the file egfr-pathlinker-params-7S4SLU6/pathway.txt contains the following reconstructed subnetwork: +For example, the file ``egfr-pathlinker-params-7S4SLU6/pathway.txt`` contains the following reconstructed subnetwork: .. code-block:: text @@ -398,13 +442,15 @@ For example, the file egfr-pathlinker-params-7S4SLU6/pathway.txt contains the fo K7PPA8_HUMAN MDM4_HUMAN 9 D MDM4_HUMAN MDM2_HUMAN 9 D -The pathway.txt files serve as the foundation for further analysis, allowing you to explore and interpret the reconstructed networks in greater detail. -In this case you can visulize them in cytoscape or compare their statistics to better understand these outputs. +Step 3: Running Post-Analyses +============================== + +3.1 Adding post-analyses to the beginner configuration +------------------------------------------------------ +To enable downstream analyses, update the analysis section in your configuration file by setting both ``summary`` and ``cytoscape`` to have ``include`` set to true. -Step 3: Running Post-Analyses within SPRAS ========================================== -To enable downstream analyses, update the analysis section in your configuration file by setting both summary and cytoscape to true. Your analysis section in the configuration file should look like this: +Your analysis section in the configuration file should look like this: ..
code-block:: yaml @@ -414,8 +460,9 @@ To enable downstream analyses, update the analysis section in your configuration cytoscape: include: true -summary generates graph topological summary statistics for each algorithm's parameter combination output, generating a summary file for all reconstructed subnetworks for each dataset. -This post analysis will report these statistics for each pathway: +``summary`` generates graph topological summary statistics for each algorithm's parameter combination output, generating a summary file for all reconstructed subnetworks for a given dataset. + +This will report these statistics for each pathway: - Number of nodes - Number of edges @@ -426,9 +473,10 @@ This post analysis will report these statistics for each pathway: - Maximum diameter - Average path length -cytoscape creates a Cytoscape session file (.cys) containing all reconstructed subnetworks for each dataset, making it easy to upload and visualize them directly in Cytoscape. +``cytoscape`` creates a Cytoscape session file (.cys) that includes all reconstructed subnetworks for a given dataset, eliminating the need to manually create an individual visualization per output. +This makes it easy to upload and visualize all the results directly within Cytoscape. -With this update, the beginner.yaml configuration file is set up for SPRAS to run two post-analyses on the outputs generated by a single algorithm that was executed with multiple parameter settings on one dataset. +With this update, the ``beginner.yaml`` configuration file is set up for SPRAS to run two post-analyses on the outputs generated by a single algorithm that was executed with multiple parameter settings on one dataset. After saving the changes, rerun with: @@ -437,25 +485,28 @@ After saving the changes, rerun with: snakemake --cores 1 --configfile config/beginner.yaml -What Happens When You Run This Command +What happens when you run this command ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1. 
Reusing cached results -Snakemake reads the options set in beginner.yaml and checks for any requested post-analysis steps. -It reuses cached results; in this case, the pathway.txt files generated from the previously executed PathLinker parameter combinations for the egfr dataset. +Snakemake reads the options set in ``beginner.yaml`` and checks for any requested post-analysis steps. + +Here, the cached ``pathway.txt`` files generated from the previously executed PathLinker runs on the egfr dataset are reused. 2. Running the summary analysis -SPRAS aggregates the pathway.txt files from all selected parameter combinations into a single summary table. -The results are saved in egfr-pathway-summary.txt. +SPRAS aggregates the ``pathway.txt`` files from all selected parameter combinations into a single summary table. + +The results are saved in ``egfr-pathway-summary.txt``. 3. Running the Cytoscape analysis -All pathway.txt files from the chosen parameter combinations are collected and passed into the Cytoscape Docker image. -A Cytoscape session file is then generated, containing visualizations for each pathway and saved as egfr-cytoscape.cys. +All ``pathway.txt`` files from the chosen parameter combinations are collected and passed into the Cytoscape Docker image. -What Your Directory Structure Should Like After This Run: -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +A Cytoscape session file is then generated, containing visualizations for each pathway and saved as ``egfr-cytoscape.cys``. + +What your directory structure should look like after this run: +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: text spras/ ├── .snakemake/ │ └── log/ │ └── ... snakemake log files ...
├── config/ - │ └── basic.yaml + │ └── beginner.yaml ├── inputs/ │ ├── phosphosite-irefindex13.0-uniprot.txt │ └── tps-egfr-prizes.txt ├── outputs/ - │ └── basic/ + │ └── beginner/ │ └── egfr-pathlinker-params-7S4SLU6/ │ └── pathway.txt │ └── raw-pathway.txt @@ -491,21 +542,17 @@ What Your Directory Structure Should Like After This Run: │ └── egfr-cytoscape.cys │ └── egfr-pathway-summary.txt -Step 3.1: Reviewing the Outputs ----------------------------------- -After completing the workflow, you will have several post analysis outputs that help you explore and interpret the results: - -1. egfr-cytoscape.cys: a Cytoscape session file containing visualizations of the reconstructed subnetworks. -2. egfr-pathway-summary.txt: a summary file with statistics describing each network. +3.2 Reviewing the outputs +-------------------------- -Reviewing Summary Files -^^^^^^^^^^^^^^^^^^^^^^^^ +Reviewing the summary file +^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1. Open the summary statistics file -In your file explorer, go to spras/output/basic/egfr-pathway-summary.txt and open it locally. +In your file explorer, go to ``output/beginner/egfr-pathway-summary.txt`` and open it locally. .. image:: ../_static/images/summary-stats.png - :alt: description of the image + :alt: Summary statistics of the three parameter combinations run for PathLinker :align: center .. raw:: html
-This file summarizes the graph topological statistics for each output pathway.txt file for a given dataset, +This file summarizes the graph topological statistics for each output ``pathway.txt`` file for a given dataset, along with the parameter combinations that produced them, allowing you to interpret and compare algorithm outputs side by side in a compact format. -Reviewing Outputs in Cytoscape +Reviewing outputs in Cytoscape ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +.. note:: + Cytoscape is an open-source software platform for visualizing networks. + It allows you to explore networks interactively, apply layouts and styles, and integrate additional data for deeper analysis. + 1. Open Cytoscape Launch the Cytoscape application on your computer. 2. Load the Cytoscape session file -Navigate to spras/output/basic/egfr-cytoscape.cys and open it in Cytoscape. +Navigate to ``output/beginner/egfr-cytoscape.cys`` and open it in Cytoscape. .. image:: ../_static/images/cytoscape_upload_network.png - :alt: description of the image - :width: 500 + :alt: Cytoscape and clicking which button to upload the .cys file + :width: 600 :align: center .. raw:: html @@ -537,8 +588,8 @@ Navigate to spras/output/basic/egfr-cytoscape.cys and open it in Cytoscape.
.. image:: ../_static/images/cytoscape-open-cys-file.png - :alt: description of the image - :width: 500 + :alt: Picking which .cys file to upload to Cytoscape + :width: 600 :align: center @@ -549,7 +600,7 @@ Navigate to spras/output/basic/egfr-cytoscape.cys and open it in Cytoscape. Once loaded, the session will display all reconstructed subnetworks for a given dataset, organized by algorithm and parameter combination. .. image:: ../_static/images/cytoscape-opened.png - :alt: description of the image + :alt: What cytoscape should look like after uploading the .cys file :width: 500 :align: center @@ -558,8 +609,8 @@ You can view and interact with each reconstructed subnetwork. Compare how the di The small parameter value (k=1) produced a compact subnetwork: .. image:: ../_static/images/1_pathway.png - :alt: description of the image - :width: 400 + :alt: The output network with parameter combination k = 1 used for PathLinker visualized in Cytoscape + :width: 600 :align: center .. raw:: html @@ -570,7 +621,7 @@ The small parameter value (k=1) produced a compact subnetwork: The moderate parameter value (k=10) expanded the subnetwork, introducing additional nodes and edges that may uncover new connections: .. image:: ../_static/images/10_pathway.png - :alt: description of the image + :alt: The output network with parameter combination k = 10 used for PathLinker visualized in Cytoscape :width: 600 :align: center @@ -581,13 +632,10 @@ The moderate parameter value (k=10) expanded the subnetwork, introducing additio The large parameter value (k=100) generates a much denser subnetwork, capturing a broader range of edges but also could introduce connections that may be less meaningful: .. image:: ../_static/images/100_pathway.png - :alt: description of the image + :alt: The output network with parameter combination k = 100 used for PathLinker visualized in Cytoscape :width: 600 :align: center .. raw:: html -
- -The parameters used here help determine which edges and nodes are included; each setting produces a different subnetwork. -By examining the statistics (egfr-pathway-summary.txt) alongside the visualizations (Cytoscape), you can assess how parameter choices influence both the structure and interpretability of the outputs. \ No newline at end of file +
\ No newline at end of file diff --git a/docs/tutorial/intermediate.rst b/docs/tutorial/intermediate.rst index 5929ec1b0..2e569e092 100644 --- a/docs/tutorial/intermediate.rst +++ b/docs/tutorial/intermediate.rst @@ -1,8 +1,9 @@ -########################################################## -Intermediate Tutorial - Custom Data & Multi-Algorithm Runs -########################################################## +########################################################### +Intermediate Tutorial - Prepare Data & Multi-Algorithm Runs +########################################################### This tutorial builds on the introduction to SPRAS from the previous tutorial. + It guides participants through how to convert data into a format usable by pathway reconstruction algorithms, run multiple algorithms within a single workflow, and apply new tools to interpret and compare the resulting pathways. You will learn how to: @@ -11,105 +12,370 @@ You will learn how to: - Configure and run additional pathway reconstruction algorithms on a dataset - Enable post-analysis steps to generate post analysis information -Step 1: Transforming Data into SPRAS-Compatible Inputs -====================================================== +Step 1: Transforming high throughput experimental data into SPRAS compatible input data +======================================================================================== + +1.1 What is the SPRAS-standardized input data? +----------------------------------------------- + +A pathway reconstruction algorithm requires a set of input nodes and an interactome; however, each algorithm expects its inputs to follow a unique format. + +To simplify this process, SPRAS requires all input data in a dataset to be formatted once into a standardized SPRAS format. +SPRAS then automatically generates algorithm-specific input files when an algorithm is enabled in the configuration file. + +.. 
note:: + Each algorithm uses the input nodes to guide or constrain the optimization process used to reconstruct subnetworks. + + An algorithm maps these input nodes onto the interactome and uses the network to identify connecting paths between the input nodes to form subnetworks. -1.1 Understanding the Data ------------------------------------------------------------------- -We start with mass spectrometry data containing three biological replicates, each with two technical replicates (IMAC and IP). -ADD THAT WE COMBINE THESE TOGETHER -Each replicate measures peptide abundance across multiple time points (0 to 124 minutes). +Pathway reconstruction algorithms differ in the input nodes they require and how they interpret those nodes to identify subnetworks. -Show images and charts as to what is changing instead of giving the code +- Some use source and target nodes to define start and end points. +- Some use prizes, which are numerical scores assigned to nodes of interest. +- Some rely on active nodes, representing nodes that are significantly “on” under specific conditions. -The goal is to turn this experimental data into the format that SPRAS expects; -a list of proteins with associated prizes and a defined set of source and target proteins. +An example of a node file required by SPRAS follows a tab-separated format: +.. code-block:: text + + NODEID prize sources targets active + A 1.0 True True + B 3.3 True True + C 2.5 True True + D 1.9 True True -1.2 Filtering and Normalizing the Replicates ------------------------------------------------------------------- -When working with multiple replicates, we want to ensure that all of the peptides measures are present in all three replicates. -This guarantees consistent observation of the peptides across experiments.
+ - Source-target nodes can be used with all algorithms by making a prize column set to 1 and an active column set to True. + - Prize data can be adapted for active-based algorithms by automatically making an active column set to True. + - Active data can be adapted for prize-based algorithms by making a prize column set to 1. -For each replicate after removing the peptides that are not in all three replicates, each replicate needs to be renoramlized to ensure each replicate is internally consistent and comparable, reducing bias from replicate specific intensity differences. +Along with differences in their input nodes, pathway reconstruction algorithms also interpret the input interactome differently. -1.3 Detecting Significant Changes using Tukey's HSD Test -------------------------------------------------------------- +- Some algorithms can handle only fully directed interactomes. These interactomes include edges with a specific direction (A -> B). +- Others work only with fully undirected interactomes. These interactomes have edges without direction (A - B). +- And some support mixed-directionality interactomes. These interactomes contain both directed and undirected edges. -After filtering and renormalizing, Tukey's Honest Significant Difference (HSD) test is preformed for each peptide. -Tukey's HSD evaluates the significance of differences in mean peptide intensities across all pairs of time points while correcting for multiple comparisons within each peptide's time course. +SPRAS automatically converts the user-provided edge file into the format expected by each algorithm, ensuring that the directionality of the interactome matches the algorithm's requirements. -For each peptide, Tukey's HSD reports a p-value for every pair of time points, representing how likely the observed difference in abundance occurred by chance across the three biological replicates.
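As a concrete illustration of that directionality conversion (a sketch of the idea, not SPRAS's actual code): an algorithm that accepts only directed edges can be fed a mixed interactome by expanding each undirected edge into both orientations.

```python
# Edge rows follow a (node1, node2, weight, direction) layout,
# where "U" marks undirected and "D" marks directed edges.
mixed = [
    ("A", "B", 0.98, "U"),
    ("B", "C", 0.77, "D"),
]

def to_fully_directed(edges):
    """Expand each undirected edge into two directed edges."""
    directed = []
    for u, v, w, d in edges:
        directed.append((u, v, w))
        if d == "U":  # also add the reverse orientation
            directed.append((v, u, w))
    return directed

print(to_fully_directed(mixed))
```

The reverse adaptation (fully undirected) would instead drop the direction flag, treating `A -> B` as the unordered pair `A - B`.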
-Lower p-values indicate stronger evidence that a peptide's abundance truly changes between those time points. +An example of an edge file required by SPRAS follows a tab-separated format, where ``U`` indicates an undirected edge and ``D`` indicates a directed edge: +.. code-block:: text -1.4 From p-values to Prizes for Pathway Reconstruction -------------------------------------------------------- + A B 0.98 U + B C 0.77 D -In SPRAS, prizes quantify how “interesting” a protein is to a given condition. -Peptides with low p-values reflect statistically significant changes and therefore are likely to represent interesting biologically active or perturbed proteins to use for pathway reconstruction. +.. note:: + SPRAS supports multiple standardized input formats. + More information about input data formats can be found in the ``inputs/README.md`` file within the SPRAS repository. -We transform the p-values into scores that capture statistically significant changes across replicates using the transformation -log10(p-value). -This produces higher scores for smaller p-values, highlighting peptides with stronger changes over time. -To compute these scores, we identify the smallest p-value across all relevant time comparisons for each peptide. -The relevant comparisons include each time point versus the baseline (0 min) and each consecutive time point. +1.2 Example high throughput data +--------------------------------- -We then apply the -log10 transformation to the smallest p-value for each peptide to obtain a positive prize score, where smaller p-values yield higher scores. -This process generates a peptide-level prize table that quantifies how strongly each peptide responds over time. +The example dataset uses EGF response mass spectrometry data [4]_. +The experiment for this data was repeated three times, known as biological replicates, to ensure the results are consistent.
+Each replicate measures the abundance of peptides at different time points (0-128 minutes) to capture how protein activity changes over time. + +.. note:: + Mass spectrometry is a technique used to measure and identify proteins in a sample. + It works by breaking proteins into smaller pieces called peptides and measuring their mass-to-charge ratio, which makes it possible to identify which peptide is being measured. + The data show how much of each peptide is present, which can reveal how protein phosphorylation abundances change under different conditions. + + Since proteins interact with each other in biological pathways, changes in their phosphorylation abundances can reveal which parts of a pathway are active or affected. + By mapping these changing proteins onto known interaction networks, pathway reconstruction algorithms can identify which signaling pathways are likely involved in the biological response to a specific condition. + +An example from biological replicate A with one peptide: + +.. list-table:: + :header-rows: 1 + :widths: 20 15 10 10 10 10 10 10 10 10 10 10 + + * - peptide + - protein + - gene.name + - modified.sites + - 0 min + - 2 min + - 4 min + - 8 min + - 16 min + - 32 min + - 64 min + - 128 min + * - K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.- + - Q6PD74,B4DG44,Q5JPJ4,Q6AWA0 + - AAGAB + - S310,S311 + - 14.97 + - 14.81 + - 13.99 + - 13.98 + - 12.87 + - 13.88 + - 13.91 + - 15.60 + + +The goal is to turn this experimental data into the format that SPRAS expects. + + 1.3 Filtering and normalizing the replicates +---------------------------------------------- + +Before analysis, we filter out peptides not present in all three replicates to ensure consistency. +Then, we normalize each replicate so intensity values are comparable and not biased by replicate-specific effects. + ..
list-table:: + :header-rows: 1 + :widths: 20 15 10 10 10 10 10 10 10 10 10 10 10 + + * - peptide + - protein + - gene.name + - modified.sites + - 0 min + - 2 min + - 4 min + - 8 min + - 16 min + - 32 min + - 64 min + - 128 mn + - replicate + * - K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.- + - Q6PD74,B4DG44,Q5JPJ4,Q6AWA0 + - AAGAB + - S310,S311 + - 2.17 + - 2.09 + - 1.98 + - 1.78 + - 1.99 + - 2.12 + - 2.25 + - 1.46 + - C + * - K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.- + - Q6PD74,B4DG44,Q5JPJ4,Q6AWA0 + - AAGAB + - S310,S311 + - 4.03 + - 3.73 + - 3.32 + - 3.36 + - 3.35 + - 3.37 + - 3.35 + - 3.86 + - B + * - K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.- + - Q6PD74,B4DG44,Q5JPJ4,Q6AWA0 + - AAGAB + - S310,S311 + - 5.60 + - 4.75 + - 4.69 + - 4.59 + - 4.32 + - 4.90 + - 4.90 + - 5.48 + - A + +1.4 Computing p-values using Tukey's HSD Test +----------------------------------------------- -1.5 Aggregating Prizes at the Protein Level --------------------------------------------- +We want to calculate the p-values per peptide. +This tells us how likely changes in abundance happen by chance. + +We use Tukey's Honest Significant Difference (HSD) test to compare all time points and correct for multiple testing to get a p-value for every pair of time points. + +.. 
list-table:: + :header-rows: 1 + :widths: 25 20 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 + + * - peptide + - protein + - 2min vs 0min + - 4min vs 0min + - 8min vs 0min + - 16min vs 0min + - 32min vs 0min + - 64min vs 0min + - 128min vs 0min + - 4min vs 2min + - 8min vs 2min + - 16min vs 2min + - 32min vs 2min + - 64min vs 2min + - 128min vs 2min + - 8min vs 4min + - 16min vs 4min + - 32min vs 4min + - 64min vs 4min + - 128min vs 4min + - 16min vs 8min + - 32min vs 8min + - 64min vs 8min + - 128min vs 8min + - 32min vs 16min + - 64min vs 16min + - 128min vs 16min + - 64min vs 32min + - 128min vs 32min + - 128min vs 64min + * - K.n[305.21]ADVLEAHEAEAEEPEAGK[432.30]S[167.00]EAEDDEDEVDDLPSSR.R + - Q6PD74,B4DG44,Q5JPJ4,Q6AWA0 + - 0.67 + - 0.25 + - 0.14 + - 0.12 + - 0.52 + - 0.76 + - 0.84 + - 0.99 + - 0.93 + - 0.90 + - 1.00 + - 1.00 + - 1.00 + - 1.00 + - 1.00 + - 1.00 + - 0.97 + - 0.94 + - 1.00 + - 0.98 + - 0.87 + - 0.80 + - 0.96 + - 0.83 + - 0.75 + - 1.00 + - 1.00 + - 1.00 + + +Peptides with lower p-values are more statistically significant and may represent biologically meaningful changes in phosphorylation over time. +1.5 From p-values to prizes +---------------------------- +P-values are transformed using ``-log10(p-value)`` so smaller p-values give larger prize scores. +For each peptide, the smallest p-value (representing the most significant change) is selected from the comparisons of each time point against the baseline (0 min) and of consecutive time points. +This is because the ultimate network analysis will not use the temporal information. +For each protein mapped to multiple peptides, the maximum prize value across all its peptides is assigned. +Finally, all protein identifiers (using the first one listed for each protein) are converted to UniProt Entry Names to match the identifiers that will be used in the interactome. +.. note:: + All node identifiers should use the same namespace across every part of the data in a dataset. + ..
list-table:: + :header-rows: 1 + :widths: 25 20 7 7 7 + + * - peptide + - protein + - uniprot entry name + - min p-value + - -log10(min p-value) + * - K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.- + - Q6PD74,B4DG44,Q5JPJ4,Q6AWA0 + - AAGAB_HUMAN + - 0.12392034609392 + - 0.906857382317364 + + +Input node data put into a SPRAS-standardized format: -Multiple peptides can map to the same protein in this data, so we keep the maximum prize among all its peptides, representing the strongest observed response. +.. code-block:: text -We also convert the protein identifiers to UniProt Entry Names to ensure consistency across the other data sources that will be used, allowing all data components to align within the same naming space. + NODEID prize + AAGAB_HUMAN 0.906857382 1.6 From Prizes to Source and Targets / Actives ----------------------------------------------- -- add the egfr pathway (cite it) +.. image:: ../_static/images/erbb-signaling-pathway.png + :alt: The KEGG ErbB signaling pathway (hsa04012). + :width: 400 + :align: center +.. raw:: html
-We use prior biological knowledge to guide this. -For example, in the EGFR signaling pathway, EGF acts as the initiating signal and EGFR as its receptor. -We can set EGF as the source (with the highest prize score) and EGFR as a target (with the second-highest score). -All other pathway proteins are treated as targets (with the score set from the previous step), since they represent downstream components influenced by EGF-EGFR signaling. -Finally, actives refer to nodes in a biological network that are significantly “on” or highly active under a given biological condition. -In this context, all proteins chosen can be considered active since they correspond to active under the given biological condition. +Using known pathway knowledge [1]_ [2]_ [3]_: +- EGF serves as a source for the pathway and was the experimental treatment. +- EGF is known to initiate signaling, so it can be added and assigned a high score (greater than all other nodes) to emphasize its importance and guide algorithms to start reconstruction from this point. (EGF is currently not in the data) +- EGFR is in the current data. Looking at the pathway, we can see that EGFR directly interacts with EGF in the pathway. +- All other downstream proteins detected in the data can also treated as targets. +- All proteins in the data can be considered active since they correspond to proteins that are active under the given biological condition. -1.7 Combing the data into a spras standardized data ---------------------------------------------------- +Input node data put into a SPRAS-standardized format: +.. code-block:: text + + NODE_ID prize source target active + AAGAB_HUMAN 0.906857382 True True + ... more nodes + EGF_HUMAN 10 True True True + EGFR_HUMAN 6.787874699 True True + ... 
more nodes 1.8 Finding an Interactome to use ---------------------------------- -Next, we need to define the interactome, the background protein-protein interaction (PPI) network used by pathway reconstruction algorithms to identify connections between sources and targets, prizes, and actives. +To connect our proteins, we use a background protein-protein interaction (PPI) network (the interactome). +For this dataset, two interactomes are merged (directed edges prioritized when available): + +- iRefIndex v13 (159,095 undirected interactions) +- PhosphoSitePlus (4,080 directed kinase-substrate interactions) -Databases, such as STRING, contatin interactomes that represent known interacts between proteins. +.. image:: ../_static/images/egf-interactome.png + :alt: The combined interactome of iRefIndex v13 and PhosphoSitePlus + :width: 600 + :align: center +.. raw:: html -However, for this analysis, we use a human PPI network compiled from two sources: +
- -iRefIndex (version 13.0), containing 159,095 undirected interactions, and -PhosphoSitePlus, containing 4,080 directed kinase–substrate interactions. -We merge the two sources, by prioritizing directed edges wherever possible otherwise keeping the undirected edges. -The final network contains 15,677 proteins, 157,984 undirected, and 3,917 directed interactions, using UniProt Entry Names for the identifiers of the nodes. +The final network has 15,677 proteins, with 157,984 undirected and 3,917 directed edges, and covers 653 of our 702 prize proteins. +The protein identifiers in the interactome are converted to use UniProt Entry Names. -This interactome includes 653 of the 701 proteins with mass spectrometry-based prizes. +Interactome data put into a SPRAS-standardized format: +.. code-block:: text + + TACC1_HUMAN RUXG_HUMAN 0.736771 U + TACC1_HUMAN KAT2A_HUMAN 0.292198 U + TACC1_HUMAN CKAP5_HUMAN 0.724783 U + TACC1_HUMAN YETS4_HUMAN 0.542597 U + TACC1_HUMAN LSM7_HUMAN 0.714823 U + AURKC_HUMAN TACC1_HUMAN 0.553333 D + TACC1_HUMAN AURKA_HUMAN 0.401165 U + TACC1_HUMAN KDM1A_HUMAN 0.367850 U + TACC1_HUMAN MEMO1_HUMAN 0.367850 U + TACC1_HUMAN HD_HUMAN 0.367850 U + ... more edges +.. note:: + Many databases exist that provide interactomes. One is `STRING `__, which contains known protein-protein interactions across different species. +1.9 This SPRAS-standardized data is already saved in SPRAS +------------------------------------------------------------ .. code-block:: text
+The data used in this part of the tutorial can be found in the `supplementary materials `_ under data supplement 2 and supplement 3 [4]_. -Step 2: Adding multiple PRAs to the workflow -============================================= +Step 2: Running multiple algorithms +==================================== -Now that we've prepared our input data, we can begin running multiple pathway reconstruction algorithms on it. +We can begin running multiple pathway reconstruction algorithms. For this part of the tutorial, we'll use a pre-defined configuration file that includes additional algorithms and post-analysis steps available in SPRAS. Download it here: :download:`Intermediate Config File <../_static/config/intermediate.yaml>` Save the file into the config/ folder of your SPRAS installation. -After adding this file, SPRAS will use the configuration to set up and reference your directory structure, which will look like this: +After adding this file, your directory structure will look like this (ignoring the rest of the folders): .. code-block:: text @@ -147,109 +414,94 @@ After adding this file, SPRAS will use the configuration to set up and reference │ └── log/ │ └── ... snakemake log files ... ├── config/ - │ └── basic.yaml - │ └── intermediate.yaml + │ ├── basic.yaml + │ ├── intermediate.yaml + │ └── ... other configs ... ├── inputs/ - │ ├── THE DATA - │ └── THE NETWORK + │ ├── phosphosite-irefindex13.0-uniprot.txt # pre-defined in SPRAS already, used by the intermediate.yaml file + │ ├── tps-egfr-prizes.txt # pre-defined in SPRAS already, used by the intermediate.yaml file + │ └── ... other input data ... ├── outputs/ │ └── basic/ │ └── ... output files ... 
-2.1 Supported Algorithms in SPRAS +2.1 Algorithms in SPRAS --------------------------------- -SPRAS supports a wide range of algorithms, each designed around different biological assumptions and optimization strategies: - -- Pathlinker -- Omics Integrator 1 -- Omics Integrator 2 -- MEO -- Minimum-Cost Flow -- All pairs shortest paths -- Domino -- Source-Targets Random Walk with Restarts -- Random Walk with Restarts -- BowTieBuilder (Not optimized for large datasets; slower on big networks) -- ResponseNet - -Wrapped Algorithms -^^^^^^^^^^^^^^^^^^^ -Each algorithm has been wrapped by SPRAS. -Wrapping an algorithm in SPRAS involves three main steps: +SPRAS supports a wide range of algorithms, each designed around different biological assumptions and optimization strategies. +(See :doc:`Pathway Reconstruction Methods <../prms/prms>` for SPRAS's list of integrated algorithms.) -1. Input generation: SPRAS creates and formats the input files required by the algorithm based on the provided dataset -2. Execution: SPRAS runs the algorithm within its corresponding Docker container, which holds the algorithm code. This is called for each specified parameter combination in the configuration file. -3. Output standardization: The raw outputs are converted into a standardized SPRAS format +Wrapped algorithms +^^^^^^^^^^^^^^^^^^^ +Each pathway reconstruction algorithm has been wrapped for SPRAS, meaning it has been adapted to accept SPRAS-standardized inputs and produce SPRAS-standardized outputs. -Inputs -^^^^^^^ -These pathway reconstruction algorithms differ in the inputs nodes they require and how they interpret those nodes to identify subnetworks. -Some use source and target nodes to connect predefined start and end points, others use prizes, which are scores assigned to nodes of interest, and some rely on active nodes that represent proteins or genes significantly “on” or perturbed under specific biological conditions.
+Each algorithm-specific wrapper includes a module that creates and formats the input files required by the algorithm from the SPRAS-standardized input data. -Along with differences in their inputs nodes, these algorithms also interpret the input interactome differently. -Some can handle directed graphs, others work only with undirected graphs, and a few support mixed directionaltiy graphs. +Each algorithm has an associated Docker image located on `DockerHub `__ that contains all of the software dependencies needed to run it. +The wrapper also contains a module that launches a container from this image for each specified parameter combination, passing in the prepared algorithm-specific inputs and an output filename (``raw-pathway.txt``). -Parameters -^^^^^^^^^^ -Each algorithm also exposes its own set of parameters that control its optimization strategy. -Some algorithms have no adjustable parameters, while others include multiple tunable settings that influence how subnetworks are created. -These parameters vary widely between algorithms and reflect the unique optimization techniques each method employs under the hood. +Finally, each wrapper includes a module that converts the ``raw-pathway.txt`` files from the algorithm-specific format into a standardized SPRAS format. -2.3 Running SPRAS with Multiple Algorithms +2.3 Running SPRAS with multiple algorithms ------------------------------------------ -In the intermediate.yaml configuration file, it is set up have SPRAS run multiple algorithms (all of the algorithms supported in SPRAS except BowTieBuilder) with multiple parameter settings (if available) on one dataset. +The ``intermediate.yaml`` configuration file is set up to have SPRAS run multiple algorithms with multiple parameter settings on a single dataset.
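As a hedged sketch of how an algorithm's parameter settings can be declared in the configuration (the algorithm name and exact nesting here are illustrative; follow the schema in the ``intermediate.yaml`` you downloaded), a scalar value defines a single run while a list expands into one run per value, mirroring the ``beginner.yaml`` excerpt:

```yaml
algorithms:
  - name: pathlinker        # illustrative; any wrapped algorithm
    params:
      include: true         # set to false to skip this algorithm
      run1:
        k: 1                # a scalar -> one run
      run2:
        k: [10, 100]        # a list -> one run per value
```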
-From the root directory spras/, run the command below from the command line: +From the root directory, run the command below from the command line: .. code:: bash snakemake --cores 4 --configfile config/intermediate.yaml -What Happens When You Run This Command +What happens when you run this command ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -SPRAS will run more slowly than the beginner.yaml configuration. -The same automated steps as in beginner.yaml (managed by Snakemake and Docker) run behind the scenes for intermediate.yaml; however, this configuration now runs multiple algorithms with different parameter combinations, which takes longer to complete. +SPRAS will run more slowly when using the ``intermediate.yaml`` configuration. + +Similar automated steps from the previous tutorial run behind the scenes for ``intermediate.yaml``. +However, this configuration now runs multiple algorithms with different parameter combinations, which takes longer to complete. Increasing the number of cores to 4 allows Snakemake to parallelize the work locally, speeding up execution when possible. +(See :doc:`Using SPRAS <../usage>` for more information on SPRAS's parallelization.) + 1. Snakemake starts the workflow -Snakemake reads the options set in the intermediate.yaml configuration file and determines which datasets, algorithms, and parameter combinations need to run. It also checks if any post-analysis steps were requested. +Snakemake reads the options set in the ``intermediate.yaml`` configuration file and determines which datasets, algorithms, and parameter combinations need to run. +It also checks if any post-analysis steps were requested. + +2. Creating algorithm-specific inputs -2. Preparing the dataset +For each algorithm marked as ``include: true`` in the configuration, SPRAS generates input files tailored to that algorithm.
-SPRAS takes the interactome and node prize files specified in the configuration and bundles them into a Dataset object to be used for processing algorithm specific inputs. -This object is stored as a .pickle file so it can be reused for other algorithms without re-processing it. +In this case, every algorithm is enabled, so SPRAS formats the input files required for each algorithm. -3. Creating algorithm specific inputs +3. Organizing results with parameter hashes -For each algorithm marked as include: true in the configuration, SPRAS generates input files tailored to that algorithm. -In this case, every algorithm is enabled, so SPRAS creates the files required for each algorithm. +Each --params- combination gets its own folder created in ``output/intermediate/``. -4. Organizing results with parameter hashes +A matching log file in ``logs/parameters--params-.yaml`` records the exact parameter values used. -Each --params- combination folder is created. -A matching log file in logs/parameters--params-.yaml records the exact parameter values used. +4. Running the algorithm -5. Running the algorithm +SPRAS pulls each algorithm's Docker image from `DockerHub `__ if it isn't already downloaded locally. -SPRAS executes each algorithm by launching its corresponding Docker image multiple times (once for each parameter configuration). -During each run, SPRAS provides the prepared input files and the corresponding parameter settings to the container. Each algorithm then runs independently within its Docker environment and produces a raw pathway output file (raw-pathway.txt), which contains the reconstructed subnetwork in the algorithm's native format. +SPRAS executes each algorithm by launching multiple Docker containers from the algorithm-specific Docker image (one for each parameter configuration), sending in the prepared input files and the specific parameter settings needed for execution. -6.
Standardizing the results +Each algorithm runs independently within its Docker container and generates an output file named ``raw-pathway.txt``, which contains the reconstructed subnetwork in the algorithm-specific format. -SPRAS parses each of the raw output into a standardized SPRAS format (pathway.txt). -This ensures all algorithms output are put into a standardized output, because their native formats differ. +SPRAS then saves these files to the corresponding folder. -7. Logging the Snakemake run +5. Standardizing the results -Snakemake creates a dated log in .snakemake/log/. This log shows what rules ran and any errors that occurred during the SPRAS run. +SPRAS parses each raw output into a standardized SPRAS format (``pathway.txt``) and saves this file in its corresponding folder. +6. Logging the Snakemake run -What Your Directory Structure Should Like After This Run: +Snakemake creates a dated log in ``.snakemake/log/``. This log shows what jobs ran and any errors that occurred during the SPRAS run. + + +What your directory structure should look like after this run: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: text @@ -377,15 +629,22 @@ What Your Directory Structure Should Like After This Run: | ├── sources.txt | └── targets.txt -2.4 Reviewing the pathway.txt Files ------------------------------------------- -After running the intermediate configuration file, the output/intermediate/ directory will contain many more subfolders and files. +2.4 Reviewing the pathway.txt files ------------------------------------- +After running the intermediate configuration file, the ``output/intermediate/`` directory will contain many more subfolders and files. + +Again, each ``pathway.txt`` file contains a standardized reconstructed subnetwork and can be used at face value or for further post-analysis. -Just like in the beginner tutorial, each algorithm's results can be found in the spras/output/intermediate/ directory.
-Within it, you'll see subfolders corresponding to each dataset-algorithm-parameter combination. -Each folder contains a pathway.txt file that contains the standardized reconstructed subnetwork for that specific run. +1. Locate the files -For example, the file egfr-mincostflow-params-42UBTQI/pathway.txt contains the following reconstructed subnetwork: +Navigate to the output directory ``output/intermediate/``. Inside, you will find subfolders corresponding to each --params- combination. + +2. Open a ``pathway.txt`` file + +Each file lists the network edges that were reconstructed for that specific run. The format includes columns for the two interacting nodes, the rank, and the edge direction. + + +For example, the file ``egfr-mincostflow-params-42UBTQI/pathway.txt`` contains the following reconstructed subnetwork: .. code-block:: text @@ -403,7 +662,7 @@ For example, the file egfr-mincostflow-params-42UBTQI/pathway.txt contains the f EMD_HUMAN SRC_HUMAN 1 U -And the file egfr-omicsintegrator1-params-GUMLBDZ/pathway.txt contains the following reconstructed subnetwork: +And the file ``egfr-omicsintegrator1-params-GUMLBDZ/pathway.txt`` contains the following reconstructed subnetwork: .. code-block:: text @@ -430,29 +689,25 @@ And the file egfr-omicsintegrator1-params-GUMLBDZ/pathway.txt contains the follo MRE11_HUMAN RAD50_HUMAN 1 U -As you explore more of these files, you'll notice that the subnetworks vary widely across algorithms and parameter settings. -While you can still open and inspect these files manually, the number of outputs is much greater than in the beginner.yaml run, making manual inspection less practical. -The pathway.txt outputs serve as the foundation for further post-analysis, where you can systematically compare and interpret the reconstructed networks in greater detail. - -In the next steps, we'll use SPRAS's post analysis tools to further explore and analyze these outputs. 
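Because every run uses the same standardized four-column layout (node1, node2, rank, direction), outputs from different runs can be compared directly. A small, hypothetical Python sketch (the shared EGFR_HUMAN-GRB2_HUMAN edge is invented for illustration; the other two edges appear in the excerpts above):

```python
# Hypothetical sketch: load two standardized pathway.txt files
# (columns: node1, node2, rank, direction) and compare their edge sets.
def load_edges(lines):
    return {tuple(line.split()[:2]) for line in lines if line.strip()}

# run_a mimics a mincostflow output, run_b an omicsintegrator1 output;
# the EGFR_HUMAN-GRB2_HUMAN edge is invented so the runs overlap.
run_a = load_edges(["EGFR_HUMAN GRB2_HUMAN 1 U", "EMD_HUMAN SRC_HUMAN 1 U"])
run_b = load_edges(["EGFR_HUMAN GRB2_HUMAN 1 U", "MRE11_HUMAN RAD50_HUMAN 1 U"])
shared = run_a & run_b   # edges recovered by both runs
only_a = run_a - run_b   # edges unique to the first run
```

This kind of set arithmetic is the basis of the ensemble and similarity analyses introduced in the next step.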
- -Step 3: Use ML Post-Analysis +Step 3: Use ML post-analysis ============================= -To enable downstream analyses, update the analysis section in your configuration file by setting both summary, cytoscape, and ml, to true. Your analysis section in the configuration file should look like this: +3.1 Adding ML post-analysis to the intermediate configuration +------------------------------------------------------------- + +To enable the ML analysis, update the analysis section in your configuration file by setting ``ml`` to true. +Your analysis section in the configuration file should look like this: .. code-block:: yaml analysis: ml: include: true + ... (other parameters preset) -In this part of the tutorial, we're also including the machine learning (ml) section to enable machine learning-based post-analysis built within SPRAS. +``ml`` will perform unsupervised analyses such as principal component analysis (PCA), hierarchical agglomerative clustering (HAC), ensembling, and Jaccard similarity comparisons of the pathways. -The ml analysis will perform unsupervised analyses such as Principal Component Analysis (PCA), Hierarchical Agglomerative Clustering (HAC), ensembling, and Jaccard similarity comparisons of the pathways. -These analyses help uncover patterns and similarities between different algorithms run on a given dataset -- if aggregate_per_algorithm: is set to true, it additionally groups outputs by algorithm within each dataset to uncover patterns and similarities for an algorithm -- The ``ml`` section includes configurable parameters that let you adjust the behavior of the analyses performed. With these updates, SPRAS will run the full set of unsupervised machine learning analyses across all outputs for a given dataset.
@@ -463,21 +718,20 @@ After saving the changes in the configuration file, rerun with: snakemake --cores 4 --configfile config/intermediate.yaml -What Happens When You Run This Command +What happens when you run this command ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1. Reusing cached results -Snakemake reads the options set in intermediate.yaml and checks for any requested post-analysis steps. -It reuses cached results; in this case, the pathway.txt files generated from the previously executed algorithms + parameter combinations on the egfr dataset. +Snakemake reads the options set in ``intermediate.yaml`` and checks for any requested post-analysis steps. +It reuses cached results; here, those are the ``pathway.txt`` files generated from the previously executed algorithms on the egfr dataset. 2. Running the ml analysis -SPRAS aggregates all files generated for a dataset. -These groupings include all the reconstructed subnetworks produced across algorithm for a given dataset (and, if enabled, grouped outputs per algorithm for a given dataset). -SPRAS then performs all machine learning analyses on each grouping and saves the results in the dataset-ml/ directory. +SPRAS aggregates all the reconstructed subnetworks produced across the specified algorithms for a given dataset. +SPRAS then performs machine learning analyses on each of these groups and saves the results in the ``-ml/`` (``egfr-ml/``) folder. -What Your Directory Structure Should Like After This Run: +What your directory structure should look like after this run: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ..
code-block:: text @@ -616,17 +870,18 @@ What Your Directory Structure Should Like After This Run: | ├── sources.txt | └── targets.txt -Step 3.1: Reviewing the Outputs -------------------------------- +Step 3.2: Reviewing the ML outputs +----------------------------------- Ensembles ^^^^^^^^^ -After running multiple algorithms or parameter settings on the same dataset, SPRAS can ensemble the resulting pathways to identify consistent, high-confidence interactions. -Each pathway output is represented as a binary edge list (1 = edge present, 0 = edge absent). -SPRAS calculates the mean of these binary values across all runs to determine the edge frequency (the proportion of times each edge appears across the outputs). -Edges that occur more often are considered more robust and can be used to build a consensus network. +1. Open the ensemble file + +In your file explorer, go to ``output/intermediate/egfr-ml/ensemble-pathway.txt`` and open it locally. + +After running multiple algorithms or parameter settings on the same dataset, SPRAS can ensemble the resulting pathways to identify consistent, high-frequency interactions. +SPRAS calculates the edge frequency as the proportion of times each edge appears across the outputs. .. code-block:: text @@ -647,20 +902,26 @@ Edges that occur more often are considered more robust and can be used to build K7PPA8_HUMAN EP300_HUMAN 0.09523809523809523 D ... -High frequency edges indicate interactions consistently recovered by multiple algorithms, suggesting stronger biological relevance. +High frequency edges indicate interactions consistently recovered by multiple algorithms. Low frequency edges may reflect noise or algorithm-specific connections. -HAC -^^^ -SPRAS includes Hierarchical Agglomerative Clustering (HAC) to group similar pathways outputs based on shared edges. +Hierarchical agglomerative clustering +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +1.
Open the HAC image(s) + +In your file explorer, go to ``output/intermediate/egfr-ml/hac-horizontal.png`` and/or ``output/intermediate/egfr-ml/hac-vertical.png`` and open them locally. + + +SPRAS includes HAC to group similar pathway outputs based on shared edges. This helps identify clusters of algorithms that produce comparable subnetworks and highlights distinct reconstruction behaviors. In the plots below, each branch represents a cluster of related pathways. -Shorter distances between branches indicate greater similarity. +Shorter distances between branches indicate outputs with greater similarity. .. image:: ../_static/images/hac-horizontal.png - :alt: description of the image - :width: 500 + :alt: Hierarchical agglomerative clustering horizontal view + :width: 600 :align: center .. raw:: html @@ -668,7 +929,7 @@ Shorter distances between branches indicate greater similarity.
.. image:: ../_static/images/hac-vertical.png - :alt: description of the image + :alt: Hierarchical agglomerative clustering vertical view with colors only :width: 300 :align: center @@ -677,40 +938,57 @@ Shorter distances between branches indicate greater similarity.
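The clustering idea behind these dendrograms can be sketched in a few lines of Python. This is a hypothetical, simplified single-linkage implementation over pathway outputs represented as edge sets, not SPRAS's actual code (which uses library routines); the toy pathways p1-p3 are invented:

```python
# Hypothetical sketch of hierarchical agglomerative clustering (single
# linkage) over pathway outputs represented as sets of edges.
def jaccard_distance(a, b):
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)

def single_linkage(clusters):
    """Greedily merge the two closest clusters until one remains,
    returning the merge order (a dendrogram in list form)."""
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return merges

# Three toy pathway outputs: two similar, one distinct.
p1 = {("A", "B"), ("B", "C")}
p2 = {("A", "B"), ("B", "C"), ("C", "D")}
p3 = {("X", "Y")}
merges = single_linkage([[p1], [p2], [p3]])  # p1 and p2 merge first
```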
HAC visualizations help compare which algorithms and parameter settings produce similar pathway structures. -Tight clusters indicate similar behavior, while isolated branches may reveal unique or outlier results. +Tight clusters indicate similar output behavior, while isolated branches may reveal unique results. + +Principal component analysis +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -PCA -^^^ -SPRAS also includes Principal Component Analysis (PCA) to visualize variation across pathway outputs. -Each point represents a pathway, places based on its overall network structure. +1. Open the PCA image + +In your file explorer, go to ``output/intermediate/egfr-ml/pca.png`` and open it locally. + +SPRAS also includes PCA to visualize variation across pathway outputs. +Each point represents a pathway, placed based on its overall network structure. Pathways that cluster together in PCA space are more similar, while those farther apart differ in their reconstructed subnetworks. .. image:: ../_static/images/pca.png - :alt: description of the image - :width: 500 + :alt: Principal component analysis visualization across pathway outputs + :width: 600 :align: center .. raw:: html
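To make the "binary edge vector" representation behind the PCA plot concrete, here is a hypothetical, stdlib-only sketch that projects toy pathway outputs onto their first principal component via power iteration; SPRAS's actual PCA uses library routines, and the toy matrix below is invented:

```python
# Hypothetical sketch: score pathway outputs along the first principal
# component using power iteration (not SPRAS's actual implementation).
def first_pc_scores(rows, iters=200):
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]  # center columns
    v = [1.0] * m
    for _ in range(iters):
        # Covariance-vector product without forming C: C v = X^T (X v) / n
        xv = [sum(row[j] * v[j] for j in range(m)) for row in x]
        w = [sum(x[i][j] * xv[i] for i in range(n)) / n for j in range(m)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return [sum(row[j] * v[j] for j in range(m)) for row in x]

# Rows: binary edge indicators (1 = edge present) for three toy outputs;
# the first two outputs share most edges, the third is distinct.
matrix = [[1, 1, 1, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 1]]
scores = first_pc_scores(matrix)
```

Similar outputs land close together along the component, while the distinct output lands far away, which is exactly the pattern to look for in the PCA plot.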
-PCA can help identify patterns such as clusters of similar algorithms, parameter sensitivities, or outlier outputs. +PCA may help identify patterns such as clusters of similar algorithm outputs, parameter sensitivities, or outlier outputs. -Jaccard Similarity +Jaccard similarity ^^^^^^^^^^^^^^^^^^ -SPRAS computes pairwise Jaccard similarity between pathway outputs to measure how much overlap exists between their reconstructed subnetworks. -The Jaccard index is calculated from the binary edge representation of each pathway and reflects the proportion of shared edges between two pathways relative to their total combined edges. +1. Open the Jaccard heatmap image + +In your file explorer, go to ``output/intermediate/egfr-ml/jaccard-heatmap.png`` and open it locally. + -Higher similarity values indicate that pathways share many of the same interactions, while lower values suggest distinct or divergent reconstructions. +SPRAS computes pairwise Jaccard similarity between pathway outputs to measure how much overlap exists between their reconstructed subnetworks. +The heatmap visualizes how similar the output pathways are between algorithms and their parameter settings. .. image:: ../_static/images/jaccard-heatmap.png - :alt: description of the image - :width: 500 + :alt: Jaccard heatmap of the overlap between pathway outputs + :width: 600 :align: center .. raw:: html
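The computation behind the heatmap is simple to sketch: each output is reduced to its set of edges, and every pair of outputs is scored by intersection over union. A hypothetical Python illustration (the run names and toy edges are invented):

```python
# Illustrative sketch of the pairwise Jaccard similarity behind the heatmap:
# each pathway output is reduced to its set of edges, then every pair is
# scored by intersection over union (1.0 = identical, 0.0 = disjoint).
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Three toy outputs; real keys would be dataset-algorithm-params labels.
pathways = {
    "run1": {("A", "B"), ("B", "C")},
    "run2": {("A", "B"), ("B", "C"), ("C", "D")},
    "run3": {("X", "Y")},
}
heatmap = {(p, q): jaccard(pathways[p], pathways[q])
           for p in pathways for q in pathways}
```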
-The heatmap visualizes how similar the output pathways are between algorithms and parameter settings. \ No newline at end of file +Higher similarity values indicate that pathways share many of the same edges, while lower values suggest distinct reconstructions. + + +References +----------- + +.. [1] Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. and Ishiguro-Watanabe, M.; KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672-D677 (2025). +.. [2] Kanehisa, M; Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947-1951 (2019) +.. [3] Kanehisa, M. and Goto, S.; KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000). +.. [4] Köksal AS, Beck K, Cronin DR, McKenna A, Camp AND, Srivastava S, MacGilvray ME, Bodík R, Wolf-Yadlin A, Fraenkel E, Fisher J, Gitter A. Synthesizing Signaling Pathways from Temporal Phosphoproteomic Data. Cell Rep. 2018 Sep 25;24(13):3607-3618. doi: 10.1016/j.celrep.2018.08.085. PMID: 30257219; PMCID: PMC6295338. \ No newline at end of file diff --git a/docs/tutorial/introduction.rst b/docs/tutorial/introduction.rst index bf1c1bd8f..1484a896c 100644 --- a/docs/tutorial/introduction.rst +++ b/docs/tutorial/introduction.rst @@ -20,43 +20,70 @@ Required software: - `Conda `__ : for managing environments - `Docker `__ : for containerized runs -- `Cytoscape `__ for visualizing networks (download locally, the web version will not suffice) - `Git `__: for cloning the SPRAS repository - A terminal or code editor (`VS Code `__ is recommended, but any terminal will work) +- (Optional) `Cytoscape `__ for visualizing networks (download locally, the web version will not suffice) Required knowledge: -- Basic Python skills +- Ability to run command line operations and modify YAML files. - Basic biology concepts +.. note:: + This tutorial will require downloading approximately 18.3 GB of Docker images and running many Docker containers. 
+ + SPRAS does not automatically clean up these containers or images after execution, so users will need to remove them manually if desired. + + To stop all running containers: ``docker stop $(docker ps -a -q)`` + + To remove all stopped containers: ``docker container prune`` + + To remove unused Docker images: ``docker image prune`` + + ############### SPRAS Overview ############### What is pathway reconstruction? =============================== -Pathway reconstruction is a computational approach used in biology to propose candidate biological pathways (such as signaling pathways) from high-throughput experimental data. -Curated pathway databases provide references to pathways, but they are often generalized and may not capture the context-specific details relevant to a particular disease or experimental condition. -To address this, pathway reconstruction algorithms help map molecules of interest (such as proteins, genes, or metabolites identified in omics experiments) onto large-scale interaction networks, called interactomes (graphs of molecular interactions in a cell). -The result is a customized subnetwork (pathway) that reflects the biology of the specific experiment or condition. +A pathway is a type of graph that describes how different molecules interact with one another for a biological process. -Why use pathway reconstruction? -=============================== -Pathway reconstruction algorithms allow researchers to systematically propose context-specific subnetworks without performing exhaustive experiments testing each individual interaction. -Different algorithms use distinct computational strategies and parameters, providing flexibility to highlight various aspects of the underlying biology and generate new, testable hypotheses giving researchers the flexibility to create and identify different subnetworks specific to their experimental conditions. 
+Curated pathway databases provide useful, well-studied references for pathways but are often general or incomplete. +This means they may miss context-specific details relevant to a particular condition or experiment. + +Pathway reconstruction algorithms address this by mapping molecules of interest onto large-scale interaction networks (interactomes) to generate candidate context-specific subnetworks that better reflect the high-throughput experimental data. + +These algorithms allow researchers to propose computationally backed hypothetical subnetworks that capture the unique characteristics of a given context without having to experimentally test every individual interaction. + +Running a single pathway reconstruction algorithm on a single dataset can be challenging, since each algorithm often requires its own input format, software environment, or even a full reimplementation. +These challenges only grow when scaling up to using multiple algorithms and datasets. What is SPRAS? =============== -The Signaling Pathway Reconstruction Analysis Streamliner (SPRAS) is a computational framework that unifies, standardizes, and streamlines the use of diverse pathway reconstructon algorithms. -SPRAS provides an abstraction layer for pathway reconstruction algorithms by organizing every step into a unified schema. It uses workflow management (Snakemake), containerization, and config-driven runs to build modular and interoperable pipelines that cover the entire process:
+ +Signaling Pathway Reconstruction Analysis Streamliner (SPRAS) is a computational framework that unifies and simplifies the use of diverse pathway reconstruction algorithms. + +SPRAS allows users to run multiple datasets across multiple algorithms and many parameter settings in a single scalable workflow. +The framework automatically handles data preprocessing, algorithm execution, and post-processing, allowing users to run multiple algorithms seamlessly without manual setup. +Built-in analysis tools enable users to explore, compare, and evaluate reconstructed pathways with ease. + +SPRAS is implemented in Python and leverages two technologies for workflow automation: -1. Pre-processing of data -2. Algorithm execution -3. Post-processing of results -4. Downstream analysis and evaluation +- Snakemake: a workflow management system that defines and executes jobs automatically, removing the need for users to write complex scripts +- Docker: runs algorithms and post analysis in a containerized environment. -A key strength of SPRAS is automation. From user provided input data and configurations, it can generate and execute complete workflows without requiring users to write complex scripts. This lowers the barrier to entry, allowing researchers to apply, evaluate, and compare multiple pathway reconstruction algorithms without deep computational expertise. +A key strength of SPRAS is automation. +From provided input data and configurations, SPRAS can generate and execute complete workflows without requiring users to write complex scripts. +This lowers the barrier to entry, allowing researchers to apply, evaluate, and compare multiple pathway reconstruction algorithms without deep computational expertise. -SPRAS also supports scalable analyses, making it especially valuable for a large number of datasets and systematic investigations. In addition, it provides built-in evaluation and post analysis tools that provide further insights of the algorithm outputs. 
\ No newline at end of file diff --git a/docs/tutorial/planning.txt b/docs/tutorial/planning.txt deleted file mode 100644 index f45c57ed8..000000000 --- a/docs/tutorial/planning.txt +++ /dev/null @@ -1,95 +0,0 @@ -My current plan for my COMBINE 25 tutorial (that will then be used for the spras doc tutorials) -- I will be testing this on a user that has basic python knowledge but doesn't know spras - -*Tony will be giving a presentation on PRAs and SPRAS prior (hopefully) to this tutorial - -0) -- need to preinstall conda, docker, vscode(?) (or run on terminal), git, and cytoscape prior - - a few minimum dependencies like Docker and conda and git installed already, the rest are optional but recommended -- Basic Python knowledge (running scripts, installing packages) -- Some basic biology knowledge: what is a protein, a protein interaction etc -Overview of PRAs (Pathway Reconstruction Algorithms) in plain language -What SPRAS is and what problems it solves - -My plan is to create subfolders within the output directory so that results are separated by configuration or tutorial section (right now just "basic" and "medium"). -I rely on subfolders when I’m running many different SPRAS tests across datasets or testing different things with different configs, since it helps keep everything organized which is what will be used here. -In the basic case, we will be doing caching for the egfr dataset in its own subfolder. In the medium case, we will be building a new dataset and then running on that in its own subfolder. - - -1) basic -Goal: Get new users comfortable installing SPRAS -Goal: Run SPRAS with one algorthm on a small example dataset, then re-run it with three different parameter settings to see how network structure changes. 
- -Slides and/or Information to add to docs directly: -Installation & environment setup (Docker Anaconda and Cytoscpare locally) -SPRAS directory structure for a user & configuration files -- config folder -- input folder -- output folder - - can control the creation of subfolders of outputs in the config file - - might need to be in a seperate step/slide -Running SPRAS on a provided example dataset -- set up one algortihm and run -- then run one algortihm with 3 different preset parameter settings -Understanding the outputs -- show the output and structure -- show it visulized -Viewing logs and monitoring runs - -Things to make: -- need one small dataset to run on (either make a dummy one or just use egfr) - - Building around EGFR will be the least amount of work, and that pathway may be a good fit for the COMBINE audience -- basic-config - -2) medium -Goal: Run SPRAS with a couple more algorthm with many parameter settings on a smaller and larger example dataset to see how network structure changes. -Goal: Use more of the post analysis tools -- summary stats -- ML - -Slides and/or Information to add to docs directly: -For the larger dataset, teach a user how to turn data into the input structure we require -- Formatting prize/source/target files and interactome files & explaining biological meaning -Adding multiple PRAs to the workflow -- preset parameters -Use/Show summary stats and ML code -- will be useful for parameter tuning (the next tutorial) - - -Things to make: -- medium-config -- need raw data to make the larger example - - can use panther pathways - - -3) hard -Goal: Run SPRAS with all algortihms with larger example dataset to create reproducible benchmarking experiments using gold standard data. 
-Goal: use evaluation code -Goal: learn how to parameter tune - -Slides and/or Information to add to docs directly: -Configure the config file with larger dataset -- set all Algorithms datasets, gold standards -Explain parameter tuning -Teach parameter tuning -- Define coarse parameter spaces and define hueristics -- Define fine parameter spaces after and hueristics -- repeat -- show how to use past post analysis tools help with finding parameters - -(maybe I show an example of this rather than doing it in a way the tutorial people do it) -(then people can copy and paste it in and experience it?) - -Explain parameter selection -Use the evaluation code and understanding the outputs -- Per-pathway precision and recall plots / pca chosen and pr curves for ensemble -- Where PCA-based chosen params land vs the full grid -Agreement/disagreement between algorithms (heatmaps) using ML - -Things to make: -- hard-config -- make a hard-config with tuned parameters to provide after people attempt tuning - - -Stuff I need to figure out: -# test these tutorials for mac and windows user \ No newline at end of file