dataset: synthetic from PANTHER #25
Co-Authored-By: Neha Talluri <[email protected]>
Co-Authored-By: Oliver Faulkner Anderson <[email protected]>
Co-Authored-By: Altaf Barelvi <[email protected]>
@@ -0,0 +1,100 @@
pathways = ["Apoptosis_signaling", "B_cell_activation",
Each of the files has this variable. I think we should define it only in the Snakefile and pass the list to each of the scripts that use it.
            "FGF_signaling", "Interferon_gamma_signaling",
            "JAK_STAT_signaling", "VEGF_signaling"]
# TODO: deduplicate this from thresholding scripts by passing it in?
thresholds = [1, 100, 200, 300, 400, 500, 600, 700, 800, 900]
The same applies to the thresholds.
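One way the deduplication suggested above could look, as a minimal sketch: each script accepts the pathway and threshold lists as command-line arguments instead of hard-coding them. The flag names `--pathways` and `--thresholds` and the function `parse_args` are hypothetical and not taken from this PR.

```python
import argparse


def parse_args(argv=None):
    """Parse the shared pathway and threshold lists (hypothetical flag names)."""
    parser = argparse.ArgumentParser(
        description="Sketch of a script that receives its lists from the Snakefile"
    )
    # One or more pathway names, e.g. Apoptosis_signaling B_cell_activation
    parser.add_argument("--pathways", nargs="+", required=True)
    # One or more integer thresholds, e.g. 1 100 200
    parser.add_argument("--thresholds", nargs="+", type=int, required=True)
    return parser.parse_args(argv)


# Example call with an explicit argument list (what the Snakefile would pass in).
args = parse_args(
    ["--pathways", "Apoptosis_signaling", "VEGF_signaling",
     "--thresholds", "1", "100", "200"]
)
print(args.pathways, args.thresholds)
```

With this shape, the Snakefile would be the only place where the two lists are defined, and each rule would forward them to its script on the command line.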
A question I’d appreciate feedback on: currently, we generate separate source, target, and prize files for each pathway, but we combine all pathways into each thresholded interactome. Should we also create a combined list of sources, targets, and prizes? Should we combine the gold standard as well? Or would it be better to keep separate interactomes for each individual pathway (keep it the way it is)?
We should have separate gold standards.
When this is reviewed (or before), we should test how connected the networks are after thresholding, adding back the pathway data, and removing proteins that don't have UniProt IDs.
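A connectivity test of this kind could be sketched as follows. It reports the fraction of nodes that fall in the largest connected component, assuming the interactome is an undirected edge list of protein ID pairs; the function name and the toy edges are illustrative only.

```python
from collections import defaultdict, deque


def largest_component_fraction(edges):
    """Fraction of nodes in the largest connected component of an undirected graph."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search over one component, counting its nodes.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        largest = max(largest, size)

    return largest / len(adjacency) if adjacency else 0.0


# Toy example: one component of 3 nodes and one of 2 nodes.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
print(largest_component_fraction(toy_edges))
```

Running this after each processing step (thresholding, adding pathway edges, dropping unmapped proteins) would show where connectivity is lost.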
There is also a chance we can use more PANTHER pathways; we should look at what else is available from Pathway Commons.
@oliverfanderson @ctrlaltaf For the gold standard nodes (and potentially the edges), should we exclude source, target, and prize nodes when defining it? Currently, it looks like we’re including these nodes in the gold standard for each pathway. These nodes may overlap with the gold standard, but that overlap should arise naturally rather than by construction. I’m concerned this could inflate our precision and recall metrics through a form of data leakage.
Plan: keep all of them in the gold standard, but update the evaluation code to handle the sources/targets/prizes being in the gold standard, and show them as a separate baseline in which they are all set to frequency 1.0.
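A rough sketch of what such a baseline could look like, assuming node-level evaluation against a set of gold standard nodes. The helper names `input_baseline` and `precision_recall` are hypothetical and are not the project's evaluation code.

```python
def input_baseline(sources, targets, prizes):
    """Node frequencies for a baseline that contains only the input nodes."""
    inputs = set(sources) | set(targets) | set(prizes)
    return {node: 1.0 for node in inputs}


def precision_recall(predicted_nodes, gold_nodes):
    """Node-level precision and recall of a prediction against the gold standard."""
    predicted, gold = set(predicted_nodes), set(gold_nodes)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example: three input nodes, all of which are in a six-node gold standard.
baseline = input_baseline(sources=["A"], targets=["B"], prizes=["A", "C"])
gold = {"A", "B", "C", "D", "E", "F"}
print(precision_recall(baseline, gold))
```

Reporting this baseline next to each algorithm's score makes it visible how much of the precision and recall comes from the inputs alone.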
Should we also consider how sparse an interactome becomes after applying a threshold to the STRING interactome? When we filter by size, we are implicitly accounting for the decrease in graph density as well. Would it make more sense to treat size and density as separate variables when evaluating performance? Then again, does testing for density even matter in this context? Are there any interactomes that aren’t already highly connected? I’m thinking we should first threshold the interactomes, then select only those that are highly connected (e.g., density ≥ 0.85). From that subset, we could choose a few to represent different size scales.
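For reference, the density of an undirected simple graph with n nodes and m edges is 2m / (n(n - 1)). A minimal sketch of computing it for a thresholded edge list follows, assuming undirected edges and ignoring self-loops and duplicates; the function name is illustrative.

```python
def graph_density(edges):
    """Density 2m / (n(n - 1)) of an undirected simple graph given as an edge list."""
    # Collapse (u, v) and (v, u) into one edge and drop self-loops.
    unique_edges = {frozenset((u, v)) for u, v in edges if u != v}
    nodes = set()
    for edge in unique_edges:
        nodes |= edge
    n, m = len(nodes), len(unique_edges)
    if n < 2:
        return 0.0
    return 2 * m / (n * (n - 1))


# Toy examples: a triangle is fully dense, a three-node path is not.
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
path = [("A", "B"), ("B", "C")]
print(graph_density(triangle), graph_density(path))
```

Computing this value separately from the node and edge counts would let size and density be reported as two distinct variables.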
I will be updating how we create interactomes for the PANTHER pathways dataset. Current: New: For example, in the STRING interaction networks, when using only physical interactions and experimental edge scores, we could aim to keep 25% of all edges.
Now we will construct new interactomes by removing X% of edges and then adding all edges from all chosen PANTHER pathways. We will only keep downsampled interactomes that satisfy specified properties for a given set of sources and targets. Proposed brute-force method for PANTHER pathways interactomes:
Randomly remove X% of the edges from the full STRING interactome
Verify that the new network maintains the following properties:
If the properties above are not satisfied, repeat the process with a different random sample.
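The brute-force procedure above can be sketched as a rejection-sampling loop. The required properties are not listed here, so the sketch takes them as a caller-supplied predicate, `satisfies_properties`. That name, `max_attempts`, and the toy data are all assumptions for illustration.

```python
import random


def downsample_interactome(string_edges, pathway_edges, fraction_removed,
                           satisfies_properties, max_attempts=100, seed=None):
    """Remove a random fraction of STRING edges, add back all pathway edges,
    and retry until the caller-supplied property check passes."""
    rng = random.Random(seed)
    string_edges = sorted(set(string_edges))
    n_keep = len(string_edges) - round(fraction_removed * len(string_edges))

    for _ in range(max_attempts):
        kept = set(rng.sample(string_edges, n_keep))
        candidate = kept | set(pathway_edges)
        if satisfies_properties(candidate):
            return candidate
    return None  # no acceptable sample found within max_attempts


# Toy example with a placeholder property: the pathway edge must be present.
string_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
pathway_edges = [("A", "E")]
result = downsample_interactome(
    string_edges, pathway_edges, fraction_removed=0.5,
    satisfies_properties=lambda edges: ("A", "E") in edges, seed=0,
)
print(len(result))
```

Capping the number of attempts avoids an endless loop when the requested fraction makes the properties impossible to satisfy.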
We plan to use this dataset for all of the evaluations. I considered whether we need to use all of the pathways, and I don't think we do. I decided on a few that we can use: Balanced Skewed Tiny When making the interactomes, I want to add all of these pathways to the thresholded interactomes and uphold the properties above. I need to double-check whether using any of these will break the rules for pilot data/runs, but since we are making a new dataset that wasn't used for my thesis, I think we will be okay.
This does not add anything to config/*.yaml.