General changes to generate negative samples #1
I wrote this quickly before I left Friday afternoon and forgot that I hadn't submitted the pull request, sorry.
We talked about making a more general solution possible, so I believe the changes should allow for both the original functionality and the functionality that I need in BEELINE. I also bumped the Python dependencies down to versions BEELINE would allow. I'm not sure whether this breaks anything else; to my knowledge it doesn't, based on some limited testing of the splitting, verification, and negative sample generation. I removed the parameters about a graph being undirected and the source column, but can put those back in if you're planning to implement that as a feature.
Also, the changes fix (what I believe to be) a minor bug in the random selection of negative samples. Specifically, two edges with exactly the same target set will always be assigned the same negative targets, because the sampling uses a fixed seed (for reproducibility) that never changes. For example, if TFs a and b each occur once in a dataset and both target gene c, the negative samples generated will always match: (a, random) = (b, random). This is unlikely to be a problem in most datasets, but I changed the seed for each gene pair iterated over. The results are still reproducible; the change should just ensure "randomness".
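To illustrate the seed issue, here's a minimal sketch. It is not the code in this PR: `sample_negatives`, `vary_seed`, and the `seed + i` offset are hypothetical names and choices made up for the example, and the actual per-pair seeding in the change may differ.

```python
import random


def sample_negatives(edges, genes, seed=42, vary_seed=True):
    """Return one negative (tf, gene) pair per positive edge.

    edges: list of (tf, target) tuples; genes: candidate target genes.
    All names here are hypothetical, for illustration only.
    """
    targets_by_tf = {}
    for tf, target in edges:
        targets_by_tf.setdefault(tf, set()).add(target)

    negatives = []
    for i, (tf, _target) in enumerate(edges):
        # Old behaviour: the same seed is reused for every edge.
        # New behaviour (assumed scheme): offset the seed by the edge index.
        rng = random.Random(seed + i if vary_seed else seed)
        # Sort so the candidate order does not depend on set iteration order.
        candidates = sorted(g for g in genes if g not in targets_by_tf[tf])
        negatives.append((tf, rng.choice(candidates)))
    return negatives


genes = [f"g{n}" for n in range(50)]
edges = [(f"tf{n}", "g0") for n in range(20)]  # 20 TFs, identical target set

fixed = sample_negatives(edges, genes, vary_seed=False)
varied = sample_negatives(edges, genes, vary_seed=True)
```

With `vary_seed=False`, every TF in this toy input ends up with the same negative target, since each draw starts from an identical generator state over an identical candidate list. With `vary_seed=True`, the draws differ across edges but repeat exactly from run to run, so reproducibility is kept.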
Let me know if this works; I can make any changes. I was also thinking it may be a good idea to set up automatic publishing to PyPI with a GitHub Action, and I can look into doing that if you think it would be easier to maintain. Thanks for the help with these scripts!
Tim