@tristan-f-r
Contributor

tristan-f-r commented Aug 23, 2025

I am having some trouble with replication. A few questions:

  • Is the script for generating phosphosite-irefindex13.0-uniprot.txt still around? I'm having trouble even coming close to reproducing that file. (Note that while iRefIndex is down, a lot of its data is still available on the Wayback Machine. This may be a good candidate for storing in OSDF, as mentioned in the HIV PR.)
  • I cannot replicate the rounding procedure for prizes. The differences almost look like floating-point errors. Were any special trunc function calls used for these prize files?

I'm currently reverse-engineering the peptide-mapping.tsv file.

tristan-f-r added the "dataset" label (Mutating datasets in any way.) on Aug 23, 2025
tristan-f-r changed the title from "dataset: egfr" to "dataset: EGFR" on Aug 23, 2025
@agitter
Collaborator

agitter commented Aug 23, 2025

I think this pull request is out of scope for SPRAS. SPRAS can take the TPS paper's processing of the original files as given and use the processed files hosted in its supplementary datasets or GitHub repository. Recreating those pipelines creates a lot of extra work for us.

Some of the TPS analysis was done in Scala, so that could explain floating-point differences.
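As a rough illustration of how last-digit differences like these can arise, here is a minimal Python sketch. It is not the TPS code: the value 2.675 and the helper names truncate and round_half_up are hypothetical, and the original Scala rounding behavior is not known from this thread. It only shows that rounding the binary double, truncating, and rounding the decimal text can disagree on the same input.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def truncate(x: float, digits: int) -> float:
    """Drop everything past `digits` decimal places without rounding."""
    factor = 10 ** digits
    return math.trunc(x * factor) / factor


def round_half_up(x: float, digits: int) -> float:
    """Round the decimal text of x, sending ties away from zero."""
    quantum = Decimal(1).scaleb(-digits)  # 0.01 when digits == 2
    return float(Decimal(str(x)).quantize(quantum, rounding=ROUND_HALF_UP))


# Hypothetical prize value. The double closest to 2.675 is slightly
# below 2.675, so the three procedures do not all agree.
value = 2.675

print(round(value, 2))          # 2.67 (rounds the stored binary value)
print(truncate(value, 2))       # 2.67
print(round_half_up(value, 2))  # 2.68 (rounds the decimal text "2.675")
```

If the prize files were written by a tool that rounds the decimal text and the replication rounds the binary value (or vice versa), mismatches of this kind would show up only on a subset of rows, which is consistent with differences that "almost look like floating point errors".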

phosphosite-irefindex13.0-uniprot.txt was created ~12 years ago in the Fraenkel lab at MIT. That may indicate we should stop using it for SPRAS. I have archives of some (all?) of the scripts used for network processing at that time. I pushed them to a private GitHub repo for preservation. However, I don't want to make it public right now because I don't have permission from the authors and haven't tracked down licensing terms for all of the data files in that repo.

@tristan-f-r
Contributor Author

This being out of scope is what I suspected. I was hoping to get enough scripts to update the data using more recent data sources, but the phosphosite-irefindex PPI would have been the most affected file.

@agitter
Collaborator

agitter commented Aug 23, 2025

I added a note in my lab's fork of the TPS repo, which has more recent activity than the upstream copy, about the origin of the network to help ensure I don't forget: gitter-lab/tps#9

@agitter
Collaborator

agitter commented Oct 10, 2025

@ntalluri had questions about the data normalization and statistical testing for the phosphoproteomics data in the EGFR dataset. I added scripts and details about that in gitter-lab/tps#10.

However, even with that as a reference, her attempt to reanalyze the data with Python in 2025 still gives different results than the original analysis in R in 2014. That may be expected due to differences in languages and statistical packages.

@tristan-f-r
Contributor Author

What are the differences? If they're small, it could be the floating-point error mentioned above.

@agitter
Collaborator

agitter commented Oct 10, 2025

Neha can correct me if needed, but my understanding is that it was more fundamental. The R version of the Tukey test was fitting an ANOVA model first and the Python version was not. The statistical test itself in the available packages was implemented differently.
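To make the distinction concrete, here is a small sketch of why fitting an ANOVA model first can change a Tukey-style statistic. It is an illustration under stated assumptions, not the 2014 R analysis or the 2025 Python reanalysis: the data and the helper names (pooled_variance, std_error, q_statistic) are hypothetical, and the thread does not name the Python package that was used. In R, TukeyHSD operates on a fitted aov model, so each pairwise comparison uses the error variance pooled across all groups. A procedure that looks only at the two groups being compared estimates the variance from those two groups alone.

```python
from statistics import mean


def sum_sq_within(group):
    """Sum of squared deviations from the group's own mean."""
    m = mean(group)
    return sum((x - m) ** 2 for x in group)


def pooled_variance(groups):
    """Within-group mean square; with all groups this is the ANOVA MSE."""
    df = sum(len(g) for g in groups) - len(groups)
    return sum(sum_sq_within(g) for g in groups) / df


def std_error(a, b, variance):
    """Tukey-Kramer standard error for mean(a) - mean(b)."""
    return (variance / 2 * (1 / len(a) + 1 / len(b))) ** 0.5


def q_statistic(a, b, variance):
    """Studentized range statistic for one pair under a given variance."""
    return abs(mean(a) - mean(b)) / std_error(a, b, variance)


# Hypothetical intensities at three time points; t4 is much noisier.
t0 = [1.0, 1.2, 0.9]
t2 = [1.8, 2.1, 2.0]
t4 = [3.0, 5.0, 1.5]

mse_all = pooled_variance([t0, t2, t4])  # ANOVA fit over every group
mse_pair = pooled_variance([t0, t2])     # only the two groups compared

print(q_statistic(t0, t2, mse_all))
print(q_statistic(t0, t2, mse_pair))
```

With these made-up numbers, the noisy t4 group inflates the pooled variance, so the same t0 versus t2 comparison yields a much smaller statistic when the ANOVA model is fit first. A difference of this kind is structural and would not shrink with more careful floating-point handling.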
