Conversation

@tristan-f-r
Collaborator

@tristan-f-r tristan-f-r commented Jul 1, 2025

@tristan-f-r tristan-f-r added the refactor Changes that don't actually improve anything except for code quality. label Jul 1, 2025
@ntalluri
Collaborator

ntalluri commented Jul 2, 2025

Do you have an example of this (the image) that you could add as a comment?

@ntalluri
Collaborator

ntalluri commented Jul 2, 2025

Also, could you mention what computer you are using? (There is something weird with Windows 10 and scikit-learn that we were dealing with two years ago but never fixed: #135.)

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 2, 2025

NixOS 25.05.20250108.bffc22e (Warbler) on x86_64, though I do think this issue reproduces on CI, so I doubt it is platform-dependent. I'll reproduce it.

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 2, 2025

@ntalluri

spras/test/ml/test_ml.py

Lines 83 to 95 in 54bd449

def test_pca_robustness(self):
    dataframe = ml.summarize_networks([INPUT_DIR + 'test-data-s1/s1.txt', INPUT_DIR + 'test-data-s2/s2.txt', INPUT_DIR + 'test-data-s3/s3.txt'])
    expected = pd.read_table(EXPECT_DIR + 'expected-pca-coordinates.tsv')
    expected = expected.round(5)
    for _ in range(5):
        dataframe_shuffled = dataframe.sample(frac=1, axis=1)  # permute the columns
        ml.pca(dataframe_shuffled, OUT_DIR + 'pca-shuffled-columns.png', OUT_DIR + 'pca-shuffled-columns-variance.txt',
               OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
        coord = pd.read_table(OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
        coord = coord.round(5)  # round values to 5 digits to account for numeric differences across machines
        coord.sort_values(by='algorithm', ignore_index=True, inplace=True)
        assert coord.equals(expected)

Coord:
      algorithm      PC1      PC2
0  test-data-s1 -2.00665 -0.98659
1  test-data-s2 -1.52765  1.07995
2  test-data-s3  3.53430 -0.09336

Expected:
      algorithm      PC1      PC2
0  test-data-s1  2.00665 -0.98659
1  test-data-s2  1.52765  1.07995
2  test-data-s3 -3.53430 -0.09336

@ntalluri
Collaborator

ntalluri commented Jul 3, 2025

Based on what I'm reading, bumping the version is a good idea because it improves PCA in a meaningful way. Before, in version 1.2, PCA signs were chosen by looking at how the data looked after projection (the transformed data), but now they are chosen by inspecting each component vector directly. Previously, each solver decided the signs in its own way, so changing the solver could cause components to unexpectedly flip signs. The mathematical results were still correct, but this caused confusion. The new approach in version 1.7 makes the sign choice solver-independent because it relies only on the component vector's own values, so we always get consistent component signs no matter which solver is used.

We don't allow a user to pick a solver; we have it set to auto at the moment.
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA:~:text=svd_solver%7B%E2%80%98auto%E2%80%99%2C%20%E2%80%98full,optionally%20truncated%20afterwards.
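To illustrate the component-based convention described above, here is a minimal sketch (not scikit-learn's exact internals; `flip_components` is a hypothetical helper): each component row is flipped so that its largest-magnitude entry is positive, which makes the sign choice depend only on the component vector itself, not on which solver produced it.

```python
import numpy as np

def flip_components(components):
    """Canonicalize component signs: flip each row so that its
    largest-magnitude entry is positive. This mirrors the
    component-based convention described above."""
    # index of the largest-|value| entry in each component row
    max_idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), max_idx])
    return components * signs[:, np.newaxis]

# Two sign-flipped versions of the same components collapse to one form
a = np.array([[0.6, -0.8], [0.8, 0.6]])
b = -a  # what a different solver might have returned
print(np.allclose(flip_components(a), flip_components(b)))  # True
```

Under this convention, any solver returning `a` or `-a` ends up with identical canonical components.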

tristan-f-r and others added 2 commits July 3, 2025 09:42
untested; need to open my devcontainer
@tristan-f-r
Collaborator Author

Let me know if this commit is the right way to approach that. To my understanding, this does mess with our sampling test because our component vectors differ per run; hopefully, the aforementioned commit addresses that.

@tristan-f-r
Collaborator Author

Blocked by #265 - we should discuss removing GraphSpace support in today's meeting.

@ntalluri
Collaborator

ntalluri commented Jul 3, 2025

Let me know if this commit is the right way to approach that. To my understanding, this does mess with our sampling test because our component vectors differ per run; hopefully, the aforementioned commit addresses that.

For that specific test, we want the output to be the same despite shuffling the rows and columns. So having 2 different outputs is not what we would want.

@tristan-f-r
Collaborator Author

For that specific test, we want the output to be the same despite shuffling the rows and columns. So having 2 different outputs is not what we would want.

Hm, do we roll back scikit-learn then? The PCA sign convention now depends on the content, and it seems that permutations count as a mutation of the content.

@ntalluri
Collaborator

ntalluri commented Jul 7, 2025

Maybe we need to rethink the test then. Is it okay if the values are the same but the signs change? Or should the output always be exactly the same no matter the data? (Just some questions to ask.) I will think through what we can do for this test case.

@tristan-f-r tristan-f-r marked this pull request as ready for review July 8, 2025 20:11
@agitter
Collaborator

agitter commented Jul 13, 2025

Is it okay the values are the same but the signs change?

Can you help me get caught up on the core question? It seems valid to have a transformation of the PCA solution where the signs are flipped.

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 13, 2025

Can you help me get caught up on the core question? It seems valid to have a transformation of the PCA solution where the signs are flipped.

The essence of the test that was broken here was column shuffling preserving the PCA output:

dataframe_shuffled = dataframe.sample(frac=1, axis=1)  # permute the columns
ml.pca(dataframe_shuffled, OUT_DIR + 'pca-shuffled-columns.png', OUT_DIR + 'pca-shuffled-columns-variance.txt',
    OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
coord = pd.read_table(OUT_DIR + 'pca-shuffled-columns-coordinates.tsv')
coord = coord.round(5)  # round values to 5 digits to account for numeric differences across machines
coord.sort_values(by='algorithm', ignore_index=True, inplace=True)

This now needs an assertion that accepts either of the two possible sign choices:

assert coord.equals(expected) or coord.equals(expected_other)
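Alternatively, rather than enumerating every expected signing, a sign-tolerant comparison could check each principal-component column both as-is and negated. A rough sketch (the helper name and toy frames below are hypothetical, not the project's code):

```python
import numpy as np
import pandas as pd

def assert_pca_equal_up_to_sign(coord, expected, atol=1e-5):
    """Compare PCA coordinate frames, allowing each PC column to be
    globally sign-flipped. Assumes both frames share an 'algorithm'
    column and are sorted identically."""
    for pc in [c for c in expected.columns if c != 'algorithm']:
        same = np.allclose(coord[pc], expected[pc], atol=atol)
        flipped = np.allclose(coord[pc], -expected[pc], atol=atol)
        assert same or flipped, f"{pc} differs beyond a sign flip"

coord = pd.DataFrame({'algorithm': ['s1', 's2'], 'PC1': [-2.0, -1.5], 'PC2': [1.0, -0.5]})
expected = pd.DataFrame({'algorithm': ['s1', 's2'], 'PC1': [2.0, 1.5], 'PC2': [1.0, -0.5]})
assert_pca_equal_up_to_sign(coord, expected)  # passes: PC1 is flipped, PC2 matches
```

This sidesteps maintaining a second expected file at the cost of a slightly weaker assertion.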

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 18, 2025

Since this is just a standard dependency bump, this seems okay to merge. I do want one last review to confirm that we're okay with this change to the PCA output signs. It will be reviewed on Monday.

@tristan-f-r tristan-f-r requested a review from ntalluri July 18, 2025 20:25
@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 21, 2025

@tristan-f-r Which PR was it? The ensembling one?

I think so? I'll take a closer look at the blame to find the exact commit, if that's helpful. (Regardless, we probably should stop committing artifacts to main anyway; this causes problems. See #320 for an example of a large test refactor caused by a related issue.)

@ntalluri
Collaborator

If it is the ensembling one, there was a data.pickle I added that contains a toy dataset network needed for ensembling in general, which we use to ensure we calculate recall correctly. I'm not sure why that would need to be updated (if it is this pickled dataset), because it has nothing to do with the PCA.

@agitter
Collaborator

agitter commented Jul 21, 2025

Regardless, we probably should stop committing artifacts to main anyway - this causes problems; see #320 for an example of a large test refactor because of a related issue

This seems worthwhile to discuss in a new issue as a testing strategy change if it affects multiple tests.

@tristan-f-r
Collaborator Author

#339

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 21, 2025

@tristan-f-r Which PR was it? The ensembling one?

👍 It was #212 (origin commit). I'll prepare another PR to make that test follow #339.

@tristan-f-r
Collaborator Author

If it is the ensembling one, there was a data.pickle I added that has a toy dataset network that is needed for ensembling in general that we use to ensure we calculate recall correct. I'm not sure why that would need to be updated (if it is this pickled dataset) because it has nothing to do with the pca.

Pickled objects contain internal representations of the object and are subject to change across package versions. In this case, pandas itself changed.
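One way to avoid this class of breakage for small fixtures is to store them as plain text instead of pickles; text round-trips cleanly across pandas releases. A sketch under assumed file names and columns (both are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Hypothetical fixture: a toy network as an edge list
network = pd.DataFrame({'source': ['A', 'B'],
                        'target': ['B', 'C'],
                        'weight': [0.9, 0.4]})

path = os.path.join(tempfile.mkdtemp(), 'toy-network.tsv')

# Instead of network.to_pickle('data.pickle'), which couples the file to the
# pandas internals of whichever version wrote it, store tab-separated text:
network.to_csv(path, sep='\t', index=False)

# Reading it back does not depend on pandas internals
roundtrip = pd.read_table(path)
print(roundtrip.equals(network))  # True
```

The trade-off is that dtypes are re-inferred on read, so fixtures with unusual dtypes may need an explicit `dtype=` argument.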

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 21, 2025

#340 should fix this.

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 23, 2025

@ntalluri there seems to be a sorting issue with the new commits that makes the PCA test flaky (see this commit, which addresses it) - the tests only worked occasionally on my machine before that commit.

@tristan-f-r tristan-f-r requested a review from ntalluri July 23, 2025 18:23
Collaborator

@ntalluri ntalluri left a comment


Code looks good to me. However, I am going to test the environment updates locally before approving.

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 23, 2025

By the way, some dependencies may be a little outdated (e.g., this PR outlived a pandas release cycle). I can update them, but this PR was mostly intended to resolve the dependency errors that were causing me problems with pixi.

@tristan-f-r
Collaborator Author

tristan-f-r commented Jul 24, 2025

Okay - sorry about this terribly long hill of a PR for how small it is. That should be the last commit which fixes the tests from the latest merge 👍

@tristan-f-r tristan-f-r requested a review from ntalluri July 24, 2025 19:29
Collaborator

@agitter agitter left a comment


I ran the TestML tests locally and they passed

@tristan-f-r tristan-f-r dismissed ntalluri’s stale review July 25, 2025 15:42

pre-commit is too outdated to run typos, we need to bump pre-commit

@tristan-f-r tristan-f-r merged commit e899da8 into Reed-CompBio:main Jul 25, 2025
15 checks passed
@tristan-f-r tristan-f-r deleted the dep-bump branch July 25, 2025 15:42

Development

Successfully merging this pull request may close these issues.

Different dependency version matching between conda and python
Clustering warning in sklearn version 1.2