Computed features vs PC with new AnnData format #342
srivarra
left a comment
My main question is: do we want to make full use of AnnData, or is it necessary to save the intermediate features .csv?
We can work directly with AnnData.obs and use AnnData.to_df to convert X to a Dataframe.
https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.to_df.html
The part I am still struggling with is matching the annotations and the embedding features on the combination of 'track_id', 'fov_name' and 't'. I need a DataFrame for this step because I sort and remove rows based on the match.
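One way the matching step could look in pandas, as a sketch only: an inner merge on the three keys keeps only rows present in both tables, and sorting afterwards gives a stable order. The toy tables and the `infection_state` column are hypothetical and not taken from the script.

```python
import pandas as pd

KEYS = ["fov_name", "track_id", "t"]

# Hypothetical toy tables; column names other than KEYS are illustrative
embeddings = pd.DataFrame({
    "fov_name": ["/C/2/000000", "/C/2/000000", "/C/2/000001"],
    "track_id": [1, 1, 4],
    "t": [0, 1, 0],
    "feature_0": [0.1, 0.2, 0.3],
})
annotations = pd.DataFrame({
    "fov_name": ["/C/2/000000", "/C/2/000001", "/C/2/000001"],
    "track_id": [1, 4, 9],
    "t": [1, 0, 0],
    "infection_state": ["infected", "uninfected", "infected"],
})

# Inner merge drops unmatched rows on both sides;
# validate raises if a key combination is duplicated in either table
matched = (
    embeddings.merge(annotations, on=KEYS, how="inner", validate="one_to_one")
    .sort_values(KEYS)
    .reset_index(drop=True)
)
print(matched)
```

The same merge would work on `adata.obs` directly, since `obs` is already a DataFrame.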
    output_file = '/hpc/projects/intracellular_dashboard/organelle_dynamics/2025_07_22_A549_SEC61_TOMM20_G3BP1_ZIKV/4-phenotyping/predictions/quantify_remodeling/G3BP1/feature_list_G3BP1_2025_07_22_192patch.csv'

    # Write to CSV - append if file exists, create new if it doesn't
    position_df.to_csv(output_file, mode='a', header=not os.path.exists(output_file), index=False)
    feature_values = features.filter(like="feature_")

    # compute the PCA features
    pca = PCA(n_components=10)
Are we recomputing the PCA features here? Should we instead put these AnnData-oriented ones in dimensionality_reduction.py?
Also, is there any functionality from feature.py that we could reuse?
    correlation_df = pd.DataFrame(pca_features, columns=[f"PCA{i+1}" for i in range(pca_features.shape[1])], index=features.index)

    # get the computed features like 'contrast', 'homogeneity', 'energy', 'correlation', 'edge_density', 'organelle_volume', 'organelle_intensity', 'no_organelles', 'size_organelles'
    image_features_df = features.filter(regex="contrast|homogeneity|energy|correlation|edge_density|organelle_volume|organelle_intensity|no_organelles|size_organelles").copy()
    # Rename columns to avoid conflicts during merge
    # Rename 't' in features_df_filtered to 'time_point' to match computed_features
    features_df_filtered = features_df_filtered.rename(columns={"t": "time_point"})
This is left over from the older computations, when we chose our own column headers. We have since converged on 't'. I can recompute the computed features with a 't' column to fix this.
Instead of recomputing the computed features, could you just rename the column?
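For example, a rename at load time would avoid the recomputation. This is a sketch: the one-row table below is a hypothetical stand-in for the previously saved computed features.

```python
import pandas as pd

# Hypothetical stand-in for the previously computed feature table
computed_features = pd.DataFrame({
    "fov_name": ["/C/2/000000"],
    "track_id": [1],
    "time_point": [0],
    "edge_density": [0.42],
})

# Align the legacy header with the 't' convention; the values are untouched
computed_features = computed_features.rename(columns={"time_point": "t"})
print(computed_features.columns.tolist())
```

With both tables using 't', the reverse rename on the embedding side is no longer needed before merging.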
    cell_features = {
        'fov_name': '/'+well_id+'/'+pos_id,
        'track_id': row['track_id'],
        'time_point': timepoint,
    # Select columns (features) in the desired order
    feature_order = ["edge_density", "correlation", "energy", "homogeneity", "contrast", "no_organelles", "organelle_volume", "organelle_intensity"]
    # Filter to only include features that actually exist in the dataframe
    feature_order_filtered = [f for f in feature_order if f in correlation_selected.columns]
Do we need to check that they exist if they are already fixed features defined earlier in the script?
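Whether the check is needed depends on whether the feature list is really fixed. A rough sketch of both behaviours, using a hypothetical helper `select_features` that is not part of the script: strict mode fails loudly on a missing column, lenient mode silently keeps whatever is present.

```python
import pandas as pd

FEATURE_ORDER = ["edge_density", "correlation", "energy"]


def select_features(df, feature_order, strict=True):
    """Return the columns of df listed in feature_order, in that order.

    Hypothetical helper: strict=True raises if any feature is missing,
    strict=False drops missing features without complaint.
    """
    missing = [f for f in feature_order if f not in df.columns]
    if missing and strict:
        raise KeyError(f"Missing expected features: {missing}")
    return df[[f for f in feature_order if f in df.columns]]


# Toy table that lacks 'correlation'
df = pd.DataFrame({"energy": [0.3], "edge_density": [0.1]})

lenient = select_features(df, FEATURE_ORDER, strict=False)
print(lenient.columns.tolist())
```

Strict mode suits a fixed feature set, since a typo or a stale CSV surfaces immediately. Lenient mode matches the current filtering behaviour.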
    features_organelle,
    "/hpc/projects/comp.micro/infected_cell_imaging/Single_cell_phenotyping/ContrastiveLearning/Figure_panels/cell_division/PC_vs_CF_2chan_pca_organelle_multiwell.svg",
    computed_features_df,
    "/hpc/projects/intracellular_dashboard/organelle_dynamics/2025_07_22_A549_SEC61_TOMM20_G3BP1_ZIKV/4-phenotyping/predictions/quantify_remodeling/G3BP1/PC_vs_CF_organelle_wellC2_160patch.svg",
@srivarra, the computed feature set depends on the organelle data. For now I keep changing the set as I work with different organelles. I think a tool like CellProfiler can compute a 1000-feature list, which can then be filtered down to the most significant features. I haven't implemented anything like that yet. Once that is in place the list will be stable enough to add to AnnData.
Ah gotcha, so keep the

The code now reads from the zarr stores in the new AnnData format. @srivarra, can you let me know if there is a better way to do this?
@edyoshikun, I computed the image features outside the library because I used the G3BP1 segmentations for the image feature computation.