
Conversation

@Soorya19Pradeep (Contributor)

The code is modified to read from the zarr stores with the new AnnData format. @srivarra, can you let me know if there is a better way to do this?

@edyoshikun , I have computed the image features outside the library as I used the segmentations of G3BP1 for the image feature computation.

@srivarra (Contributor) left a comment

My main question is: do we want to make full use of AnnData, or is it necessary to save the intermediate features to a .csv?

@srivarra (Contributor) · Nov 25, 2025

issue(blocking)
We can work directly with AnnData.obs and use AnnData.to_df to convert X to a DataFrame.

https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.to_df.html
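
A minimal sketch (not the repo's API) of working with the embeddings AnnData directly instead of an intermediate CSV; the zarr path and the obs column names ('fov_name', 'track_id', 't') are assumptions based on this thread:

import anndata as ad

adata = ad.read_zarr("embeddings.zarr")  # hypothetical path
embeddings_df = adata.to_df()  # X as a DataFrame, indexed like obs
annotations = adata.obs[["fov_name", "track_id", "t"]]
# join on the shared index instead of round-tripping through CSVs
combined = annotations.join(embeddings_df)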

@Soorya19Pradeep (Contributor Author)

The part I am still struggling with is matching the annotations and embedding features based on the combination of 'track_id', 'fov_name', and 't'. I have to work with a DataFrame for this step, since I sort and remove rows based on the match.
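
A rough sketch of that matching step, assuming both tables carry 'fov_name', 'track_id', and 't' columns (names taken from this thread; the DataFrame variables are hypothetical):

import pandas as pd

matched = pd.merge(
    annotations_df,  # manual annotations
    embeddings_df,   # embedding features converted with AnnData.to_df
    on=["fov_name", "track_id", "t"],
    how="inner",     # drops rows without a match
)
matched = matched.sort_values(["fov_name", "track_id", "t"]).reset_index(drop=True)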

output_file = '/hpc/projects/intracellular_dashboard/organelle_dynamics/2025_07_22_A549_SEC61_TOMM20_G3BP1_ZIKV/4-phenotyping/predictions/quantify_remodeling/G3BP1/feature_list_G3BP1_2025_07_22_192patch.csv'

# Write to CSV - append if file exists, create new if it doesn't
position_df.to_csv(output_file, mode='a', header=not os.path.exists(output_file), index=False)
@srivarra (Contributor)

question(blocking)

Why do we need to write out an intermediate CSV? We can directly append to the AnnData object (assuming it exists). Should we assume the AnnData object already exists before users start computing image features?
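
A hedged sketch of attaching the computed per-cell features to an existing AnnData instead of appending to a CSV; 'adata', the output path, and the key columns are assumptions drawn from this thread:

import pandas as pd

keys = ["fov_name", "track_id", "t"]
obs_index = pd.MultiIndex.from_frame(adata.obs[keys])
aligned = position_df.set_index(keys).reindex(obs_index)  # align rows to adata.obs order
for col in aligned.columns:
    adata.obs[col] = aligned[col].to_numpy()  # store features alongside obs
adata.write_zarr("embeddings_with_features.zarr")  # hypothetical output path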

feature_values = features.filter(like="feature_")

# compute the PCA features
pca = PCA(n_components=10)
@srivarra (Contributor)

question(blocking)

Are we recomputing the PCA features here? Should we instead move these AnnData-oriented ones into dimensionality_reduction.py?

Also, is there any functionality from feature.py that we could reuse?
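
For reference, a sketch of keeping the PCA step with the other dimensionality-reduction code and storing the result on the AnnData object rather than in a separate DataFrame (the obsm key is an assumption, and adata.X is assumed to be a dense embedding matrix):

from sklearn.decomposition import PCA

pca = PCA(n_components=10)
adata.obsm["X_pca_features"] = pca.fit_transform(adata.X)  # key name is an assumption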

correlation_df = pd.DataFrame(pca_features, columns=[f"PCA{i+1}" for i in range(pca_features.shape[1])], index=features.index)

# get the computed features like 'contrast', 'homogeneity', 'energy', 'correlation', 'edge_density', 'organelle_volume', 'organelle_intensity', 'no_organelles', 'size_organelles'
image_features_df = features.filter(regex="contrast|homogeneity|energy|correlation|edge_density|organelle_volume|organelle_intensity|no_organelles|size_organelles").copy()
@srivarra (Contributor)

thought(if-minor)
If the features are fixed, as in compute_image_features, then I don't think we need a regex, right? We can just select those columns directly.
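
For example, plain column selection instead of the regex filter (same columns as above):

image_feature_cols = [
    "contrast", "homogeneity", "energy", "correlation", "edge_density",
    "organelle_volume", "organelle_intensity", "no_organelles", "size_organelles",
]
image_features_df = features[image_feature_cols].copy()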

Comment on lines +49 to +51
# Rename columns to avoid conflicts during merge
# Rename 't' in features_df_filtered to 'time_point' to match computed_features
features_df_filtered = features_df_filtered.rename(columns={"t": "time_point"})
@srivarra (Contributor)

nitpick(non-blocking)
We should instead change compute_image_features to use t.

@Soorya19Pradeep (Contributor Author)

This was an issue from the older computations, when we created headers of our own choice. We have converged on using 't' from now on. I can redo the computed features to have a 't' column to solve this.

@srivarra (Contributor)

Instead of recomputing the computed features, could you just rename the column?
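
For example, a one-line rename of the already-computed table rather than rerunning compute_image_features (computed_features_df is the name used later in this thread):

computed_features_df = computed_features_df.rename(columns={"time_point": "t"})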

cell_features = {
'fov_name': '/'+well_id+'/'+pos_id,
'track_id': row['track_id'],
'time_point': timepoint,
@srivarra (Contributor)

nitpick(non-blocking)
We should use t instead of time_point to match the rest of the codebase.

# Select columns (features) in the desired order
feature_order = ["edge_density", "correlation", "energy", "homogeneity", "contrast", "no_organelles", "organelle_volume", "organelle_intensity"]
# Filter to only include features that actually exist in the dataframe
feature_order_filtered = [f for f in feature_order if f in correlation_selected.columns]
@srivarra (Contributor)

Do we need to check their existence if they are already fixed features from earlier in the script?

features_organelle,
"/hpc/projects/comp.micro/infected_cell_imaging/Single_cell_phenotyping/ContrastiveLearning/Figure_panels/cell_division/PC_vs_CF_2chan_pca_organelle_multiwell.svg",
computed_features_df,
"/hpc/projects/intracellular_dashboard/organelle_dynamics/2025_07_22_A549_SEC61_TOMM20_G3BP1_ZIKV/4-phenotyping/predictions/quantify_remodeling/G3BP1/PC_vs_CF_organelle_wellC2_160patch.svg",
@srivarra (Contributor)

thought(non-blocking)

Saving this in the comment for me for later

[image attachment]

@Soorya19Pradeep (Contributor Author)
@srivarra, the computed feature set depends on the organelle data, and at this point I keep changing the set as I work with different organelles. I think a tool like CellProfiler can compute a ~1000-feature list, which can then be filtered down to the most significant features. I haven't implemented anything like that yet. Once I do, the feature list will be more stable and ready to be added to the AnnData.

@srivarra (Contributor)

@Soorya19Pradeep

the computed feature set depends on the organelle data. At this point I keep changing the set as I work with different organelles.

Ah gotcha, so keep the CSV output for quick iterations until we know exactly what features we want?
