Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output PCA data? #30

Open
michaelsteidel86 opened this issue Jun 13, 2024 · 5 comments
Open

Output PCA data? #30

michaelsteidel86 opened this issue Jun 13, 2024 · 5 comments

Comments

@michaelsteidel86
Copy link

Hi,

is there a way to generate an output with the actual data from PCA in a flatfile format instead of plot only?

Thanks
Michael

@ftwkoopmans
Copy link
Owner

Currently, the function that generates the report PDF does not retain the data and ggplot objects for figures shown in the PDF (it's on the wishlist for future updates). So you can't easily extract the PCA plot's coordinates.

However, after running the dataset = analysis_quickstart(...) function call we should be able to compute the PCA dimensions by using similar code to the built-in MS-DAP functions. I've composed a code snippet for you below, assuming you ran analysis_quickstart() and your dataset object is called dataset, this code should yield PCA dimensions and a plot that is the same (except for size and automatic color-codings) to the PDF report.

# peptide level data matrix, using filtered+normalized peptides across all groups
tibw_noexclude = dataset$peptides %>% 
  select(peptide_id, sample_id, intensity_all_group) %>%
  filter(!is.na(intensity_all_group)) %>%
  pivot_wider(id_cols = peptide_id, names_from = sample_id, values_from = intensity_all_group)
matrix_sample_intensities = msdap:::as_matrix_except_first_column(tibw_noexclude)

# compute PCA
PPCA = pcaMethods::pca(base::scale(t(matrix_sample_intensities), center=TRUE, scale=FALSE), method="ppca", nPcs = 3, seed = 123, maxIterations = 2000)
mat_pca = PPCA@scores # PCA coordinate matrix; rownames = sample_id and colnames = PCA dimension
pca_var = array(PPCA@R2, dimnames = list(colnames(PPCA@scores))) # variation per dimension

# if you only want a table with the PCA coordinates, just print the mat_pca variable and you're done at this point

# example plot code
dims = c(1,2) # plot dimensions. in this example, can be; 1:2, 2:3, c(1,3)
p = ggplot(data.frame(x=mat_pca[,dims[1]], y=mat_pca[,dims[2]], label=rownames(mat_pca)), 
           aes(x = x, y = y, label = label)) +
  geom_point() +
  geom_text() +
  labs(x = sprintf("dimension %d (%.1f%%)", dims[1], pca_var[dims[1]] * 100),
       y = sprintf("dimension %d (%.1f%%)", dims[2], pca_var[dims[2]] * 100) ) +
  theme_bw()
print(p)

Does this address your question?

@michaelsteidel86
Copy link
Author

Thanks a lot Frank :)

Indeed this was exactly what I was looking for.

One further question: If I got this correctly this PCA is based on peptide-level evidence. Is it possible to generate it based on protein level data?

Thanks
Michael

@michaelsteidel86
Copy link
Author

Hi Frank,
I took note that the given function is able to extract most but not all datapoints as shown in the built in PCA viz. Notably in my example there are 2 datapoints which are extracted with completely different coordinates (flagged in 'red', left: bilt in PCA viz, right: re-calculated based on your functions), which does not make sense from the known batch effect...

Do you know why this happend here, and what I could do to really extract the data as shown in the "original" PCA?

Thanks in advance
Michael

image

@ftwkoopmans
Copy link
Owner

One further question: If I got this correctly this PCA is based on peptide-level evidence. Is it possible to generate it based on protein level data?

Indeed, the PCA in that example is based on peptide intensity values. If you want to compute protein intensities from the same peptide-level data, you can use the MaxLFQ algorithm to "rollup" or "summarize" peptides to protein-level. Below snippet shows how to do this using the MS-DAP function rollup_pep2prot (assuming you did library(msdap) sometime earlier). You can plug the resulting matrix_sample_intensities variable into the same PCA plotting code that was shared earlier.

tib_noexclude = dataset$peptides %>% 
  filter(!is.na(intensity_all_group)) %>%
  select(peptide_id, protein_id, sample_id, intensity = intensity_all_group)
matrix_sample_intensities = rollup_pep2prot(tib_noexclude, intensity_is_log2 = TRUE, rollup_algorithm = "maxlfq", return_as_matrix = TRUE)

@ftwkoopmans
Copy link
Owner

Hi Frank, I took note that the given function is able to extract most but not all datapoints as shown in the built in PCA viz. Notably in my example there are 2 datapoints which are extracted with completely different coordinates (flagged in 'red', left: bilt in PCA viz, right: re-calculated based on your functions), which does not make sense from the known batch effect...

Do you know why this happend here, and what I could do to really extract the data as shown in the "original" PCA?

The coordinates in both PCA figures look quite similar to me by eye but without having access to your dataset it's impossible for me to figure out where the descrepancy between both figures is coming from. For instance, is the PCA function shared in my earlier post giving slightly different coordinates than the report PDF, or is something amiss with the color-coding / matching-to-sample-ids in your example plot/code? (ergo, the PCA coords are the same but sample labeling is not)

If you're having a hard time debugging this, feel free to email your report.pdf and dataset.RData files (both files are in the MS-DAP output folder, please share via dropbox or something) to me and I'll try to reproduce the PCA in the report using your dataset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants