Output PCA data? #30

michaelsteidel86 · 2024-06-13T19:36:39Z

Hi,

is there a way to generate an output with the actual data from PCA in a flatfile format instead of plot only?

Thanks
Michael

ftwkoopmans · 2024-06-13T23:29:34Z

Currently, the function that generates the report PDF does not retain the data and ggplot objects for figures shown in the PDF (it's on the wishlist for future updates). So you can't easily extract the PCA plot's coordinates.

However, after running the dataset = analysis_quickstart(...) function call we should be able to compute the PCA dimensions by using similar code to the built-in MS-DAP functions. I've composed a code snippet for you below, assuming you ran analysis_quickstart() and your dataset object is called dataset, this code should yield PCA dimensions and a plot that is the same (except for size and automatic color-codings) to the PDF report.

# peptide level data matrix, using filtered+normalized peptides across all groups
tibw_noexclude = dataset$peptides %>% 
  select(peptide_id, sample_id, intensity_all_group) %>%
  filter(!is.na(intensity_all_group)) %>%
  pivot_wider(id_cols = peptide_id, names_from = sample_id, values_from = intensity_all_group)
matrix_sample_intensities = msdap:::as_matrix_except_first_column(tibw_noexclude)

# compute PCA
PPCA = pcaMethods::pca(base::scale(t(matrix_sample_intensities), center=TRUE, scale=FALSE), method="ppca", nPcs = 3, seed = 123, maxIterations = 2000)
mat_pca = PPCA@scores # PCA coordinate matrix; rownames = sample_id and colnames = PCA dimension
pca_var = array(PPCA@R2, dimnames = list(colnames(PPCA@scores))) # variation per dimension

# if you only want a table with the PCA coordinates, just print the mat_pca variable and you're done at this point

# example plot code
dims = c(1,2) # plot dimensions. in this example, can be; 1:2, 2:3, c(1,3)
p = ggplot(data.frame(x=mat_pca[,dims[1]], y=mat_pca[,dims[2]], label=rownames(mat_pca)), 
           aes(x = x, y = y, label = label)) +
  geom_point() +
  geom_text() +
  labs(x = sprintf("dimension %d (%.1f%%)", dims[1], pca_var[dims[1]] * 100),
       y = sprintf("dimension %d (%.1f%%)", dims[2], pca_var[dims[2]] * 100) ) +
  theme_bw()
print(p)

Does this address your question?

michaelsteidel86 · 2024-06-14T08:07:28Z

Thanks a lot Frank :)

Indeed this was exactly what I was looking for.

One further question: If I got this correctly this PCA is based on peptide-level evidence. Is it possible to generate it based on protein level data?

Thanks
Michael

michaelsteidel86 · 2024-06-14T10:00:03Z

Hi Frank,
I took note that the given function is able to extract most but not all datapoints as shown in the built in PCA viz. Notably in my example there are 2 datapoints which are extracted with completely different coordinates (flagged in 'red', left: bilt in PCA viz, right: re-calculated based on your functions), which does not make sense from the known batch effect...

Do you know why this happend here, and what I could do to really extract the data as shown in the "original" PCA?

Thanks in advance
Michael

ftwkoopmans · 2024-06-14T13:47:59Z

One further question: If I got this correctly this PCA is based on peptide-level evidence. Is it possible to generate it based on protein level data?

Indeed, the PCA in that example is based on peptide intensity values. If you want to compute protein intensities from the same peptide-level data, you can use the MaxLFQ algorithm to "rollup" or "summarize" peptides to protein-level. Below snippet shows how to do this using the MS-DAP function rollup_pep2prot (assuming you did library(msdap) sometime earlier). You can plug the resulting matrix_sample_intensities variable into the same PCA plotting code that was shared earlier.

tib_noexclude = dataset$peptides %>% 
  filter(!is.na(intensity_all_group)) %>%
  select(peptide_id, protein_id, sample_id, intensity = intensity_all_group)
matrix_sample_intensities = rollup_pep2prot(tib_noexclude, intensity_is_log2 = TRUE, rollup_algorithm = "maxlfq", return_as_matrix = TRUE)

ftwkoopmans · 2024-06-14T13:51:43Z

Hi Frank, I took note that the given function is able to extract most but not all datapoints as shown in the built in PCA viz. Notably in my example there are 2 datapoints which are extracted with completely different coordinates (flagged in 'red', left: bilt in PCA viz, right: re-calculated based on your functions), which does not make sense from the known batch effect...

Do you know why this happend here, and what I could do to really extract the data as shown in the "original" PCA?

The coordinates in both PCA figures look quite similar to me by eye but without having access to your dataset it's impossible for me to figure out where the descrepancy between both figures is coming from. For instance, is the PCA function shared in my earlier post giving slightly different coordinates than the report PDF, or is something amiss with the color-coding / matching-to-sample-ids in your example plot/code? (ergo, the PCA coords are the same but sample labeling is not)

If you're having a hard time debugging this, feel free to email your report.pdf and dataset.RData files (both files are in the MS-DAP output folder, please share via dropbox or something) to me and I'll try to reproduce the PCA in the report using your dataset.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Output PCA data? #30

Output PCA data? #30

michaelsteidel86 commented Jun 13, 2024

ftwkoopmans commented Jun 13, 2024

michaelsteidel86 commented Jun 14, 2024

michaelsteidel86 commented Jun 14, 2024

ftwkoopmans commented Jun 14, 2024

ftwkoopmans commented Jun 14, 2024

Output PCA data? #30

Output PCA data? #30

Comments

michaelsteidel86 commented Jun 13, 2024

ftwkoopmans commented Jun 13, 2024

michaelsteidel86 commented Jun 14, 2024

michaelsteidel86 commented Jun 14, 2024

ftwkoopmans commented Jun 14, 2024

ftwkoopmans commented Jun 14, 2024