[Metagenomics wf] capture and report proportion of reads represented by MAGs for each sample #25

Open
AstrobioMike opened this issue Jun 21, 2024 · 2 comments

@AstrobioMike (Owner) commented Jun 21, 2024

  • e.g., to be able to give an estimate stating something like "Y% of the reads from sample X recruit to the MAGs recovered from sample X" (or put another way, "How much of the starting read data made it through assembly and high-quality binning and is represented by the recovered MAGs?")
  • contig-level coverage is already generated and provided for individual samples; we might be able to piggyback on that to get how much of the total reads for a given sample the MAGs capture
  • if I don't see an easier way to generate this info from what is already produced, the "long" way could be (for each sample) making new bowtie2 indexes of all recovered MAGs, mapping reads, and parsing/summarizing that (see the sketch below this list)
  • or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation... Will have to bug @ctb about it :)
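
A minimal sketch of what that "long" way could look like for one sample (file names, index name, and thread count here are just placeholders):

cat sample-A-bins/*.fasta > sample-A-MAGs.fasta
bowtie2-build sample-A-MAGs.fasta sample-A-MAGs-idx
bowtie2 -p 8 -x sample-A-MAGs-idx -1 sample-A-R1.fq.gz -2 sample-A-R2.fq.gz -S sample-A-to-MAGs.sam
# the alignment summary bowtie2 prints to stderr ends with an "overall alignment rate",
# which would be the proportion of this sample's reads recruited by its recovered MAGs
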
@AstrobioMike AstrobioMike self-assigned this Jun 21, 2024
@AstrobioMike AstrobioMike changed the title capture and report proportion of reads represented by MAGs for each sample [Metagenomics wf] capture and report proportion of reads represented by MAGs for each sample Jun 21, 2024

ctb commented Jun 21, 2024

  • or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation...

Yep! This is precisely what the f_unique_weighted and n_unique_weighted_found columns of sourmash gather output provide. f_unique_weighted is an estimate of the proportion of bases in the total read data set that will map; n_unique_weighted_found is an estimate of the number of bases that will map. And, in case you want it, f_match is the fraction of the MAG that will be covered by mapped reads (aka "detection"). See Irber et al., 2022 for the science and algorithms, and the sourmash docs for column details. And if you want friendlier text, read from here on down in the FAQ ;).

As you intuit, they are likely a lower bound when using k=21 or k=31; mapping will be a bit more flexible in its matching.

One convenience, should you wish it - if the MAGs have not been dereplicated, you can still use the results just fine. sourmash will not double count reads/bases; see sourmash-bio/sourmash#3188 for more discussion.

Another convenience is that you can take the output of sourmash gather and convert it into a variety of taxonomic reports with sourmash tax metagenome; lmk if you'd like to hear more.
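
(As a rough pointer only, that step could look something like the line below; both file names are hypothetical and the flag names are from memory, so double-check sourmash tax metagenome --help:)

sourmash tax metagenome -g metagenome-gather-results.csv -t MAGs-taxonomy.csv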

Last but by no means least, we now have multithreaded versions of gather that are wicked fast - sourmash scripts fastgather and sourmash scripts fastmultigather, from the branchwater plugin.

Some example commands:

mamba create -y -n smash 'sourmash_plugin_branchwater>=0.9.5'
mamba activate smash

sourmash sketch dna mags*.fasta -p k=31 -o MAGs.sig.zip
sourmash sketch dna metagenome-R1.fq.gz metagenome-R2.fq.gz -p k=31,abund -o metagenome.sig.zip

sourmash scripts fastgather -c 32 metagenome.sig.zip MAGs.sig.zip -o metag.x.MAGs.csv

and then inspect metag.x.MAGs.csv for the columns above.
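
And if it helps, one rough way to total those up (assuming the output CSV has the f_unique_weighted column discussed above, and that match names don't contain embedded commas, since this is a plain awk parse):

awk -F',' 'NR==1 { for (i=1; i<=NF; i++) if ($i == "f_unique_weighted") col = i; next }
           { total += $col }
           END { print "estimated fraction of read data represented by the MAGs:", total }' metag.x.MAGs.csv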

HTH, ask questions as you have them!

@AstrobioMike (Owner, Author) replied

You rock, @ctb! Thanks for the overview, details, and quick code example! I should randomly ping you more often :)
