The goal is to be able to give an estimate stating something like "X% of reads from sample Y recruit to the MAGs recovered from sample Y" (or put another way, "How much of the starting read data made it through assembly and high-quality binning and is represented by the recovered MAGs?")
Contig-level coverage is already generated and provided for individual samples, so it might be possible to piggyback on that to get how much of a given sample's total reads the MAGs capture.
If I don't see an easier way to generate this info from what is already produced, the "long" way could be (for each sample) building a new bowtie2 index of all recovered MAGs, mapping the reads, and parsing/summarizing the result.
or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation... Will have to bug @ctb about it :)
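For the mapping-based route, the final "parsing/summarizing" step could look roughly like the sketch below. It assumes the per-sample mapping has already been summarized with `samtools idxstats` (tab-separated: reference name, length, mapped read-segments, unmapped read-segments) and that the set of contig names belonging to recovered MAGs is known. The function name `fraction_reads_in_mags` and the example numbers are hypothetical, not part of the workflow.

```python
import csv
import io


def fraction_reads_in_mags(idxstats_text, mag_contigs):
    """Fraction of all read segments that mapped to contigs belonging to MAGs.

    idxstats_text: text in `samtools idxstats` layout (assumed input format).
    mag_contigs:   set of contig names that ended up in recovered MAGs.
    """
    mag_mapped = 0
    total = 0
    for row in csv.reader(io.StringIO(idxstats_text), delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        name, mapped, unmapped = row[0], int(row[2]), int(row[3])
        total += mapped + unmapped  # the "*" row carries fully unmapped reads
        if name in mag_contigs:
            mag_mapped += mapped
    return mag_mapped / total if total else 0.0


# Made-up example: 800 of 1000 read segments land on MAG contigs.
example = (
    "contig_1\t5000\t600\t0\n"
    "contig_2\t3000\t200\t0\n"
    "contig_3\t1000\t100\t0\n"
    "*\t0\t0\t100\n"
)
fraction = fraction_reads_in_mags(example, {"contig_1", "contig_2"})
print(f"{fraction:.1%} of read segments recruit to MAG contigs")
```

If the reads were mapped to the full assembly rather than to a MAG-only index, this kind of summary would reuse the existing mapping instead of requiring a new bowtie2 index per sample.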
AstrobioMike changed the title from "capture and report proportion of reads represented by MAGs for each sample" to "[Metagenomics wf] capture and report proportion of reads represented by MAGs for each sample" on Jun 21, 2024
> or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation...
Yep! This is precisely what the f_unique_weighted and n_unique_weighted_found columns provide from sourmash gather output. f_unique_weighted is an estimate of the proportion of bases in the total read data set that will map; n_unique_weighted_found is an estimate of the number of bases that will map. And, in case you want it, f_match is the fraction of the MAG that will be covered by mapped reads (aka "detection"). See Irber et al., 2022 for science and algorithms, and sourmash docs for column details. And if you want friendlier text, read from here on down in the FAQ ;).
As you intuit, they are likely a lower bound when using k=21 or k=31; mapping will be a bit more flexible in its matching.
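As a rough illustration of how those columns could be turned into a per-sample number, here is a minimal sketch that reads a sourmash gather CSV and sums `f_unique_weighted` and `n_unique_weighted_found` across all matched MAGs. It assumes the sample's reads were the query and the MAGs were the search database, and that summing the per-match values gives the overall estimate; the helper name `summarize_gather` and the example values are made up, and a real gather CSV has many more columns than shown here.

```python
import csv
import io


def summarize_gather(gather_csv_text):
    """Sum the weighted columns of a sourmash gather CSV over all matches."""
    total_fraction = 0.0
    total_bases = 0
    n_matches = 0
    for row in csv.DictReader(io.StringIO(gather_csv_text)):
        # Columns are looked up by name, so extra columns are simply ignored.
        total_fraction += float(row["f_unique_weighted"])
        total_bases += int(float(row["n_unique_weighted_found"]))
        n_matches += 1
    return {
        "n_matches": n_matches,
        "est_fraction_of_reads": total_fraction,
        "est_bases": total_bases,
    }


# Made-up gather output for two MAGs (illustrative values only).
example_csv = (
    "name,f_unique_weighted,n_unique_weighted_found,f_match\n"
    "MAG_1,0.25,2500,0.9\n"
    "MAG_2,0.125,1250,0.7\n"
)
summary = summarize_gather(example_csv)
print(f"{summary['est_fraction_of_reads']:.1%} of the read data "
      f"is estimated to be represented by {summary['n_matches']} MAGs")
```

Given the lower-bound caveat above, a number produced this way is probably best reported as a conservative estimate alongside the k-mer size used.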
One convenience, should you wish it - if the MAGs have not been dereplicated, you can still use the results just fine. sourmash will not double count reads/bases; see sourmash-bio/sourmash#3188 for more discussion.
Another convenience is that you can take the output of sourmash gather and convert it into a variety of taxonomic reports with sourmash tax metagenome; lmk if you'd like to hear more.
Last but by no means least, we now have multithreaded versions of gather that are wicked fast - sourmash scripts fastgather and sourmash scripts fastmultigather, from the branchwater plugin.