[Metagenomics wf] capture and report proportion of reads represented by MAGs for each sample #25

Open
AstrobioMike opened this issue Jun 21, 2024 · 2 comments

@AstrobioMike (Owner) commented Jun 21, 2024

  • e.g., to be able to give an estimate stating something like "Y% of the reads from sample X recruit to the MAGs recovered from sample X" (or put another way, "How much of the starting read data made it through assembly and high-quality binning and is represented by the recovered MAGs?")
  • contig-level coverage is already generated and provided for individual samples; we might be able to piggyback on that to get how much of the total reads for a given sample the MAGs capture
  • if I don't see an easier way to generate this info from what is already produced, the "long" way could be (for each sample) making new bowtie2 indexes of all recovered MAGs, mapping reads, and parsing/summarizing that (see the sketch below this list)
  • or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation... Will have to bug @ctb about it :)
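
A minimal sketch of what that "long" way could look like for one sample (file names, index name, and thread count here are just placeholders):

cat sample-A-bins/*.fasta > sample-A-MAGs.fasta
bowtie2-build sample-A-MAGs.fasta sample-A-MAGs-idx
bowtie2 -p 8 -x sample-A-MAGs-idx -1 sample-A-R1.fq.gz -2 sample-A-R2.fq.gz -S sample-A-to-MAGs.sam
# the alignment summary bowtie2 prints to stderr ends with an "overall alignment rate",
# which would be the proportion of this sample's reads recruited by its recovered MAGs
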
@AstrobioMike AstrobioMike self-assigned this Jun 21, 2024
@AstrobioMike AstrobioMike changed the title capture and report proportion of reads represented by MAGs for each sample [Metagenomics wf] capture and report proportion of reads represented by MAGs for each sample Jun 21, 2024

ctb commented Jun 21, 2024

  • or maybe there's a fancy, quick kmer way to do this that would yield virtually the same info as mapping (e.g., "What proportion of kmers in the reads are found in the MAGs?"). Though maybe that will start to underestimate more and more with increased "intra-population" variation...

Yep! This is precisely what the f_unique_weighted and n_unique_weighted_found columns of sourmash gather output provide. f_unique_weighted is an estimate of the proportion of bases in the total read data set that will map; n_unique_weighted_found is an estimate of the number of bases that will map. And, in case you want it, f_match is the fraction of the MAG that will be covered by mapped reads (aka "detection"). See Irber et al., 2022 for the science and algorithms, and the sourmash docs for column details. And if you want friendlier text, read from here on down in the FAQ ;).

As you intuit, they are likely a lower bound when using k=21 or k=31; mapping will be a bit more flexible in its matching.

One convenience, should you wish it - if the MAGs have not been dereplicated, you can still use the results just fine. sourmash will not double count reads/bases; see sourmash-bio/sourmash#3188 for more discussion.

Another convenience is that you can take the output of sourmash gather and convert it into a variety of taxonomic reports with sourmash tax metagenome; lmk if you'd like to hear more.
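
(As a rough pointer only, that step could look something like the line below; both file names are hypothetical and the flag names are from memory, so double-check sourmash tax metagenome --help:)

sourmash tax metagenome -g metagenome-gather-results.csv -t MAGs-taxonomy.csv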

Last but by no means least, we now have multithreaded versions of gather that are wicked fast - sourmash scripts fastgather and sourmash scripts fastmultigather, from the branchwater plugin.

Some example commands:

mamba create -y -n smash 'sourmash_plugin_branchwater>=0.9.5'
mamba activate smash

sourmash sketch dna mags*.fasta -p k=31 -o MAGs.sig.zip
sourmash sketch dna metagenome-R1.fq.gz metagenome-R2.fq.gz -p k=31,abund -o metagenome.sig.zip

sourmash scripts fastgather -c 32 metagenome.sig.zip MAGs.sig.zip -o metag.x.MAGs.csv

and then inspect metag.x.MAGs.csv for the columns above.
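
And if it helps, one rough way to total those up (assuming the output CSV has the f_unique_weighted column discussed above, and that match names don't contain embedded commas, since this is a plain awk parse):

awk -F',' 'NR==1 { for (i=1; i<=NF; i++) if ($i == "f_unique_weighted") col = i; next }
           { total += $col }
           END { print "estimated fraction of read data represented by the MAGs:", total }' metag.x.MAGs.csv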

HTH, ask questions as you have them!

@AstrobioMike (Owner, Author) replied

You rock, @ctb! Thanks for the overview, details, and quick code example! I should randomly ping you more often :)
