When running StreamingQueryDNADatabase.py, we really only need the k-mers in the sample that also exist in the training database sketches. As such, it's possible to:
- Dump all the training database sketch k-mers using KMC (like it's done here)
- Use KMC to count the k-mers in the sample
- Intersect these with the k-mers in the training database sketches and dump these to a file
- Reformat these dumped k-mers into a FASTA-looking file
- Feed that into StreamingQueryDNADatabase.py
Steps 2-5 are basically what's done here, as I described this approach to Nathan LaPierre but never got around to implementing it in CMash.
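
For the "reformat into a FASTA-looking file" step, here is a minimal sketch of what I have in mind. It assumes the KMC text dump has one k-mer per line followed by its count, separated by whitespace. The function name `kmc_dump_to_fasta` and the `>seqN` headers are made up for illustration and are not part of CMash. It also assumes only the presence of each k-mer matters, so the counts are dropped.

```python
import io


def kmc_dump_to_fasta(dump_lines, fasta_out):
    """Write every k-mer from a KMC-style text dump as its own FASTA record.

    dump_lines: iterable of lines like "ACGT<whitespace>3" (k-mer, then count).
    fasta_out:  any writable text file-like object.
    Returns the number of records written.
    """
    n = 0
    for line in dump_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kmer = fields[0]  # the count in fields[1:] is ignored (assumption)
        fasta_out.write(f">seq{n}\n{kmer}\n")
        n += 1
    return n


# Small in-memory example; with real files, pass open file handles instead.
example_dump = ["ACGTACGT\t3\n", "TTGACCAA\t1\n"]
example_fasta = io.StringIO()
kmc_dump_to_fasta(example_dump, example_fasta)
```

Since every record is exactly one k-mer long, the resulting file should only contribute the k-mers that survived the intersection, which is the whole point of the pre-filtering.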