Improved classification time with KMC #15

@dkoslicki

When running StreamingQueryDNADatabase.py, we really only need the k-mers in the sample that also exist in the training database sketches. As such, it's possible to:

  1. Dump all the training database sketch k-mers using KMC (as is done here)
  2. Use KMC to count the k-mers in the sample
  3. Intersect the sample k-mers with the k-mers in the training database sketches and dump the result to a file
  4. Reformat these dumped k-mers into a FASTA-looking file
  5. Feed that into StreamingQueryDNADatabase.py

Steps 2-5 are basically what's done here, since I described this approach to Nathan LaPierre, but I never got around to implementing it in CMash.
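
For illustration, steps 3 and 4 could be sketched in plain Python roughly as follows. This is not CMash code: the function names are hypothetical, and it assumes the KMC dumps are text files with one k-mer per line followed by a whitespace-separated count, with the sample and sketch k-mers written in the same (e.g. canonical) form. In practice the intersection could be done by KMC itself; this only shows the idea.

```python
from typing import Iterable, Iterator, Set, TextIO


def read_dumped_kmers(handle: TextIO) -> Iterator[str]:
    """Yield the k-mer (first column) from each non-empty line of a text dump.

    Assumption: each line looks like "<kmer> <count>"; the count is ignored.
    """
    for line in handle:
        fields = line.split()
        if fields:
            yield fields[0].upper()


def intersect_with_sketches(
    sample_kmers: Iterable[str], sketch_kmers: Set[str]
) -> Iterator[str]:
    """Yield each sample k-mer that is also in the sketch set, once, in input order."""
    seen = set()
    for kmer in sample_kmers:
        if kmer in sketch_kmers and kmer not in seen:
            seen.add(kmer)
            yield kmer


def write_fasta_like(kmers: Iterable[str], out: TextIO) -> int:
    """Write one FASTA-looking record per k-mer and return the number written.

    The ">seqN" header naming is a placeholder, not a format CMash requires.
    """
    count = 0
    for count, kmer in enumerate(kmers, start=1):
        out.write(f">seq{count}\n{kmer}\n")
    return count
```

With the sketch k-mers loaded into a set via `set(read_dumped_kmers(...))`, the file produced by `write_fasta_like` would be what gets fed to StreamingQueryDNADatabase.py in step 5.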
