Improved classification time with KMC

When running `StreamingQueryDNADatabase.py`, we really only need the k-mers in the sample that also exist in the training database sketches. As such, it's possible to:
1. Dump all the training database sketch k-mers using [KMC](https://github.com/refresh-bio/KMC) (like it's done [here](https://github.com/nlapier2/Metalign/blob/master/local_tests/retrain_and_test_metalign.sh#L56-L66))
2. Use [KMC](https://github.com/refresh-bio/KMC) to count the k-mers in the sample
3. Intersect these with the k-mers in the training database sketches and dump these to a file
4. Reformat these dumped k-mers into a FASTA-looking file
5. Feed that into `StreamingQueryDNADatabase.py`
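
A rough sketch of how steps 2-3 might be driven from Python. It only builds the command lines and does not run them. The function name, file names, and output prefixes are hypothetical, and the choice of `-fq` (FASTQ input) and `-ci1` (keep k-mers seen only once) is an assumption that should be adjusted to the actual sample; `db_prefix` is assumed to be a KMC database already built from the training sketch k-mers in step 1.

```python
from typing import List


def kmc_commands(sample: str, db_prefix: str, out_prefix: str,
                 k: int, tmp_dir: str = "kmc_tmp") -> List[List[str]]:
    """Build (but do not run) the KMC command lines for steps 2-3.

    All names here are illustrative; nothing in this function is part of CMash.
    """
    sample_db = f"{out_prefix}_sample"   # hypothetical KMC database prefix
    shared_db = f"{out_prefix}_shared"   # hypothetical KMC database prefix
    return [
        # Step 2: count the k-mers in the sample.
        # Assumption: FASTQ input (-fq) and no minimum count cutoff (-ci1).
        ["kmc", f"-k{k}", "-fq", "-ci1", sample, sample_db, tmp_dir],
        # Step 3a: keep only k-mers present in both the sample and the training database.
        ["kmc_tools", "simple", sample_db, db_prefix, "intersect", shared_db],
        # Step 3b: dump the shared k-mers to a text file.
        ["kmc_tools", "transform", shared_db, "dump", f"{out_prefix}_shared.txt"],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once `tmp_dir` exists. The training database must have been built with the same k-mer length for the intersection to be meaningful.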

Steps 2-5 are basically what's done [here](https://github.com/nlapier2/Metalign/blob/master/select_db.py#L57-L80), since I suggested this approach to Nathan LaPierre but never got around to implementing it in CMash.
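
For step 4, a minimal sketch of the reformatting, assuming the dump file has one k-mer per line followed by whitespace and its count (the usual KMC text dump layout). The function name and the `>seqN` header scheme are hypothetical, not something CMash requires.

```python
from typing import Iterable, Iterator


def dump_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA-style lines, one record per dumped k-mer.

    Assumes each non-empty input line looks like "<kmer><whitespace><count>".
    The count is dropped; only the k-mer sequence is kept.
    """
    for i, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        yield f">seq{i}"   # hypothetical header; any unique name would do
        yield fields[0]    # the k-mer itself
```

Writing `"\n".join(dump_to_fasta(open_dump_file))` to a file would then give the FASTA-looking input for `StreamingQueryDNADatabase.py`; streaming line by line keeps memory use flat even for large dumps.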
