Multiple k-mer sizes confirmation and testing

Definitions:
"new method" = use a single very large k-mer size, insert those k-mers into a ternary search trie, and use prefix matches to infer containment values for smaller k-mer sizes.
"old method" = train and re-run CMash separately for each individual k-mer size.

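A minimal sketch of the prefix idea behind the "new method", for illustration only. It uses plain Python sets of exact k-mers in place of a ternary search trie and MinHash sketches, and the helper names (`kmers`, `prefix_kmers`, `containment`) are hypothetical, not CMash functions:

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (exact, no sketching)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefix_kmers(max_kmers, k):
    """Infer size-k k-mers by truncating each large k-mer to its length-k prefix."""
    return {kmer[:k] for kmer in max_kmers}


def containment(a, b):
    """Containment of set a in set b: |a & b| / |a|."""
    return len(a & b) / len(a) if a else 0.0


seq = "ACGTACGTTGCA"
k_max, k_small = 6, 3

true_small = kmers(seq, k_small)                        # what per-size training would see
inferred_small = prefix_kmers(kmers(seq, k_max), k_small)  # what prefix matching sees

# k-mers that start within the last (k_max - k_small) positions of the
# sequence are not the prefix of any k_max-mer, so they are missed.
missed = true_small - inferred_small
print(sorted(missed))
```

The prefix-derived set is always a subset of the true k-mer set, which is one reason the two methods can disagree on short sequences.
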
Tasks:
- [x] Address #19 so we know nothing funky is happening with the current implementation.
- [x] Create testing environment to prepare for comparing old method to new method
- [ ] Compare old to new method in estimating containment indices between individual genomes (e.g. [using this kind of command](https://github.com/dkoslicki/CMash/blob/master/CMash/MinHash.py#L198)). Repeat over many different genomes to get an idea of the difference between the old and new methods.
- [ ] Compare old to new method in estimating presence/absence of training database organisms in many simulated metagenomes (i.e. run `StreamingQueryDNADatabase.py` once with a multiple k-mer size training database, and compare to running `StreamingQueryDNADatabase.py` many times with training databases each trained with a specific k-mer size).
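
The genome-vs-genome comparison could be prototyped roughly as below. This is a sketch under loud assumptions: it uses exact k-mer sets on random simulated sequences instead of CMash sketches and real genomes, and every name in it (`mutate`, `compare_methods`, the chosen k-mer sizes and mutation rate) is made up for illustration:

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq (exact, no sketching)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Containment of set a in set b: |a & b| / |a|."""
    return len(a & b) / len(a) if a else 0.0


def mutate(seq, rate, rng):
    """Redraw each base uniformly from ACGT with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else c for c in seq)


def compare_methods(seq_a, seq_b, k_max, k_sizes):
    """For each k, return (k, old, new, |old - new|).

    old: containment from k-mers extracted directly at size k.
    new: containment from length-k prefixes of the k_max-mers.
    """
    max_a, max_b = kmers(seq_a, k_max), kmers(seq_b, k_max)
    rows = []
    for k in k_sizes:
        old = containment(kmers(seq_a, k), kmers(seq_b, k))
        new = containment({x[:k] for x in max_a}, {x[:k] for x in max_b})
        rows.append((k, old, new, abs(old - new)))
    return rows


rng = random.Random(0)  # fixed seed so runs are repeatable
genome = "".join(rng.choice("ACGT") for _ in range(2000))
variant = mutate(genome, 0.02, rng)

results = compare_methods(genome, variant, k_max=21, k_sizes=[11, 15, 21])
for k, old, new, diff in results:
    print(f"k={k:2d} old={old:.4f} new={new:.4f} |diff|={diff:.4f}")
```

At `k == k_max` the two values coincide by construction, so that row works as a sanity check. Repeating this over many genome pairs and recording the `|diff|` column would give a first picture of how far the two methods diverge at smaller k.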

Completing these tasks would be sufficient for a conference paper. More details can follow depending on interest.

For a journal publication, we would also need to:
- [ ] Understand the theory behind the bias in the k-mer prefix truncation (@dkoslicki has already written it up)
- [ ] Test the magnitude of the bias factor over many genomes (relatively straightforward task).

Multiple k-mer sizes confirmation and testing #20
