Multiple k-mer sizes confirmation and testing #20

@dkoslicki

Definitions:
"new method" = train once with a single, very large k-mer size, store those k-mers in a ternary search trie, and use prefix matches to infer containment values for smaller k-mer sizes.
"old method" = train and re-run CMash separately for each individual k-mer size.
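
To make the two definitions concrete, here is a minimal Python sketch. It is not CMash code: plain Python sets stand in for both the ternary search trie and the sketched training database, and the helper names (`kmers`, `containment`, `old_method`, `new_method`) are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Containment index of set a in set b: |a & b| / |a|."""
    return len(a & b) / len(a) if a else 0.0


def old_method(query, ref, k_sizes):
    """Recompute the k-mer sets from scratch for every k-mer size."""
    return {k: containment(kmers(query, k), kmers(ref, k)) for k in k_sizes}


def new_method(query, ref, k_sizes):
    """Build k-mers once at the largest size, then truncate to prefixes.

    Assumption: truncating to the first k characters plays the role of
    a prefix match in the trie.
    """
    k_max = max(k_sizes)
    query_big, ref_big = kmers(query, k_max), kmers(ref, k_max)
    estimates = {}
    for k in k_sizes:
        query_prefixes = {x[:k] for x in query_big}
        ref_prefixes = {x[:k] for x in ref_big}
        estimates[k] = containment(query_prefixes, ref_prefixes)
    return estimates
```

On the toy pair `query = "AAAAC"`, `ref = "AAAAG"` with k-mer sizes 2 and 4, `old_method` gives 0.5 at both sizes, while `new_method` gives 1.0 at k = 2 and 0.5 at k = 4: the trailing 2-mers `AC` and `AG` are not a prefix of any 4-mer, so truncation never sees them. The two methods always agree at the largest k-mer size.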

Tasks:

  • Address Multiple k-mer sizes bug #19 so we know nothing funky is happening with the current implementation.
  • Create a testing environment for comparing the old method to the new method.
  • Compare the old and new methods in estimating containment indexes between individual genomes (i.e. using this kind of command). Repeat over many different genomes to get an idea of the difference between the two methods.
  • Compare the old and new methods in estimating presence/absence of training database organisms in many simulated metagenomes (i.e. run StreamingQueryDNADatabase.py with a multiple k-mer size training database and compare to running StreamingQueryDNADatabase.py many times, each time with a training database trained with a single specific k-mer size).
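
One possible shape for the comparison harness is sketched below. It does not call CMash or StreamingQueryDNADatabase.py. Everything in it is an assumption made for illustration: the random "genomes", the point-mutation model, exact k-mer sets in place of sketches, the 0.5 presence threshold, and all function names.

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Containment index of set a in set b: |a & b| / |a|."""
    return len(a & b) / len(a) if a else 0.0


def prefix_containment(query, ref, k, k_max):
    """Containment at size k inferred from length-k prefixes of k_max-mers."""
    query_prefixes = {x[:k] for x in kmers(query, k_max)}
    ref_prefixes = {x[:k] for x in kmers(ref, k_max)}
    return containment(query_prefixes, ref_prefixes)


def random_genome(length, rng):
    """Uniform random DNA string (a stand-in for a real genome)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, rate, rng):
    """Resample each base with probability `rate` (it may stay unchanged)."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else c for c in seq)


def compare(n_pairs=20, length=2000, rate=0.02, k_sizes=(11, 15, 21),
            threshold=0.5, seed=0):
    """Per k: (mean |old - new| containment, fraction of agreeing calls).

    A "call" is presence (estimate >= threshold) or absence; the
    threshold value is a hypothetical choice.
    """
    rng = random.Random(seed)
    k_max = max(k_sizes)
    diffs = {k: [] for k in k_sizes}
    agree = {k: 0 for k in k_sizes}
    for _ in range(n_pairs):
        ref = random_genome(length, rng)
        query = mutate(ref, rate, rng)
        for k in k_sizes:
            direct = containment(kmers(query, k), kmers(ref, k))
            prefix = prefix_containment(query, ref, k, k_max)
            diffs[k].append(abs(direct - prefix))
            agree[k] += (direct >= threshold) == (prefix >= threshold)
    return {k: (sum(diffs[k]) / n_pairs, agree[k] / n_pairs) for k in k_sizes}
```

Calling `compare()` returns one `(mean absolute difference, agreement fraction)` pair per k-mer size. A real test would replace the simulated sequences with actual genomes and simulated metagenomes, and the two estimator functions with the corresponding CMash runs.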

This would be sufficient for a conference paper. More details can follow depending on interest.

For a journal publication, we would also need to:

  • Understand the theory behind the bias in the k-mer prefix truncation (@dkoslicki has already written it up)
  • Test the magnitude of the bias factor over many genomes (relatively straightforward task).
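
A rough way to measure this empirically is sketched below. The issue does not define the bias factor, so the ratio used here (prefix-derived estimate divided by direct estimate) is an assumption, as are the simulated sequences and every function name. The written-up theory may call for a different quantity.

```python
import random
import statistics


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Containment index of set a in set b: |a & b| / |a|."""
    return len(a & b) / len(a) if a else 0.0


def bias_ratios(k, k_max, n_pairs=20, length=2000, rate=0.02, seed=0):
    """Return (mean, population std dev) of prefix/direct estimate ratios.

    Hypothetical definition of the bias factor: containment inferred from
    length-k prefixes of k_max-mers, divided by containment computed
    directly on k-mers. Pairs with a zero direct estimate are skipped.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_pairs):
        # Simulated reference and a lightly mutated query (assumed model).
        ref = "".join(rng.choice("ACGT") for _ in range(length))
        query = "".join(
            rng.choice("ACGT") if rng.random() < rate else c for c in ref
        )
        direct = containment(kmers(query, k), kmers(ref, k))
        prefix = containment(
            {x[:k] for x in kmers(query, k_max)},
            {x[:k] for x in kmers(ref, k_max)},
        )
        if direct > 0:
            ratios.append(prefix / direct)
    if not ratios:
        raise ValueError("no pair had a non-zero direct estimate")
    return statistics.mean(ratios), statistics.pstdev(ratios)
```

With `k == k_max` the ratio is exactly 1 for every pair, which gives a quick sanity check before running smaller k-mer sizes over real genomes.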
