Skip to content

DnaStringSlice should abstract away the backing DnaString rc status #21

@d-cameron

Description

@d-cameron

The DnaStringSlice implementation is inconsistent when the slice is reverse complemented. Mer::get() is implemented (what I consider to be) correctly, but other functions only work if the slice has is_rc == false.

#[test]
fn dnastringslice_get_kmer() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    for i in 0..=3 {
        // The kmer in a slice should be the kmer of the sequencing represented
        // by that slice, regardless of whether the backing DnaString is RC or not.
        assert_eq!(slice.get_kmer::<Kmer4>(i), rcslice.get_kmer::<Kmer4>(i));
    }
}
#[test]
fn dnastringslice_slice() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    // The fact that a slice is backed by a DnaStringSlice that is
    // the rc of the slice sequence shouldn't matter.
    assert_eq!(rcslice.slice(1, 4), slice.slice(1, 4));
}

On a related topic, would it make sense to split the 2-bit encoding of a DNA string and associated kmer logic into their own crate? Two bit encoding and kmer counting is generically useful outside of de bruijn graph construction and is used for everything from kmer counting, error correction, mimimizer hash tables, to reference genome storage (e.g. sequence interval lookups in a memory maped 2bit encoded (http://genome.ucsc.edu/FAQ/FAQformat.html#format7) reference genome would be very efficient but unfortunately, you've chosen a different packed encoding^).

^ The UCSC encoding uses a bit encoding in which the MSB indicate a purine base, so complementing the sequence is XORing with 0xAAAAAAAA instead of flipping all bits which is the approach used here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions