Expose a more straightforward API for base26 encoding #193
Artoria2e5
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
The current base26 API is constrained by two things: segment length (only exactly 14 and 9) and byte boundary. We have:
This presents a bit of a problem for downstream users trying to make longer hashes, specifically in RInChi's 17-character key (https://github.com/IUPAC-InChI/RInChI/blob/46643428f8a547114466dc34ab847bfb890b836a/src/lib/rinchi_hashing.cpp#L133-L151)
The comment says
chksum[8] = bits 65->.... This is a lie. We are actually hashing bits64..77, with bit 64 used twice. (See IUPAC-InChI/RInChI#28.) This is an unfortunate waste of a bit of distinguishing power. It has already happened, but future Keys need not suffer from this if we understand why.Why did it happen? Can it be better?
To fit the current base26 API. It only provides for consuming byte-aligned data at fixed offsets. Consider instead this Python description of the base26 algorithm:
By using Python's native integers which are arbitrary precision ("bigints"), this implementation does not need to care about byte offsets or masks (remember your struggle with
FIX_BASE26_ENC_BUG?). It just keeps taking the least significant bits of the hash interpreted as a little-endian.It would be unreasonable to expect users to compile InChI with a bigint library like GMP. But we don't need a bigint library to do something similar:
gcd(14, 8) = 56, which fits in aUINT64. Everything supports UINT64. Therefore consider this expression instead:The original API can be implemented as (the caveat being a potential OOB in the first four if
len(a) < 7-- modify take56LE to take a length argument if concerned):A future maker of bigger hashes may decide to use:
There! It's easier to implement for the library (no more endianness woes). It's clear about what it does, which is great for the user. The compiler sees simpler expressions with more common parts to eliminate (sprinkle some
inlinearound). Everyone wins.Tangentials
Beta Was this translation helpful? Give feedback.
All reactions