Current `Molecule` hashing is dangerous #1772
An update from a discussion yesterday between @j-wags and me: We agree that the current behavior is not very good, and also (with different weighting) that changing the current behavior will unfortunately be costly. Changing it to mapped SMILES significantly lessens the probability of a hash collision, but not quite to zero. This has the downside of a behavior change, though I'd argue this is a pure improvement and such a behavior change is justified. There are still some corner cases that could cause hash collisions, since, for example, conformers and other properties are not considered in this hash. There's already some existing machinery for what one could accurately enough describe as a hash. See Notably in this case, overriding Changing it to We didn't reach any conclusions yet, and will need to draw on some more use cases to better understand the tradeoffs. I doubt too many people are calling |
Bumping this - I had a use case in which I needed to hash molecules (since indexing into a large topology is slow). Directly hashing the molecule was slow because of the conversion step, but |
Is your feature request related to a problem? Please describe.
`Molecule.__hash__` uses (non-mapped) SMILES, which can cause hash collisions when otherwise identical molecules use different atom indices. This hash is almost never called directly, but some Python internals and downstream packages leverage objects' hashes to define uniqueness.
Here's a contrived example that demonstrates how `@functools.lru_cache` can misfire due to this hashing:
Describe the solution you'd like
Simply use mapped SMILES.
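A minimal sketch of the failure mode and of the proposed fix, using a toy stand-in rather than the OpenFF Toolkit itself. `ToyMolecule`, `MappedToyMolecule`, `per_atom_charges`, and `FAKE_CHARGES` are all hypothetical names invented for illustration. The sorted tuple plays the role of a non-mapped SMILES (atom order discarded) and the ordered tuple plays the role of a mapped SMILES (atom order kept).

```python
from functools import lru_cache


class ToyMolecule:
    """Toy stand-in (not the OpenFF class): an ordered tuple of element symbols."""

    def __init__(self, atoms):
        self.atoms = tuple(atoms)

    def unmapped_key(self):
        # Order-independent, analogous to a non-mapped SMILES.
        return tuple(sorted(self.atoms))

    def mapped_key(self):
        # Order-dependent, analogous to a mapped SMILES.
        return self.atoms

    def __eq__(self, other):
        return isinstance(other, ToyMolecule) and self.unmapped_key() == other.unmapped_key()

    def __hash__(self):
        return hash(self.unmapped_key())


class MappedToyMolecule(ToyMolecule):
    """Same toy molecule, but hashed and compared with atom order included."""

    def __eq__(self, other):
        return isinstance(other, MappedToyMolecule) and self.mapped_key() == other.mapped_key()

    def __hash__(self):
        return hash(self.mapped_key())


# Made-up per-element values, only here so the result depends on atom order.
FAKE_CHARGES = {"O": -0.8, "H": 0.4}


@lru_cache(maxsize=None)
def per_atom_charges(molecule):
    # The result is ordered by atom index, so the cache key must be too.
    return tuple(FAKE_CHARGES[element] for element in molecule.atoms)


# Same molecule, different atom ordering.
water_a = ToyMolecule(["O", "H", "H"])
water_b = ToyMolecule(["H", "O", "H"])

charges_a = per_atom_charges(water_a)
charges_b = per_atom_charges(water_b)  # cache hit: water_a's ordering comes back

# With order-aware hashing the two orderings get separate cache entries.
per_atom_charges.cache_clear()
fixed_a = per_atom_charges(MappedToyMolecule(["O", "H", "H"]))
fixed_b = per_atom_charges(MappedToyMolecule(["H", "O", "H"]))
```

`lru_cache` decides whether two calls are "the same" from the arguments' `__hash__` and `__eq__`, so any cached function whose result depends on atom ordering inherits the molecule's notion of identity. In the first pair of calls the oxygen charge lands on atom 0 for `water_b`, whose atom 0 is a hydrogen.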
Describe alternatives you've considered
My band-aid was to also pass a molecule's mapped SMILES to the function that's wrapped by `lru_cache`; my aim here is pretty much to make that unnecessary.
Additional context
A more involved case of this pathology caused Interchange's charge assignment caching to map charges to incorrect atoms in some cases, leading to critical failures in some fitting pipelines.
#522 might be related, but I don't think they're completely aligned.