deeply sort hashes to ensure consistent fingerprints#55
deeply sort hashes to ensure consistent fingerprints#55jsvd merged 2 commits intologstash-plugins:masterfrom
Conversation
2bb56fc to
9b50a97
Compare
Fingerprinting is done by either fetching the .to_hash representation or by fetching fields with .get(field) The content of these often contain hashes, which we don't guarantee order during insertion, so this commit recursively sorts hashes. This happens in all modes: * concatenate_all_sources * concatenate_fields * normal single field fingerprint
9b50a97 to
f5f952d
Compare
|
Thinking back to the concerns raised in #41 (review), there are a few items to keep in mind before merging this. The existing implementation already sorts the top level keys of a hash, either during concatenate_all_fields or if multiple sources are listed. I ran many different scenarios and I can't generate fingerprints that are consistently different from the ones produced by this PR. Here is a link to a spreadsheet where I compared the two implementations against a different set of data and fingerprint strategies (all using SHA1): https://docs.google.com/spreadsheets/d/15tEkBJXk6_f7j4g6IHofsEAWG9ATxGI8ZFZaqMcgujA/edit?usp=sharing So in summary, I can't find a reason to declare this a breaking change. With all of this in mind I'd like a second review from @colinsurprenant and @andsel. Please let me know if I missed any corner case that can lead to breaking change for the user. |
andsel
left a comment
There was a problem hiding this comment.
LGTM
I agree with you when you says that's not a breaking change, since previously the key's order wasn't consistent, and every run with same data could give different results, so it was breaking itself. As the spreadsheet show we don't break more condition than wasn't already broken before.
|
LGTM2 |
Fingerprinting is done by either fetching the .to_hash representation
or by fetching fields with .get(field)
The content of these often contain hashes, which we don't guarantee
order during insertion, so this commit recursively sorts hashes.
This happens in all modes:
NOTE: I propose we don't consider this a breaking change (therefore not requiring a major version bump). The current ordering is unpredictable and therefore is much more of a bug than a feature. The only two paths forward would have been:
concatenate_all_sources)Option 2 was much more likely to produce breaking changes, so we went here with 1.
Fixes #39
Replaces #41, #52