Fix non-deterministic fingerprint for nested hashes by sorting #41
ypid-geberit wants to merge 4 commits into logstash-plugins:master
Conversation
Force-pushed 56b8e0b to f7c5d56
Consider this example:
```JSON
{
  "@timestamp": "2023-05-23T23:23:23.555Z",
  "@version": "1",
  "beat": {
    "hostname": "gnu.example.com",
    "name": "gnu.example.com",
    "version": "5.2.2"
  },
  "host": "gnu.example.com"
}
```
Using this filter:
```Logstash
fingerprint {
  concatenate_all_fields => true
  target => "[@metadata][_id]"
  method => "SHA512"
  key => "XXX"
  base64encode => true
}
```
Here, the order of the `.beat` hash is non-deterministic and the plugin
did not do a deep sort as part of the serialization. This resulted in
different fingerprints for the same event because the order of the three
keys (hostname, name, version) changed randomly in the serialization.
This has been fixed by recursively checking for hashes and serializing
them in sorted order.
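The idea can be sketched in plain Ruby (a hypothetical `deep_serialize` helper, not the plugin's actual code): recursively sort hash keys before serializing, so two events whose nested hashes were built in different insertion orders produce the same string.

```ruby
# Hypothetical sketch of deterministic serialization via recursive key
# sorting. Not the plugin's actual implementation.
def deep_serialize(value)
  case value
  when Hash
    "{" + value.sort_by { |k, _| k.to_s }
               .map { |k, v| "#{k}:#{deep_serialize(v)}" }
               .join(",") + "}"
  when Array
    "[" + value.map { |v| deep_serialize(v) }.join(",") + "]"
  else
    value.to_s
  end
end

# Same pairs, different insertion order: serializations now match.
a = { "beat" => { "hostname" => "gnu.example.com", "version" => "5.2.2" } }
b = { "beat" => { "version" => "5.2.2", "hostname" => "gnu.example.com" } }
deep_serialize(a) == deep_serialize(b)  # => true
```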
Note that this changes the serialization format and thus breaks
backwards compatibility. The old format could be emulated in order to
not break backwards compatibility; backwards compatibility in this case
means generating the same fingerprint for the same input. This is also
the reason why the unit test on Travis fails.
Closes: logstash-plugins#39
Force-pushed f7c5d56 to 357a36e
yaauie left a comment:
@ypid-geberit thank you for contributing!
Unfortunately, I believe this to be a breaking change (shown off by the broken spec and otherwise noted in-line), and will need to be either an opt-in feature or accompany a major version change (and not be bundled with Logstash until 7.0).
Because this plugin is widely used to generate consistent ids, there is a large risk that any breaking change here will have significant and surprising repercussions to many users. Even if we were to ship with a major version, there would need to be a backward-compatibility fallback that has unmodified behaviour. I would advocate instead to make this an opt-in-feature within the current major.
To do so, we would need a flag (naming is hard, but maybe force_sort?) that defaults to false, but when set causes the new sorting behaviour to be invoked.
Since Logstash does maintain iteration order of Hashes once they have been created, a near-term workaround would be to explicitly order the hash before fingerprinting it:
```Logstash
filter {
  ruby {
    # sort beat for consistent fingerprinting
    code => "beat = event.get('beat'); beat && event.set('beat', Hash[beat.sort])"
  }
  fingerprint {
    # ...
  }
}
```
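As a plain-Ruby illustration of why this workaround helps (outside the Logstash event API): `Hash[h.sort]` rebuilds the hash with its pairs in sorted order, pinning the iteration order and hence the serialization. Note it only sorts one level deep.

```ruby
# Same key/value pairs inserted in different orders.
beat1 = { "version" => "5.2.2", "hostname" => "gnu.example.com", "name" => "gnu.example.com" }
beat2 = { "hostname" => "gnu.example.com", "name" => "gnu.example.com", "version" => "5.2.2" }

beat1.to_a == beat2.to_a                        # => false: insertion order differs
Hash[beat1.sort].to_a == Hash[beat2.sort].to_a  # => true: order normalized
```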
lib/logstash/filters/fingerprint.rb (outdated)

```diff
-      event.to_hash.sort.map do |k,v|
-        to_string << "|#{k}|#{v}"
-      end
+      to_string << serialize(event)
```
This would be a breaking change for any pipeline configuration that makes use of `concatenate_all_fields`, because the string changes from pipe-delimited kv to JSON-ish.
```Ruby
if event.respond_to?(:to_hash)
  to_string << "{"
  event.to_hash.sort.map do |k,v|
    to_string << "#{k}:#{serialize(v)},"
```
this would be a breaking change in two ways:

- current `Hash#to_s` outputs key/value pairs joined by `=>`, not by `:`
- ordering

Ideally, `deep_sort` would be the default some day.
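For reference, the first point can be checked in plain Ruby (exact spacing of the output varies across Ruby versions, so only the delimiter is asserted here):

```ruby
# Hash#to_s delegates to inspect, which joins pairs with "=>", not ":".
h = { "hostname" => "gnu.example.com" }
h.to_s.include?("=>")  # => true
```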
Thanks for your review @yaauie 👍 Backwards compatibility is definitely a discussion point here and I find it reasonable to introduce a parameter like `force_sort`.

This would be interesting but I did not understand how I could rewrite the … An open question is also if the …
Force-pushed 16ed4f6 to eb38146
+1 we need this!
Would be really helpful to get this merged. Many thanks in advance.
+1 any updates on when this is getting merged?
#52 is probably a cleaner solution for this issue. Please have a look to get this fixed, as this issue is hard to debug and probably occurs for all users of the plugin. The question is just whether they have noticed it yet or not.
Fingerprinting is done by either fetching the `.to_hash` representation or by fetching fields with `.get(field)`. The content of these often contains hashes, which we don't guarantee order for during insertion, so this commit recursively sorts hashes. This change affects all modes:

* `concatenate_all_sources => true`
* `concatenate_fields => true`
* normal single field fingerprint (that may contain a nested Hash)

Fixes #39. Replaces #41, #52.
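The end-to-end effect can be sketched in plain Ruby (the `deep_sorted_string` helper below is hypothetical and not the plugin's exact wire format): serialize with recursively sorted hash keys, then apply HMAC-SHA512 and base64, as the filter config in the PR description would.

```ruby
require "openssl"
require "base64"

# Hypothetical deep-sorted serialization; not the plugin's exact format.
def deep_sorted_string(value)
  return value.to_s unless value.is_a?(Hash)
  value.sort.map { |k, v| "|#{k}|#{deep_sorted_string(v)}" }.join
end

# HMAC-SHA512 fingerprint over the normalized string, base64-encoded.
def fingerprint(event, key)
  digest = OpenSSL::Digest.new("SHA512")
  Base64.strict_encode64(OpenSSL::HMAC.digest(digest, key, deep_sorted_string(event)))
end

# Same pairs, different insertion order: fingerprints now match.
a = { "beat" => { "name" => "gnu", "version" => "5.2.2" } }
b = { "beat" => { "version" => "5.2.2", "name" => "gnu" } }
fingerprint(a, "XXX") == fingerprint(b, "XXX")  # => true
```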
Nice to see this bug finally got fixed by #55. Thanks! Closing this alternative PR.