HDDS-11475. EC: Verify EC reconstruction correctness on DN #7220
"reconstruct target containers correctly. When validation fails, " +
"reconstruction tasks will fail.",
tags = ConfigTag.CLIENT)
private boolean ecReconstructValidation = false;
Are there any drawbacks to having this feature enabled? If we commit it as disabled it will likely never be turned on and we may get burned by a correctness issue this could have caught. If it is relatively safe we may not even want/need a config key to disable it.
Thank you for reviewing! I have updated the description of this PR. The validation takes the data obtained from decoding, uses it as input for a second decoding, and compares the result with the original data to verify that the first decoding was correct. This extra pass may slow down reconstruction, so whether to enable the feature should probably be left to the user.
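To illustrate the decode-then-re-decode idea described above, here is a minimal sketch. It uses a single XOR parity unit as a stand-in for the real erasure codec, which is an assumption made for brevity; it is not the Reed-Solomon coder used by Ozone or HDFS, and the class and method names are hypothetical. The principle is the same: after recovering the lost unit, treat one of the surviving input units as lost, decode it from the recovered data, and compare it with the original input.

```java
import java.util.Arrays;

// Hypothetical sketch: single XOR parity stands in for the real EC codec.
class XorReconstructValidation {

  // Rebuild the unit at index `lost` by XOR-ing every other unit in the stripe.
  // Assumes exactly one unit is missing and all units have equal length.
  static byte[] reconstruct(byte[][] stripe, int lost) {
    int len = stripe[(lost + 1) % stripe.length].length;
    byte[] out = new byte[len];
    for (int u = 0; u < stripe.length; u++) {
      if (u == lost) {
        continue;
      }
      for (int i = 0; i < len; i++) {
        out[i] ^= stripe[u][i];
      }
    }
    return out;
  }

  // Validate `recovered` by using it to re-decode a surviving unit (the probe)
  // and comparing the result with the original bytes of that unit.
  static boolean validate(byte[][] stripe, int lost, byte[] recovered) {
    int probe = (lost + 1) % stripe.length;
    byte[][] withRecovered = stripe.clone(); // shallow copy of the unit array
    withRecovered[lost] = recovered;
    byte[] redecoded = reconstruct(withRecovered, probe);
    return Arrays.equals(redecoded, stripe[probe]);
  }
}
```

The second call to `reconstruct` is the extra cost mentioned above: validation roughly doubles the decoding work per stripe.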
For Ozone, we commit a "stripe checksum" when writing each stripe to the majority of the replicas. Therefore, to prove the correctness of the reconstruction you simply have to form the new stripe checksum, which should be much more efficient than an additional EC pass. My memory of what the stripe checksum contains is a little hazy. We added it to handle the corruption case we saw with HDFS, but we never implemented the validation, i.e. it is created on write but never used on reconstruction. The stripe checksum approach would be the preferred way to perform this validation on Ozone.
You can see the stripe checksum is formed in ECKeyOutputStream. It appears to be a concatenation of the checksums of all the chunks across the stripe. Hopefully there is a way to combine the newly reconstructed chunks with the existing ones to form the same checksum.
What is the plan of action on this change? Should we alter the approach to use stripe checksums for validation?
I would suggest altering the approach to use the checksum. It should be a case of locating the correct sequence of checksums in the stripe checksum and verifying that the newly reconstructed checksums match. Then we know the reconstruction has created the same data.
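A rough sketch of that check follows. It assumes the stripe checksum is a plain concatenation of fixed-width per-chunk checksums ordered by chunk index, and it uses CRC32 with a 4-byte width purely for illustration. The actual layout and checksum type written by ECKeyOutputStream would need to be confirmed against the code, and the class and method names here are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical sketch: layout and checksum type are assumptions, not Ozone's format.
class StripeChecksumSketch {
  static final int CHECKSUM_BYTES = 4; // assumed fixed width of one chunk checksum

  // CRC32 of one chunk, truncated to the 32 bits that carry the value.
  static int crc(byte[] chunk) {
    CRC32 c = new CRC32();
    c.update(chunk, 0, chunk.length);
    return (int) c.getValue();
  }

  // Write side: concatenate the checksums of all chunks across the stripe.
  static byte[] buildStripeChecksum(byte[][] chunks) {
    ByteBuffer buf = ByteBuffer.allocate(chunks.length * CHECKSUM_BYTES);
    for (byte[] chunk : chunks) {
      buf.putInt(crc(chunk));
    }
    return buf.array();
  }

  // Reconstruction side: locate the slot for chunkIndex in the stored stripe
  // checksum and compare it with the checksum of the reconstructed chunk.
  static boolean matches(byte[] stripeChecksum, int chunkIndex, byte[] reconstructed) {
    int expected = ByteBuffer.wrap(stripeChecksum).getInt(chunkIndex * CHECKSUM_BYTES);
    return expected == crc(reconstructed);
  }
}
```

Compared with a second EC pass, this costs one checksum computation per reconstructed chunk plus a lookup into metadata that is already stored with the surviving replicas.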
@sodonnel Thank you for the suggestions! I agree that it's more convenient and efficient to verify EC reconstruction with the stripe checksum in Ozone. I may need to close this PR and learn more about how the checksum works.
What changes were proposed in this pull request?
HDFS-15759 shows a good way to guard against potential EC reconstruction correctness issues on the datanode, so maybe we can adapt it to Ozone.
This patch has been running on our clusters for over 3 months.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11475
How was this patch tested?
Unit tests and online tests.