-
Notifications
You must be signed in to change notification settings - Fork 4.1k
roachtest: deflake online-restore recovery #160027
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
11d4885 to
f05d846
Compare
Potential Bug(s) DetectedThe three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
|
f05d846 to
e0b8dd9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses a flaky roachtest in the online-restore recovery functionality. The test was failing when the schemachange workload truncated tables, because the test only deleted SSTs from the full backup layer. After truncation, the deleted SST could be skipped during restore if its span was no longer covered.
Key Changes:
- Renamed
deleteUserTableSSTtodeleteSSTFromBackupLayersto reflect the new behavior - Modified the function to delete one SST from each layer (directory) of the backup collection instead of just one SST from the full backup
- Updated SQL query to use a CTE with regex-based directory extraction and window functions to select one SST per layer
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/cmd/roachtest/tests/mixed_version_backup.go | Refactored SST deletion logic to delete one SST from each backup layer, making the test resilient to table truncation |
| pkg/cmd/roachtest/tests/backup_restore_roundtrip.go | Updated function call to use the renamed deleteSSTFromBackupLayers method |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The OR recovery roachtest was failing due to the schemachange workload truncating a table and the roachtest only deleting from the full backup. In certain scenarios, the SST deleted from the full backup would end up being skipped in the restore due to the span it covered never being covered after truncation. This commit teaches OR to instead delete an SST from every layer of the backup collection, which should make the test resilient to truncation. Fixes: cockroachdb#159503 Release note: None
e0b8dd9 to
1051aac
Compare
msbutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! if you haven't already done so, maybe run the test 5 times to make sure there are no further related flakes?
|
@msbutler Yup, already ran it a couple of times! Although to be honest the flakes for these tests are so infrequent, I think they'd have to run for quite a while. We usually go a couple months before a flake on this test. |
|
TFTR! bors r=msbutler |
|
Build succeeded: |
|
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #159503: branch-release-25.3, branch-release-25.4. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
The OR recovery roachtest was failing due to the schemachange workload truncating a table and the roachtest only deleting from the full backup. In certain scenarios, the SST deleted from the full backup would end up being skipped in the restore due to the span it covered never being covered after truncation.
This commit teaches OR to instead delete an SST from every layer of the backup collection, which should make the test resilient to truncation.
Fixes: #159503
Release note: None