ZENKO-5057: Replication should not happen if the object was created before the replication config #2258
base: development/2.12
Hello sylvainsenechal, my role is to assist you with the merge of this pull request.
Status report is not available.
Force-pushed from ef87fa5 to 61490ab.
Force-pushed from 24e59aa to f82284b.
Force-pushed from d25baef to c561171.
Request integration branches: Waiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command:
Alternatively, the
Force-pushed from d489809 to 7805158.
Review thread on tests/ctst/steps/replication.ts (outdated):
if ((expectedOutcome === 'succeed' || expectedOutcome === 'fail') &&
    (replicationStatus === 'PENDING' ||
    replicationStatus === 'PROCESSING' ||
    replicationStatus === undefined // If replication hasn't started, status is still undefined
// If replication hasn't started, status is still undefined
This is not correct: if replicationStatus is undefined, replication will not be started at all.
The way replication works is that cloudserver sets this field to PENDING when it writes the object (if replication is configured); backbeat then listens to the oplog and can filter the objects that may require replication.
→ undefined does not mean replication is pending.
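A minimal sketch of the visible part of the quoted condition once `undefined` is no longer treated as "still in progress". The helper name `shouldKeepPolling`, the `'never happen'` outcome value and the status union are assumptions for illustration, not the actual step code.

```typescript
// Hypothetical types: the status values are assumed from this thread,
// not copied from the cloudserver source.
type ExpectedOutcome = 'succeed' | 'fail' | 'never happen';
type ReplicationStatus = 'PENDING' | 'PROCESSING' | 'COMPLETED' | 'FAILED' | undefined;

// Keep polling only while replication is actually in flight. An undefined
// status means cloudserver never marked the object for replication, so
// waiting longer will not change the outcome.
function shouldKeepPolling(
    expectedOutcome: ExpectedOutcome,
    replicationStatus: ReplicationStatus,
): boolean {
    const expectsReplication = expectedOutcome === 'succeed' || expectedOutcome === 'fail';
    const inFlight = replicationStatus === 'PENDING' || replicationStatus === 'PROCESSING';
    return expectsReplication && inFlight;
}
```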
I've had a situation where, at the very start of the replication status check, the object's replication status hadn't yet been set to PENDING, so it was still undefined.
That is weird: the status is set to PENDING by cloudserver directly when writing the metadata of the new object/version, cf. [1].
If this is not the case, it means we are starting this check without waiting for the s3:putObject call to return, or maybe multiple tests are hitting the same object?
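To illustrate the point above, here is a toy in-memory model (not cloudserver code; `InMemoryBucket` and its methods are hypothetical) in which the replication status is decided in the same metadata write as the object. A read taken before `putObject` resolves sees `undefined`; once the put has been awaited, an object written after the replication configuration reads `PENDING`, while an object written before it stays `undefined`.

```typescript
type Status = 'PENDING' | undefined;

// Toy model: the replication decision is taken when the object metadata is
// written, based on whether a replication configuration exists at that time.
class InMemoryBucket {
    private replicationConfigured = false;
    private readonly statuses = new Map<string, Status>();

    setReplicationConfiguration(): void {
        // Existing objects are left untouched when replication is enabled.
        this.replicationConfigured = true;
    }

    async putObject(key: string): Promise<void> {
        // Yield once to simulate the asynchronous request round trip.
        await Promise.resolve();
        this.statuses.set(key, this.replicationConfigured ? 'PENDING' : undefined);
    }

    getReplicationStatus(key: string): Status {
        return this.statuses.get(key);
    }
}
```

In this model, a check that starts before the put returns can observe `undefined` even though the status right after the write is `PENDING`.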
You can see it in this run: https://github.com/scality/Zenko/actions/runs/17275504190/job/49031281060
If you grep for "replicationStatus", the first few values are "undefined", then "pending", and then "completed".
The crrExistingObject script sets the PENDING status of the object through a PutMetadata call:
PUT (/_/backbeat/metadata/{Bucket}/{Key+})
Maybe there is some delay, or something asynchronous in the way the metadata is updated.
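A hypothetical sketch of the behaviour described above: a "replicate existing objects" job selects objects by status (here `'NEW'` is assumed to mean "no replication status yet") and marks each one `PENDING` through a separate asynchronous metadata update. `markExistingObjectsPending` and the injected `putMetadata` callback are illustrative stand-ins for the real script and its PutMetadata call. Until the job has finished, some objects still report their previous status.

```typescript
type Status = 'PENDING' | 'COMPLETED' | 'FAILED' | undefined;
// Assumed filter values; 'NEW' selects objects that have no status yet.
type Filter = 'NEW' | 'PENDING' | 'COMPLETED' | 'FAILED';

function matchesFilter(status: Status, filter: Filter): boolean {
    return filter === 'NEW' ? status === undefined : status === filter;
}

// Marks every matching object PENDING, one metadata update at a time,
// and returns how many objects were marked.
async function markExistingObjectsPending(
    objects: Map<string, Status>,
    filter: Filter,
    putMetadata: (key: string, status: Status) => Promise<void>,
): Promise<number> {
    let marked = 0;
    // Snapshot the entries first, since putMetadata may mutate the map.
    for (const [key, status] of Array.from(objects.entries())) {
        if (matchesFilter(status, filter)) {
            await putMetadata(key, 'PENDING');
            marked += 1;
        }
    }
    return marked;
}
```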
Hmm, actually 👀, the logic that waits for the pod to complete was added after I encountered this issue, and the run linked above was not waiting for the pod to complete (waitForCompletion = false there), which is why the status was undefined. So I'll change the logic: as you said, the replicationStatus should directly be PENDING.
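A minimal sketch of waiting for the job pod before reading statuses, assuming a hypothetical `getPodPhase` callback (for example backed by a Kubernetes client). The phase names are the standard Kubernetes pod phases; the helper name, timeout and interval are assumptions. `sleep` is injectable so the helper can be exercised without real delays.

```typescript
// Kubernetes pod phases as reported in the pod status.
type PodPhase = 'Pending' | 'Running' | 'Succeeded' | 'Failed' | 'Unknown';

// Hypothetical helper: poll the pod phase until the job pod finishes, so that
// replication statuses are only read after every object has been marked.
async function waitForPodCompletion(
    getPodPhase: () => Promise<PodPhase>,
    timeoutMs: number,
    intervalMs: number,
    sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<void> {
    // Elapsed time is approximated by the number of sleeps between polls.
    const attempts = Math.floor(timeoutMs / intervalMs) + 1;
    for (let attempt = 1; attempt <= attempts; attempt++) {
        const phase = await getPodPhase();
        if (phase === 'Succeeded') {
            return;
        }
        if (phase === 'Failed') {
            throw new Error('replication job pod failed');
        }
        if (attempt < attempts) {
            await sleep(intervalMs);
        }
    }
    throw new Error(`pod did not complete within ${timeoutMs} ms`);
}
```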
Force-pushed from 7805158 to 65ab469.
Force-pushed from 65ab469 to d354dcc.
And an object "source-object-1" that "exists"
And a replication configuration to "awsbackendmismatch" location
When the job to replicate existing objects with status "NEW" is executed
Then the object should eventually "be" replicated
Then the object replication should "succeed" within 300 seconds
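A sketch of what a step like "the object replication should {string} within {int} seconds" could do, under stated assumptions: `readStatus` is a hypothetical callback returning the object's current replication status, the status names are assumed from this thread, and one read per second is an assumed polling interval. Following the review, an `undefined` status fails immediately instead of being waited on.

```typescript
type ReplicationStatus = 'PENDING' | 'PROCESSING' | 'COMPLETED' | 'FAILED' | undefined;
type Outcome = 'succeed' | 'fail';

// Hypothetical helper: poll until a terminal status is reached or the
// "within N seconds" budget runs out.
async function waitForReplicationOutcome(
    readStatus: () => Promise<ReplicationStatus>,
    expected: Outcome,
    withinSeconds: number,
    sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<void> {
    const target = expected === 'succeed' ? 'COMPLETED' : 'FAILED';
    // Roughly one status read per second (read latency is ignored).
    for (let second = 0; second <= withinSeconds; second++) {
        const status = await readStatus();
        if (status === target) {
            return;
        }
        if (status === 'COMPLETED' || status === 'FAILED') {
            throw new Error(`expected ${target}, got ${status}`);
        }
        if (status === undefined) {
            // Never marked for replication: waiting cannot change the result.
            throw new Error('object was not marked for replication');
        }
        if (second < withinSeconds) {
            await sleep(1000);
        }
    }
    throw new Error(`replication did not reach ${target} within ${withinSeconds} seconds`);
}
```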
For consistency, if we expect a replication to succeed within up to 300 seconds, shouldn't we do the same for the "never happen" case?
But as pointed out, this would artificially make the tests very long. If we think 30 seconds is enough, we should probably use it here as well.
Another approach for such "check that this never happens" tests would be to look deeper into the system and verify that the old object was scanned after enabling the replication configuration. This would give us a condition to stop on, but it is probably not realistic at the end-to-end level. Do we think the scenario "Objects created before setting up replication should not be replicated automatically" can only be tested at the Zenko level, and not in Backbeat as a functional test?
I think that, to be perfect, we should use the same timeout everywhere. In practice, I've found the failure case takes longer to be marked as a failure than the success case takes to complete, and the "never happen" case doesn't really require much waiting. I think the timings are fine as they are.
About the other point: by looking at the Backbeat codebase, we should technically be able to assert that replication doesn't happen for objects created before the replication configuration is set. But it is hard to show that something does not happen, because there is nothing to see. The test is still useful to enforce that the behavior stays the same over time, since a change to any service could break it later.
Technically, the "decision" to replicate is made by cloudserver and reflected by the status being set to PENDING only when replication is needed, so just checking this should be enough with the current design.
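A sketch of a "never happen" check under the design described above: since cloudserver decides at write time, the first read after the write already settles the question (`observeSeconds = 0`), and a short optional observation window only guards against a future change that would pick up pre-existing objects later. `assertNotReplicated` and `readStatus` are hypothetical names, and one read per second is an assumed interval.

```typescript
type ReplicationStatus = 'PENDING' | 'PROCESSING' | 'COMPLETED' | 'FAILED' | undefined;

// Hypothetical helper: fail as soon as the object gets any replication status.
async function assertNotReplicated(
    readStatus: () => Promise<ReplicationStatus>,
    observeSeconds: number,
    sleep: (ms: number) => Promise<void> = ms => new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<void> {
    for (let second = 0; second <= observeSeconds; second++) {
        const status = await readStatus();
        if (status !== undefined) {
            throw new Error(`object unexpectedly has replication status ${status}`);
        }
        if (second < observeSeconds) {
            await sleep(1000);
        }
    }
}
```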
Force-pushed from d354dcc to 8b37b22.
@@ -89,13 +89,16 @@ When('the job to replicate existing objects with status {string} is executed',
    await createAndRunPod(this, podManifest);
});

Then('the object should eventually {string} replicated', { timeout: 360_000 },
    async function (this: Zenko, replicate: 'be' | 'fail to be') {
Then('the object replication should {string} within {int} seconds', { timeout: 600_000 },
Do we actually need to increase the timeout from 6 to 10 minutes?
Did you experience timeouts?
I found that sometimes it didn't complete within 200/300 seconds, so I increased it; hopefully a larger timeout won't cause problems.
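The trade-off in this thread is simple arithmetic: Cucumber step timeouts are in milliseconds, while the step itself polls for up to "within {int} seconds". A small sketch, with hypothetical helper names and an assumed 30-second safety margin, of how much headroom each budget leaves: with a 300-second polling window, 360_000 ms leaves 60 s for the rest of the step, and 600_000 ms leaves 300 s.

```typescript
// Hypothetical helper: time left in a Cucumber step timeout (milliseconds)
// once the step has polled for its full "within {int} seconds" window.
function timeoutHeadroomMs(stepTimeoutMs: number, withinSeconds: number): number {
    return stepTimeoutMs - withinSeconds * 1000;
}

// If the headroom is too small, Cucumber aborts the step before the polling
// loop can report its own, more informative, timeout error.
// The 30 s default margin is an assumption, not a project setting.
function isStepTimeoutSufficient(stepTimeoutMs: number, withinSeconds: number, marginMs = 30_000): boolean {
    return timeoutHeadroomMs(stepTimeoutMs, withinSeconds) >= marginMs;
}
```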
Issue: ZENKO-5057