
Conversation

SylvainSenechal
Contributor

Issue: ZENKO-5057

@bert-e
Contributor

bert-e commented Aug 21, 2025

Hello sylvainsenechal,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description | privileged | authored |
| --- | --- | --- | --- |
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset TBA | | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands

| name | description | privileged |
| --- | --- | --- |
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request TBA | |
| /clear | Remove all comments from Bert-E from the history TBA | |
| /retry | Re-start a fresh build TBA | |
| /build | Re-start a fresh build TBA | |
| /force_reset | Delete integration branches & pull requests, and restart the merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.

SylvainSenechal changed the base branch from development/2.13 to improvement/ZENKO-5050 on August 21, 2025 09:02
SylvainSenechal force-pushed the improvement/ZENKO-5057 branch 6 times, most recently from ef87fa5 to 61490ab, on August 21, 2025 11:21
SylvainSenechal changed the title from "Improvement/zenko 5057" to "zenko 5057 : Replication should not happen if object created before the replication config" on Aug 21, 2025
SylvainSenechal force-pushed the improvement/ZENKO-5050 branch 2 times, most recently from 24e59aa to f82284b, on August 22, 2025 08:25
SylvainSenechal force-pushed the improvement/ZENKO-5050 branch 5 times, most recently from d25baef to c561171, on September 1, 2025 15:03
Base automatically changed from improvement/ZENKO-5050 to development/2.12 on September 2, 2025 17:23
@bert-e
Contributor

bert-e commented Sep 2, 2025

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

SylvainSenechal force-pushed the improvement/ZENKO-5057 branch 2 times, most recently from d489809 to 7805158, on September 2, 2025 20:27
SylvainSenechal marked this pull request as ready for review on September 3, 2025 15:09
scality deleted a comment from bert-e on Sep 3, 2025
if ((expectedOutcome === 'succeed' || expectedOutcome === 'fail') &&
(replicationStatus === 'PENDING' ||
replicationStatus === 'PROCESSING' ||
replicationStatus === undefined // If replication hasn't started, status is still undefined
Contributor

@francoisferrand Sep 3, 2025

// If replication hasn't started, status is still undefined

This is not correct: if replicationStatus is undefined, then replication will not be started at all.
The way replication works is that cloudserver sets this field to PENDING when modifying the object (and replication is configured); then backbeat listens to the oplog and can filter objects which may require replication.

→ undefined does not mean replication is pending
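
To make the distinction concrete, here is a minimal sketch of the states involved (illustrative only, not code from this PR):

```ts
// Possible replication states on the object metadata; `undefined` means
// cloudserver never scheduled replication for this object at all.
type ReplicationStatus = 'PENDING' | 'PROCESSING' | 'COMPLETED' | 'FAILED' | undefined;

function isReplicationInProgress(status: ReplicationStatus): boolean {
    // Only PENDING / PROCESSING mean "scheduled but not finished yet".
    return status === 'PENDING' || status === 'PROCESSING';
}
```

With such a helper, the condition above would keep polling only while replication is actually in progress, and treat undefined as "replication was never triggered".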

Contributor Author

I've had a situation where, at the very start of the replication status check, the object replication status hadn't had time to be set to PENDING yet, so it was still undefined.

Contributor

@francoisferrand Sep 5, 2025

That is weird: the status is set to PENDING in cloudserver directly when writing the metadata of the new object/version, cf. [1]

[1] https://github.com/scality/cloudserver/blob/d0ae11ef265a765de2fb6f775daf66d0e971de11/lib/api/apiUtils/object/createAndStoreObject.js#L139

If this is not the case, it means we are starting this check without waiting for the "return" of the s3:putObject - or maybe multiple tests are hitting the same object?
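
In other words, the expected sequence is roughly this (sketch only, using the plain AWS SDK v3 rather than the repo's test helpers):

```ts
import { S3Client, PutObjectCommand, HeadObjectCommand } from '@aws-sdk/client-s3';

// Wait for the putObject call to return, then read the object back: if replication
// is configured on the bucket, cloudserver has already written ReplicationStatus=PENDING.
async function putAndCheckStatus(s3: S3Client, Bucket: string, Key: string) {
    await s3.send(new PutObjectCommand({ Bucket, Key, Body: 'test data' }));
    const head = await s3.send(new HeadObjectCommand({ Bucket, Key }));
    return head.ReplicationStatus; // expected: 'PENDING', not undefined
}
```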

Contributor Author

You can see it on this run: https://github.com/scality/Zenko/actions/runs/17275504190/job/49031281060
If you grep "replicationStatus", the first few are "undefined", then "pending", and then "completed".

The crrExistingObject script sets the PENDING status of the object with a PutMetadata method:
PUT (/_/backbeat/metadata/{Bucket}/{Key+})
Maybe there is some delay, or something asynchronous, in the way the metadata is updated.
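
If the PutMetadata write really were applied asynchronously, a small polling helper could make the test robust to it (hypothetical sketch, assuming a getStatus callback that reads the object's replication status):

```ts
// Hypothetical polling helper: retry until the PutMetadata write becomes visible
// (status no longer undefined) or the timeout expires, then assert on the value.
async function waitForReplicationStatus(
    getStatus: () => Promise<string | undefined>,
    timeoutMs = 30_000,
    intervalMs = 1_000,
): Promise<string | undefined> {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        const status = await getStatus();
        if (status !== undefined) {
            return status;
        }
        await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
    return undefined;
}
```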

Contributor Author

Hmm, actually 👀, the logic I added to wait for the pod to complete came after I encountered this issue, and the run I linked above was not waiting for the pod to complete (waitForCompletion = false here), so that's why the status was undefined. I'll change the logic; as you said, the replicationStatus should directly be PENDING.

And an object "source-object-1" that "exists"
And a replication configuration to "awsbackendmismatch" location
When the job to replicate existing objects with status "NEW" is executed
Then the object should eventually "be" replicated
Then the object replication should "succeed" within 300 seconds
Contributor

For consistency, if we expect a replication to succeed within up to 300s, we should do the same for the "never happen" case, no?
But as pointed out, this would artificially make the tests very long. If we think 30s is enough, we should probably use it here as well.

Another approach for such a "test that this never happens" would be to look more in depth at the system, to see whether the old object was scanned after enabling the replication configuration. That gives a condition to stop on, but it is probably not realistic at the end-to-end level. Do we think the scenario "Objects created before setting up replication should not be replicated automatically" can only be tested at the Zenko level, and not in Backbeat as a functional test?
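
For reference, the "never happens" check could be expressed as a bounded negative wait (sketch only, assuming a hypothetical getStatus helper that reads the object's replication status):

```ts
// Bounded negative check: poll for a fixed window and fail if the object ever
// enters the replication pipeline; otherwise the status should stay undefined.
async function assertNeverReplicated(
    getStatus: () => Promise<string | undefined>,
    windowMs = 30_000,
): Promise<void> {
    const deadline = Date.now() + windowMs;
    while (Date.now() < deadline) {
        const status = await getStatus();
        if (status !== undefined) {
            throw new Error(`object unexpectedly entered replication: ${status}`);
        }
        await new Promise(resolve => setTimeout(resolve, 1_000));
    }
}
```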

Contributor Author

I think if we wanted to be perfect we should use the same timeout everywhere. In practice I've found the failure case takes a bit more time to be marked as a failure (versus the success case), and the 'never happen' case doesn't really require a lot of waiting. I think the timings are good this way.

About the other point: technically, by looking inside the backbeat codebase, we should be able to assert that replication doesn't happen for objects created before the replication configuration is set. But it's quite hard to determine that something does not happen, because since it doesn't happen there is nothing to see. The test is still nice to enforce that the behavior remains the same over time, as a modification to any service could break this later.

Contributor

Technically the "decision" to replicate is made by cloudserver, and reflected by the status being set to PENDING only when replication is needed: so just checking this should be enough, with the current design.
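
So the end-to-end check can stay cheap: a single HeadObject on the pre-existing object, asserting that no replication status was ever set (sketch only, using the plain AWS SDK v3 rather than the repo's helpers):

```ts
import { S3Client, HeadObjectCommand } from '@aws-sdk/client-s3';

// For an object created before the replication configuration, cloudserver never
// sets ReplicationStatus, so its absence is the whole assertion.
async function assertNoReplicationDecision(s3: S3Client, Bucket: string, Key: string): Promise<void> {
    const head = await s3.send(new HeadObjectCommand({ Bucket, Key }));
    if (head.ReplicationStatus !== undefined) {
        throw new Error(`unexpected replication status: ${head.ReplicationStatus}`);
    }
}
```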

@@ -89,13 +89,16 @@ When('the job to replicate existing objects with status {string} is executed',
await createAndRunPod(this, podManifest);
});

Then('the object should eventually {string} replicated', { timeout: 360_000 },
async function (this: Zenko, replicate: 'be' | 'fail to be') {
Then('the object replication should {string} within {int} seconds', { timeout: 600_000 },
Contributor

Do we actually need to increase the timeout from 6 to 10 minutes?
Did you experience cases of timeout?

Contributor Author

I found that sometimes it couldn't complete within 200/300 seconds, so I increased it; hopefully the larger timeout won't cause any problems.
