Ensure lifecycle tasks wait for messages to be pushed #2603

Open

wants to merge 2 commits into development/8.6
Conversation

@francoisferrand (Contributor) commented Dec 18, 2024

The lifecycle task pushes new entries to the bucket topic, but may commit its offset before those entries are actually pushed, which allows multiple lifecycle iterations to run in parallel.

Issue: BB-641
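
For illustration only (not the actual patch): a minimal sketch of the pattern the title describes, waiting for the producer's delivery report before letting the task complete; the `sendToTopic` signature used here is assumed from the discussion below, not copied from the Backbeat source.

```js
// Hedged sketch: wait for the delivery report before signalling completion,
// so the consumer offset is only committed once the entry really reached
// the bucket topic. `producer.sendToTopic(topic, entries, cb)` is an
// assumption, not the actual Backbeat API.
function pushEntryAndWait(producer, topic, entry, done) {
    producer.sendToTopic(topic, [entry], err => {
        // The callback fires when Kafka returns the delivery report,
        // i.e. the broker has acknowledged the message.
        if (err) {
            return done(err);
        }
        // Only now let the lifecycle task finish, which keeps the "slot"
        // occupied until the entry is fully pushed.
        return done();
    });
}
```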

@bert-e (Contributor) commented Dec 18, 2024

Hello francoisferrand,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description |
|------|-------------|
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. |
| /bypass_author_approval | Bypass the pull request author's approval |
| /bypass_build_status | Bypass the build and test status |
| /bypass_commit_size | Bypass the check on the size of the changeset |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix |
| /bypass_jira_check | Bypass the Jira issue check |
| /bypass_peer_approval | Bypass the pull request peers' approval |
| /bypass_leader_approval | Bypass the pull request leaders' approval |
| /approve | Instruct Bert-E that the author has approved the pull request. |
| /create_pull_requests | Allow the creation of integration pull requests. |
| /create_integration_branches | Allow the creation of integration branches. |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers |
| /wait | Instruct Bert-E not to run until further notice. |
Available commands

| name | description |
|------|-------------|
| /help | Print Bert-E's manual in the pull request. |
| /status | Print Bert-E's current status in the pull request |
| /clear | Remove all comments from Bert-E from the history |
| /retry | Re-start a fresh build |
| /build | Re-start a fresh build |
| /force_reset | Delete integration branches & pull requests, and restart the merge process from the beginning. |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. |

Status report is not available.

codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 5 lines in your changes missing coverage. Please review.

Project coverage is 55.34%. Comparing base (61b9e9a) to head (0c257d7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| extensions/lifecycle/tasks/LifecycleTaskV2.js | 88.46% | 3 Missing ⚠️ |
| extensions/lifecycle/tasks/LifecycleTask.js | 77.77% | 2 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| extensions/lifecycle/tasks/LifecycleTask.js | 83.30% <77.77%> (+0.11%) ⬆️ |
| extensions/lifecycle/tasks/LifecycleTaskV2.js | 89.74% <88.46%> (+0.85%) ⬆️ |

... and 4 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| Bucket Notification | 18.51% <ø> (ø) |
| Core Library | 61.90% <ø> (-0.23%) ⬇️ |
| Ingestion | 67.53% <ø> (ø) |
| Lifecycle | 47.15% <85.71%> (+0.24%) ⬆️ |
| Oplog Populator | 84.20% <ø> (ø) |
| Replication | 51.01% <ø> (-0.04%) ⬇️ |
| Bucket Scanner | 85.60% <ø> (ø) |
@@                 Coverage Diff                 @@
##           development/8.6    #2603      +/-   ##
===================================================
- Coverage            55.40%   55.34%   -0.06%     
===================================================
  Files                  198      198              
  Lines                12915    12928      +13     
===================================================
  Hits                  7155     7155              
- Misses                5750     5763      +13     
  Partials                10       10              
| Flag | Coverage Δ |
|---|---|
| api:retry | 9.62% <0.00%> (-0.01%) ⬇️ |
| api:routes | 9.51% <0.00%> (-0.01%) ⬇️ |
| bucket-scanner | 85.60% <ø> (ø) |
| ingestion | 12.45% <0.00%> (-0.02%) ⬇️ |
| lib | 7.51% <0.00%> (-0.01%) ⬇️ |
| lifecycle | 19.44% <85.71%> (+0.08%) ⬆️ |
| notification | 0.88% <0.00%> (-0.01%) ⬇️ |
| replication | 18.87% <0.00%> (-0.13%) ⬇️ |
| unit | 5.13% <0.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

@scality deleted a comment from bert-e Dec 19, 2024
@bert-e (Contributor) commented Dec 19, 2024

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

allSettled does not follow the usual fulfilment pattern: it will never
reject, and always fulfils with an array of the results of each
promise.

This is not an issue in the case of lifecycle, where we actually ignore
all errors; but it makes the code look inconsistent, as it suggests
errors are possible but does not handle them.

To avoid future issues, add proper processing of the results of
allSettled to build a single error when appropriate.

Issue: BB-641
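
A minimal sketch of the kind of result processing described above (hypothetical helper name, not the actual patch):

```js
// Collapse Promise.allSettled() results into a single error when any
// promise was rejected, since allSettled() itself never rejects.
async function allSettledOrThrow(promises) {
    const results = await Promise.allSettled(promises);
    const errors = results
        .filter(res => res.status === 'rejected')
        .map(res => res.reason);
    if (errors.length > 0) {
        // Build one aggregated error out of all the rejection reasons.
        throw new Error(`${errors.length} operation(s) failed: ` +
            errors.map(e => e && e.message).join('; '));
    }
    // All promises fulfilled: return their values, as Promise.all would.
    return results.map(res => res.value);
}
```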
@nicolas2bert (Contributor) left a comment

Waiting for the Kafka message delivery report callback may not be ideal for at least two reasons:

  1. The message is already considered locally consumed even before it has reached the queue processor queue.
    The lifecycle conductor relies on KafkaBacklogMetrics to calculate the lag.
    KafkaBacklogMetrics uses consumer.position() (for consumers) and queryWatermarkOffsets() (for producers/clients) to fetch offset info, then writes those offsets into Zookeeper.
    consumer.position() in node-rdkafka returns the consumer’s current read position for each assigned partition, i.e. the local offset that the consumer is about to read next (offset of the last consumed message + 1). It does not return the “committed offset” that Kafka itself stores for the consumer group.

A message is considered “locally consumed” as soon as the consumer (via librdkafka under the hood) has handed that message to BackbeatConsumer. Basically, right after consumer.consume(...) is called and the message is delivered to your callback or event handler, the offset for that message is advanced in the local consumer state.

  2. It will slow down our process since the sendToTopic callback is called when Kafka returns the delivery report for this message. Delivery reports are invoked asynchronously after the producer has handed off messages to Kafka. If we decide to block or wait synchronously for that delivery report every time we send a message, it will impact performance and throughput.
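
For illustration (not from the PR): a small node-rdkafka sketch of the distinction drawn in point 1 between the local read position and the committed offsets; the configuration values and topic name are placeholders.

```js
const kafka = require('node-rdkafka');

// Placeholder configuration, not taken from Backbeat.
const consumer = new kafka.KafkaConsumer({
    'group.id': 'lifecycle-bucket-processor',
    'metadata.broker.list': 'localhost:9092',
}, {});

function logOffsets() {
    // position(): the local read position, i.e. offset of the last message
    // handed to the application + 1. It advances as soon as a message is
    // "locally consumed", regardless of whether it was committed.
    const positions = consumer.position(consumer.assignments());

    // committed(): the offsets actually committed for the consumer group,
    // which is what lags behind until processing completes and commits.
    consumer.committed(consumer.assignments(), 5000, (err, committed) => {
        if (err) {
            return console.error('failed to fetch committed offsets', err);
        }
        return console.log({ positions, committed });
    });
}

consumer.connect();
consumer.on('ready', () => {
    consumer.subscribe(['bucket-tasks']); // hypothetical topic name
    consumer.consume();
    setInterval(logOffsets, 10000);
});
```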

@francoisferrand (Contributor, Author)

> The message is already considered locally consumed even before it has reached the queue processor queue

That is a fair point (and may actually help on another issue), but I don't really see how this is a problem for this change: handling an entry by the bucket processor typically takes at least one second already (scanning & checking the state of every object), so we face this discrepancy anyway...

This change is simply about ensuring that we keep the "slot" until the entry is "fully" processed, instead of leaving many things pending: which can be an issue especially since we are pushing continuation messages while listing.

What am I missing here?

> If we decide to block or wait synchronously for that delivery report every time we send a message, it will impact performance and throughput.

In theory, yes.
In practice, however, since we are processing up to 1000 entries at a time, I wonder whether this makes a real impact: most of the reports would be received in the time we process each entry... (except for very small buckets, in which case throughput may not be so important)

It is certainly a trade-off, but consistent processing seems important as well: or do you think it is completely safe to leave all these messages dangling and already start processing the next message(s)?
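
Purely to illustrate the trade-off being discussed (hypothetical helper names, not Backbeat APIs): fire the produce for each entry up front and only await the delivery reports at the end of the batch, so most reports arrive while other entries are still being processed.

```js
// `sendToTopicAsync` and `processEntry` are hypothetical stand-ins.
async function processBatch(entries, processEntry, sendToTopicAsync) {
    const deliveries = [];
    for (const entry of entries) {
        // Kick off the produce without waiting for its delivery report yet.
        deliveries.push(sendToTopicAsync(entry));
        // Processing the entry (listing, per-object checks...) overlaps
        // with the in-flight delivery report.
        await processEntry(entry);
    }
    // Keep the "slot" until every pushed message has been acknowledged.
    await Promise.all(deliveries);
}
```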

@nicolas2bert (Contributor)

> What am I missing here?

My understanding was that the goal of this PR is to prevent multiple lifecycle iterations (triggered by Conductor) from running in parallel. I just pointed out that the lag is based on the “locally consumed” offset rather than on a processed or stored offset. So even if we wait for an entry to be fully processed, it won't stop the bucket-lifecycle topic lag from being zero while there are still other bucket messages in the pipeline.

> most of the reports would be received in the time we process each entry

Regarding the internal lifecycle listing, it does not necessarily return 1000 objects; it only includes those that meet the specified criteria (prefix, age, etc.) from the next 10,000 entries. We might even end up with a listing response containing only a few objects, or none at all.
NOTE: This 10,000-entry limit helps avoid placing excessive load on Metadata by preventing the evaluation of an unbounded number of entries.
