Ensure lifecycle tasks wait for messages to be pushed #2603

Open

wants to merge 2 commits into development/8.6
Conversation

@francoisferrand (Contributor) commented Dec 18, 2024

The lifecycle task pushes new entries to the bucket topic, but may commit its offset before those entries are actually pushed, which allows multiple lifecycle iterations to run in parallel.

Issue: BB-641
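
For illustration only (not the actual patch): a minimal sketch of the pattern the title describes, waiting for the producer's delivery report before letting the task complete; the `sendToTopic` signature used here is assumed from the discussion below, not copied from the Backbeat source.

```js
// Hedged sketch: wait for the delivery report before signalling completion,
// so the consumer offset is only committed once the entry really reached
// the bucket topic. `producer.sendToTopic(topic, entries, cb)` is an
// assumption, not the actual Backbeat API.
function pushEntryAndWait(producer, topic, entry, done) {
    producer.sendToTopic(topic, [entry], err => {
        // The callback fires when Kafka returns the delivery report,
        // i.e. the broker has acknowledged the message.
        if (err) {
            return done(err);
        }
        // Only now let the lifecycle task finish, which keeps the "slot"
        // occupied until the entry is fully pushed.
        return done();
    });
}
```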

@bert-e (Contributor) commented Dec 18, 2024

Hello francoisferrand,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description |
|------|-------------|
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. |
| /bypass_author_approval | Bypass the pull request author's approval |
| /bypass_build_status | Bypass the build and test status |
| /bypass_commit_size | Bypass the check on the size of the changeset |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix |
| /bypass_jira_check | Bypass the Jira issue check |
| /bypass_peer_approval | Bypass the pull request peers' approval |
| /bypass_leader_approval | Bypass the pull request leaders' approval |
| /approve | Instruct Bert-E that the author has approved the pull request. |
| /create_pull_requests | Allow the creation of integration pull requests. |
| /create_integration_branches | Allow the creation of integration branches. |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers |
| /wait | Instruct Bert-E not to run until further notice. |
Available commands

| name | description |
|------|-------------|
| /help | Print Bert-E's manual in the pull request. |
| /status | Print Bert-E's current status in the pull request |
| /clear | Remove all comments from Bert-E from the history |
| /retry | Re-start a fresh build |
| /build | Re-start a fresh build |
| /force_reset | Delete integration branches & pull requests, and restart the merge process from the beginning. |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. |

Status report is not available.

codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 5 lines in your changes missing coverage. Please review.

Project coverage is 55.34%. Comparing base (61b9e9a) to head (0c257d7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| extensions/lifecycle/tasks/LifecycleTaskV2.js | 88.46% | 3 Missing ⚠️ |
| extensions/lifecycle/tasks/LifecycleTask.js | 77.77% | 2 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| extensions/lifecycle/tasks/LifecycleTask.js | 83.30% <77.77%> (+0.11%) ⬆️ |
| extensions/lifecycle/tasks/LifecycleTaskV2.js | 89.74% <88.46%> (+0.85%) ⬆️ |

... and 4 files with indirect coverage changes

| Components | Coverage Δ |
|---|---|
| Bucket Notification | 18.51% <ø> (ø) |
| Core Library | 61.90% <ø> (-0.23%) ⬇️ |
| Ingestion | 67.53% <ø> (ø) |
| Lifecycle | 47.15% <85.71%> (+0.24%) ⬆️ |
| Oplog Populator | 84.20% <ø> (ø) |
| Replication | 51.01% <ø> (-0.04%) ⬇️ |
| Bucket Scanner | 85.60% <ø> (ø) |
@@                 Coverage Diff                 @@
##           development/8.6    #2603      +/-   ##
===================================================
- Coverage            55.40%   55.34%   -0.06%     
===================================================
  Files                  198      198              
  Lines                12915    12928      +13     
===================================================
  Hits                  7155     7155              
- Misses                5750     5763      +13     
  Partials                10       10              
| Flag | Coverage Δ |
|---|---|
| api:retry | 9.62% <0.00%> (-0.01%) ⬇️ |
| api:routes | 9.51% <0.00%> (-0.01%) ⬇️ |
| bucket-scanner | 85.60% <ø> (ø) |
| ingestion | 12.45% <0.00%> (-0.02%) ⬇️ |
| lib | 7.51% <0.00%> (-0.01%) ⬇️ |
| lifecycle | 19.44% <85.71%> (+0.08%) ⬆️ |
| notification | 0.88% <0.00%> (-0.01%) ⬇️ |
| replication | 18.87% <0.00%> (-0.13%) ⬇️ |
| unit | 5.13% <0.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

@scality deleted a comment from bert-e Dec 19, 2024
@bert-e (Contributor) commented Dec 19, 2024

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

allSettled does not follow the usual fulfilment pattern: it will never
reject, and always fulfils with an array of the results of each
promise.

This is not an issue in the case of lifecycle, where we actually ignore
all errors; but it makes the code look inconsistent, as it suggests
errors are possible but does not handle them.

To avoid future issues, add proper processing of the results of
allSettled to build a single error when appropriate.

Issue: BB-641
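
A minimal sketch of the kind of result processing described above (hypothetical helper name, not the actual patch):

```js
// Collapse Promise.allSettled() results into a single error when any
// promise was rejected, since allSettled() itself never rejects.
async function allSettledOrThrow(promises) {
    const results = await Promise.allSettled(promises);
    const errors = results
        .filter(res => res.status === 'rejected')
        .map(res => res.reason);
    if (errors.length > 0) {
        // Build one aggregated error out of all the rejection reasons.
        throw new Error(`${errors.length} operation(s) failed: ` +
            errors.map(e => e && e.message).join('; '));
    }
    // All promises fulfilled: return their values, as Promise.all would.
    return results.map(res => res.value);
}
```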
@nicolas2bert (Contributor) left a comment

Waiting for the Kafka message delivery report callback may not be ideal for at least two reasons:

  1. The message is already considered locally consumed even before it has reached the queue processor queue.
    The lifecycle conductor relies on KafkaBacklogMetrics to calculate the lag.
    KafkaBacklogMetrics uses consumer.position() (for consumers) and queryWatermarkOffsets() (for producers/clients) to fetch offset info, then writes those offsets into Zookeeper.
    consumer.position() in node-rdkafka returns the consumer’s current read position for each assigned partition, i.e. the local offset that the consumer is about to read next (offset of the last consumed message + 1). It does not return the “committed offset” that Kafka itself stores for the consumer group.

A message is considered “locally consumed” as soon as the consumer (via librdkafka under the hood) has handed that message to BackbeatConsumer. Basically, right after consumer.consume(...) is called and the message is delivered to your callback or event handler, the offset for that message is advanced in the local consumer state.

  2. It will slow down our process since the sendToTopic callback is called when Kafka returns the delivery report for this message. Delivery reports are invoked asynchronously after the producer has handed off messages to Kafka. If we decide to block or wait synchronously for that delivery report every time we send a message, it will impact performance and throughput.
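
For illustration (not from the PR): a small node-rdkafka sketch of the distinction drawn in point 1 between the local read position and the committed offsets; the configuration values and topic name are placeholders.

```js
const kafka = require('node-rdkafka');

// Placeholder configuration, not taken from Backbeat.
const consumer = new kafka.KafkaConsumer({
    'group.id': 'lifecycle-bucket-processor',
    'metadata.broker.list': 'localhost:9092',
}, {});

function logOffsets() {
    // position(): the local read position, i.e. offset of the last message
    // handed to the application + 1. It advances as soon as a message is
    // "locally consumed", regardless of whether it was committed.
    const positions = consumer.position(consumer.assignments());

    // committed(): the offsets actually committed for the consumer group,
    // which is what lags behind until processing completes and commits.
    consumer.committed(consumer.assignments(), 5000, (err, committed) => {
        if (err) {
            return console.error('failed to fetch committed offsets', err);
        }
        return console.log({ positions, committed });
    });
}

consumer.connect();
consumer.on('ready', () => {
    consumer.subscribe(['bucket-tasks']); // hypothetical topic name
    consumer.consume();
    setInterval(logOffsets, 10000);
});
```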

@francoisferrand (Contributor, Author)

> The message is already considered locally consumed even before it has reached the queue processor queue

That is a fair point (and may actually help on another issue), but I don't really see how this is a problem for this change: handling an entry by the bucket processor typically takes at least one second already (scanning & checking the state of every object), so we face this discrepancy anyway...

This change is simply about ensuring that we keep the "slot" until the entry is "fully" processed, instead of leaving many things pending: which can be an issue especially since we are pushing continuation messages while listing.

What am I missing here?

> If we decide to block or wait synchronously for that delivery report every time we send a message, it will impact performance and throughput.

In theory, yes.
In practice, however, since we are processing up to 1000 entries at a time, I wonder whether this makes a real impact: most of the reports would be received in the time we process each entry... (except for very small buckets, in which case throughput may not be so important)

It is certainly a trade-off, but consistent processing seems important as well: or do you think it is completely safe to leave all these messages dangling and already start processing the next message(s)?
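
Purely to illustrate the trade-off being discussed (hypothetical helper names, not Backbeat APIs): fire the produce for each entry up front and only await the delivery reports at the end of the batch, so most reports arrive while other entries are still being processed.

```js
// `sendToTopicAsync` and `processEntry` are hypothetical stand-ins.
async function processBatch(entries, processEntry, sendToTopicAsync) {
    const deliveries = [];
    for (const entry of entries) {
        // Kick off the produce without waiting for its delivery report yet.
        deliveries.push(sendToTopicAsync(entry));
        // Processing the entry (listing, per-object checks...) overlaps
        // with the in-flight delivery report.
        await processEntry(entry);
    }
    // Keep the "slot" until every pushed message has been acknowledged.
    await Promise.all(deliveries);
}
```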

@nicolas2bert (Contributor)

> What am I missing here?

My understanding was that the goal of this PR is to prevent multiple lifecycle iterations (triggered by Conductor) from running in parallel. I just pointed out that the lag is based on the “locally consumed” offset rather than on a processed or stored offset. So even if we wait for an entry to be fully processed, it won't stop the bucket-lifecycle topic lag from being zero while there are still other bucket messages in the pipeline.

> most of the reports would be received in the time we process each entry

Regarding the internal lifecycle listing, it does not necessarily return 1000 objects; it only includes those that meet the specified criteria (prefix, age, etc.) from the next 10,000 entries. We might even end up with a listing response containing only a few objects, or none at all.
NOTE: This 10,000-entry limit helps avoid placing excessive load on Metadata by preventing the evaluation of an unbounded number of entries.
