Ensure lifecycle tasks wait for messages to be pushed #2603
base: development/8.6
Conversation
Hello francoisferrand, my role is to assist you with the merge of this pull request.
Available options
Available commands
Status report is not available.
Force-pushed from 389da95 to ca50af2
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
... and 4 files with indirect coverage changes
@@ Coverage Diff @@
## development/8.6 #2603 +/- ##
===================================================
- Coverage 55.40% 55.34% -0.06%
===================================================
Files 198 198
Lines 12915 12928 +13
===================================================
Hits 7155 7155
- Misses 5750 5763 +13
Partials 10 10
Flags with carried forward coverage won't be shown.
Request integration branches
Waiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command:
Alternatively, the
Force-pushed from ca50af2 to e3570fa
Force-pushed from e3570fa to 9db0093
allSettled does not follow the usual fulfil pattern: it never rejects, and always fulfils with an array of the results of each promise. This is not an issue in the case of lifecycle, where we actually ignore all errors; but it makes the code look inconsistent, as it suggests errors are possible yet does not handle them. To avoid future issues, add proper processing of the results of allSettled to build a single error when appropriate. Issue: BB-641
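The aggregation described in the commit message could look like the following sketch. This is illustrative, not the actual Backbeat code: the helper names (`aggregateSettled`, `runAll`) and the `causes` property are assumptions.

```javascript
// Sketch: collapse Promise.allSettled results into a single error
// when any promise rejected, so callers see the usual fulfil/reject
// pattern instead of always fulfilling.
function aggregateSettled(results) {
    const errors = results
        .filter(r => r.status === 'rejected')
        .map(r => r.reason);
    if (errors.length === 0) {
        return null;
    }
    // Combine all rejection reasons into one error (hypothetical shape).
    const err = new Error(`${errors.length} task(s) failed: ` +
        errors.map(e => e.message).join('; '));
    err.causes = errors;
    return err;
}

async function runAll(promises) {
    const results = await Promise.allSettled(promises);
    const err = aggregateSettled(results);
    if (err) {
        throw err; // rejects like Promise.all would, but only after all settle
    }
    return results.map(r => r.value);
}
```

Unlike `Promise.all`, this still waits for every promise to settle before surfacing the combined failure.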
Force-pushed from 9db0093 to 0c257d7
Waiting for the Kafka message delivery report callback may not be ideal, for at least two reasons:
1. The message is already considered locally consumed even before it reaches the queue processor queue. The lifecycle conductor relies on KafkaBacklogMetrics to calculate the lag. KafkaBacklogMetrics uses consumer.position() (for consumers) and queryWatermarkOffsets() (for producers/clients) to fetch offset info, then writes those offsets into Zookeeper. consumer.position() in node-rdkafka returns the consumer's current read position for each assigned partition, i.e. the local offset that the consumer is about to read next (offset of the last consumed message + 1). It does not return the "committed offset" that Kafka itself stores for the consumer group. A message is considered "locally consumed" as soon as the consumer (via librdkafka under the hood) has handed that message to BackbeatConsumer. Basically, right after consumer.consume(...) is called and the message is shown in your callback or event handler, the offset for that message is advanced in the local consumer state.
2. It will slow down our process, since the sendToTopic callback is called when Kafka returns the delivery report for this message. Delivery reports are invoked asynchronously after the producer has handed off messages to Kafka. If we decide to block or wait synchronously for that delivery report every time we send a message, it will impact performance and throughput.
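To make the first point concrete, here is a hedged sketch of how a lag figure could be derived from the two offset sources the comment mentions. The input shapes mimic what node-rdkafka's `consumer.position()` returns (an array of `{topic, partition, offset}`) and what `queryWatermarkOffsets()` would provide per partition; the function name and the `highWatermarks` map are illustrative, not Backbeat's actual code.

```javascript
// Sketch: lag per the "locally consumed" position, not the committed
// offset. position() is "last consumed offset + 1", so the per-partition
// lag is simply highWatermark - position.
function computeLag(positions, highWatermarks) {
    return positions.reduce((total, p) => {
        const hw = highWatermarks[`${p.topic}:${p.partition}`];
        if (hw === undefined || p.offset === undefined) {
            return total; // no offset info for this partition yet
        }
        return total + Math.max(0, hw - p.offset);
    }, 0);
}
```

Because the position advances as soon as a message is handed to the consumer, this lag can read zero while entries are still being processed, which is the discrepancy the reviewer is pointing out.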
That is a fair point (and may actually help with another issue), but I don't really see how this is a problem for this change: handling an entry in the bucket processor typically takes at least one second already (scanning and checking the state of every object), so we face this discrepancy anyway... This change is simply about ensuring that we keep the "slot" until the entry is "fully" processed, instead of leaving many things pending: which can be an issue, especially since we are pushing listing continuation messages. What am I missing here?
In theory, yes; it is certainly a trade-off, but consistent processing seems important as well: or do you think it is completely safe to leave all these messages dangling and already start processing the next message(s)?
My understanding was that the goal of this PR is to prevent multiple lifecycle iterations (triggered by the Conductor) from running in parallel. I just pointed out that the lag is based on the "locally consumed" offset rather than on a processed or stored offset. So even if we wait for an entry to be fully processed, it won't stop the bucket-lifecycle topic lag from being zero while there are still other bucket messages in the pipeline.
Regarding the internal lifecycle listing, it does not necessarily return 1,000 objects; it only includes those that meet the specified criteria (prefix, age, etc.) from the next 10,000 entries. We might even end up with a listing response containing only a few objects, or none at all.
Lifecycle task pushes new entries to the bucket topic, but may commit before the entry is committed: which allows multiple lifecycle iterations to happen in parallel.
Issue: BB-641
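The ordering fix the commit message describes can be sketched as follows. This is a minimal illustration with hypothetical helper names (`processEntry`, `producer.send`, `commitOffset`), not Backbeat's actual API: the point is that the consumed offset is committed only after the producer's delivery callback confirms the entry reached the topic.

```javascript
// Sketch: wait for the delivery report before committing, so a crash
// between "produce" and "commit" cannot drop the pushed entry while
// still advancing the consumer's committed offset.
function processEntry(entry, producer, commitOffset, done) {
    // sendToTopic-style call: the callback fires when Kafka returns
    // the delivery report for this message.
    producer.send(entry, err => {
        if (err) {
            // Do not commit: the entry will be re-consumed and retried.
            return done(err);
        }
        commitOffset(entry); // safe: the message is durably in the topic
        return done(null);
    });
}
```

With this ordering, an in-flight entry keeps its "slot" until it is fully pushed, which is what prevents a second lifecycle iteration from starting on the same work.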