-
Notifications
You must be signed in to change notification settings - Fork 47
feat(Shovels): implement retry + backoff in Shovel destinations #1215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
src/lavinmq/shovel/shovel.cr
Outdated
| private def push_with_retry(path, headers, body, max_retries = 3, jitter = 0.5) | ||
| retries = 0 | ||
| while retries < max_retries | ||
| response = @client.not_nil!.post(path, headers: headers, body: body) | ||
| return true if response.success? | ||
| retries += 1 | ||
| # Exponential backoff: 2^retries seconds with jitter | ||
| base_delay = (2.0 ** retries).seconds | ||
| jitter_delay = jitter.seconds * Random.rand(0.0..1.0) | ||
| sleep base_delay + jitter_delay | ||
| end | ||
| false | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I think this could be a macro? trait? that receives the max_retries and jitter and executes a &block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, like this:
| private def push_with_retry(path, headers, body, max_retries = 3, jitter = 0.5) | |
| retries = 0 | |
| while retries < max_retries | |
| response = @client.not_nil!.post(path, headers: headers, body: body) | |
| return true if response.success? | |
| retries += 1 | |
| # Exponential backoff: 2^retries seconds with jitter | |
| base_delay = (2.0 ** retries).seconds | |
| jitter_delay = jitter.seconds * Random.rand(0.0..1.0) | |
| sleep base_delay + jitter_delay | |
| end | |
| false | |
| end | |
| private def push_with_retry(path, headers, body, max_retries = 3, jitter = 0.5.seconds, &) : Bool | |
| retries = 0 | |
| while retries < max_retries | |
| return true if yield | |
| retries += 1 | |
| # Exponential backoff: 2^retries seconds with jitter | |
| base_delay = (2.0 ** retries).seconds | |
| jitter_delay = jitter * rand(1.0) | |
| sleep base_delay + jitter_delay | |
| end | |
| false | |
| end |
|
Have not looked closely at the code, but does this adress issue #1357 aswell? |
|
@kickster97 Not related. It was meant for only the HTTP retries |
carlhoerberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this class needs a lot of rework, for instance, what's up with @client not being assigned in the constructor? What would allow us to get rid of the not_nil! anti-pattern.
src/lavinmq/shovel/shovel.cr
Outdated
| "/" | ||
| end | ||
| response = c.post(path, headers: headers, body: msg.body_io) | ||
| success = push_with_retry(path, headers, msg.body_io, max_retries: 3, jitter: 0.5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
push_with_retry makes sense for AckMode::OnConfirm and AckMode::OnPublish but NoAck is more like "fire-and-forget" so there we don't care about the results.
What I mean is that we should only retry on OnConfirm and OnPublish
| require "./spec_helper" | ||
| require "../src/lavinmq/shovel/shovel_retrier" | ||
|
|
||
| describe LavinMQ::Shovel::Retrier do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It got me wondering wether Retrier shouldn't just be some sort of Circuit Breaker...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain some more what you are thinking here? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@snichme If we are doing HTTP Retries, there's a high possibility that blindly retrying is not the best approach. What if the user has N HTTP shovels, and their downstreams start failing?
This behavior of blindly retrying could just make the situation worse...And it's not a good UX to simply have the user deleting/pausing shovels one by one? (on the other hand for that specific scenario, we could make it simple as having a Pause All sorta button)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha agree.
I like it. I would prefer to have a checkbox next to each shovel, so you can select some of them and have batch operations on the selected ones, like Pause, Delete etc.
Like we have with Queues and Streams.
But this should be another PR IMO. so this wont run away in size or time
| name, | ||
| uri, | ||
| Shovel::HTTPDestinationParameters.from_parameters(config), | ||
| ack_mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/lavinmq/shovel/shovel.cr
Outdated
| jitter: parameters["dest-jitter"]?.try &.as_f? || 1.0, | ||
| backoff: parameters["dest-backoff"]?.try &.as_f? || 2.0, | ||
| timeout: parameters["dest-timeout"]?.try &.as_f? || 10.0, | ||
| max_retries: parameters["dest-max-retries"]?.try &.as_i? || 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 ?.try may fail to transform user input like 1.0 or 1 between Float and Integer. Not sure if ?.try is the best idea here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meaning, it could be misleading for the user if the Shovel Parameter accepts their input but behind the scenes, it uses the default values. Notice it via Spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a reoccurring thing, how and when to validate input data and also handle it when it's invalid.
What you have here is common pattern that we try to parse the input and if it fails we use a default value, not saying it's the best since the shovel might end up being configured in a wrong way. But trying to find a common solution to all input is also difficult. @spuun did some refactoring on validating input for a queue here: https://github.com/cloudamqp/lavinmq/blob/main/src/lavinmq/amqp/queue/queue.cr#L27 might give some inspiration.
|
Draft PR but still something that will impact this PR if merged before: #1591 I can help rebase if needed. |
|
@snichme Ok. I am still waiting on that PR to be merged so I can push this further. |
WHAT is this pull request doing?
dest-backoffdest-jitterdest-timeoutdest-max-retriesno_ackmode into consideration - as all HTTPDestinations were defaulted as OnConfirm/OnPublish.HOW can this pull request be tested?
Specs:
RFC
?.tryis misleading. Is there a better way to handle it?Closes #1137