Infinite retries on failed requests #511
Open — Doy-lee wants to merge 39 commits into oxen-io:dev from Doy-lee:doyle-swarms
Conversation
Force-pushed from f03ef80 to 211e6ab
Upcoming Session Router usage will use IPv6 (internally) to talk to relays, but currently relays are only listening on the IPv4 0.0.0.0 any address. This adds a second HTTPS and QUIC listener on the IPv6 any address (::) so that it will be accessible over session router.
- Updates HTTP server internals to properly handle IPv6 addresses, including updating the rate limiter to handle them.
- Switch to using a dual-stack socket with uWebSockets (HTTP) because it forces dual-stack on, and was failing to bind with separate IPv4/IPv6 sockets.
- Make QUIC match dual-stack (although it doesn't strictly need to) for the sake of consistency.
- Change request-handling internals to use IPv6 for requests and rate limiting, representing IPv4 clients as IPv4-mapped IPv6 addresses.
Make HTTP & QUIC listen on IPv6
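The IPv4-mapped IPv6 representation used for requests and rate limiting can be sketched as below. This is an illustrative helper (`map_ipv4_to_ipv6` is an invented name, not the server's actual code): an IPv4 address `a.b.c.d` becomes `::ffff:a.b.c.d`, so a single IPv6-keyed rate limiter can cover both families.

```cpp
#include <arpa/inet.h>
#include <cstring>
#include <string>

// Map an IPv4 address string into the IPv4-mapped IPv6 range
// (::ffff:a.b.c.d). Returns an empty string on a parse failure.
std::string map_ipv4_to_ipv6(const std::string& ipv4) {
    in_addr v4{};
    if (inet_pton(AF_INET, ipv4.c_str(), &v4) != 1)
        return {};
    in6_addr v6{};           // starts all-zero
    v6.s6_addr[10] = 0xff;   // bytes 10-11 = 0xffff marks the mapped range
    v6.s6_addr[11] = 0xff;
    std::memcpy(&v6.s6_addr[12], &v4, 4);  // last 4 bytes = the IPv4 address
    char buf[INET6_ADDRSTRLEN];
    inet_ntop(AF_INET6, &v6, buf, sizeof(buf));
    return buf;
}
```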
…eful for private networks)
The setter and getter don't do anything in particular that would warrant having those functions (such as taking a lock), so we can remove the indirection and access the member directly.
…lper function
Keep the SNDataReady serialisation code compartmentalised into one function that handles both reading and writing of the data structure.
This is preliminary work to prepare for persisting the swarm members' state to disk, allowing a node to resume from its last known state. Currently, when a storage server starts up, it assumes it is joining a new swarm (or that new members are joining its swarm) and does a full message dump to those members, instead of being able to tell whether a swarm member is genuinely new or already known and then choosing a synchronisation method better suited to that scenario.
Restoring the swarm state means that if the node is an active service node, it will remember which swarm it was in. When the storage server gets restarted, it can then correctly detect whether a new SN is joining its swarm, instead of assuming all the nodes have newly joined and consequently dumping its entire SQL DB to them.
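Once the previous swarm composition is restored, deciding who needs a DB dump reduces to a set difference between the restored membership and the current one. A minimal sketch under that assumption (the function and parameter names are illustrative, not the server's actual API):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Given the swarm composition restored from disk and the swarm composition
// currently reported by the network, return only the members we have never
// seen before: these are the ones that need a DB dump.
std::vector<std::string> new_members(
        const std::set<std::string>& restored,
        const std::set<std::string>& current) {
    std::vector<std::string> fresh;
    std::set_difference(current.begin(), current.end(),
                        restored.begin(), restored.end(),
                        std::back_inserter(fresh));
    return fresh;
}
```

Without the restored set, `restored` is effectively empty and every current member looks new, which is exactly the full-dump behaviour the commit describes.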
When a node initiates a recursive swarm request, the initial node awaits the responses from all other nodes before returning to the client. Child swarm nodes that fail to receive the request are stored in a retryable request queue to be re-attempted later. This queue is flushed every 3s by piggybacking onto the swarm member check function that is periodically invoked by OMQ.
… SNSerialiseResult
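The queue-and-flush mechanism above might look roughly like this; `RetryQueue`, `RetryableRequest`, and their members are hypothetical names for illustration, not the PR's actual types:

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <vector>

struct RetryableRequest {
    std::string target_snode;  // swarm member that failed to receive the request
    std::string payload;       // original request body to re-send
};

// Failed propagations are appended here; the periodic swarm-member check
// (every ~3s in the PR) drains the queue and re-attempts each request.
class RetryQueue {
    std::mutex mtx_;
    std::deque<RetryableRequest> pending_;
public:
    void add(RetryableRequest req) {
        std::lock_guard lock{mtx_};
        pending_.push_back(std::move(req));
    }
    // Called from the periodic tick: take everything currently queued,
    // leaving the queue empty for the next interval's failures.
    std::vector<RetryableRequest> flush() {
        std::lock_guard lock{mtx_};
        std::vector<RetryableRequest> out{pending_.begin(), pending_.end()};
        pending_.clear();
        return out;
    }
};
```

Piggybacking on an existing periodic callback avoids adding another timer, at the cost of tying the retry cadence to the member-check interval.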
The comment states that only timed-out requests are retried. This is intentional: an error response (with an error code and text) may indicate that the recipient node is not in a valid state, or will never accept the request, in which case the safe default is not to retry against that node. Since all possible error states are known, it would be possible in future to handle them specifically per command; for now, a sane default is to only retry against nodes that were offline or that we failed to communicate with.
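The policy above amounts to a small predicate: retry communication failures, never explicit error responses. A hedged sketch (the enum and its members are invented names):

```cpp
// Possible outcomes of forwarding a request to a swarm member.
enum class SendResult {
    ok,              // request accepted
    timeout,         // no response: the node may simply be offline
    connect_failed,  // could not reach the node at all
    error_response,  // the node replied with an error code and text
};

// Only communication failures are retried. An explicit error response may
// mean the node will never accept this request, so retrying is unsafe.
constexpr bool should_retry(SendResult r) {
    return r == SendResult::timeout || r == SendResult::connect_failed;
}
```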
This was originally 10s and was mistakenly changed to 30s. We've reverted all the changes to the swarm member checks, so this should go back to its original value, as retries are handled in a separate subsystem instead of being intertwined with the member checks.
Move serialise result struct for retryable requests into impl file as it's only used locally.
The `new_swarm_member` flag sufficiently disambiguates nodes that need a DB dump from nodes that don't, so we no longer need the intermediate contact-details-ready state. Also remove an unused member variable.
Not sure why: the variable is used in the lambda itself but does not need to be captured, unlike the other two constexpr variables that are captured and used. Alas, CI is complaining and treating the warning as an error, so this commit removes it.
Force-pushed from 5454c37 to e2684f2
This adds a new thread that blocks on a queue holding the requests that failed to be propagated to other nodes in the swarm. This happens when a recursive swarm request is issued after the initial node has received the request. Failures in those requests are stored on the node that initiated the request and retried periodically. Retries use an exponential back-off to avoid spamming, and they are pruned once the request exceeds a specific age.
Most requests may be re-submitted well past their initial timestamp as long as they are signed. There are a few exceptions to this, such as delete-all and expire-all. These requests are still kept around and retried, but once a retried request returns an error, the retry is dropped immediately. This applies to all requests, so the retry queue is always actively managed and pruned.
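The back-off and pruning policy can be sketched as follows. The 3s base matches the periodic tick mentioned earlier, but the cap and maximum age here are illustrative values, not the PR's actual constants:

```cpp
#include <chrono>

using namespace std::chrono_literals;

// Delay before the Nth retry attempt: doubles each attempt, capped so a
// persistently failing request settles at a fixed polling interval.
// Cap of 300s is an invented value for illustration.
constexpr std::chrono::seconds retry_delay(unsigned attempt) {
    auto d = 3s;
    for (unsigned i = 0; i < attempt && d < 300s; ++i)
        d *= 2;
    return d < 300s ? d : std::chrono::seconds{300};
}

// Requests older than this are pruned from the queue outright;
// the 24h threshold is likewise an assumption.
constexpr bool expired(std::chrono::seconds age) { return age > 24h; }
```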
For infinite retries we now serialise some state to the SQL DB. We introduce a new `runtime_state` table that stores 2 BT-encoded blobs: one containing the composition of the swarms, and one containing the list of retryable requests. Storing the swarms lets the storage server remember which swarm it is in, and detect changes to its swarm across restarts. Since it can now remember and detect when its swarm has changed, it can do a DB dump to new members that joined the swarm while it was offline. This should improve the "correctness" of messages held by new members in the swarm in lieu of active syncing.
Storing the retryable requests lets the retries persist across restarts. This also helps improve the "correctness" of messages in the swarm in lieu of active syncing.
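As a rough illustration of the BT (bencode) format used for those two blobs, here is a minimal encoder for a dictionary of string values. This is purely a sketch of the wire format; the storage server has its own serialisation code, and this helper is not part of it:

```cpp
#include <map>
#include <string>

// Minimal bencode writer for string values: a string is "<len>:<bytes>",
// and a dictionary is "d...e" with keys in sorted order (std::map already
// iterates in sorted key order, as bencode requires).
std::string bt_encode(const std::map<std::string, std::string>& dict) {
    std::string out = "d";
    for (const auto& [k, v] : dict) {
        out += std::to_string(k.size()) + ":" + k;
        out += std::to_string(v.size()) + ":" + v;
    }
    return out + "e";
}
```

Being a length-prefixed binary-safe format, bencode makes the blobs cheap to write into a single SQL column and re-parse on startup.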