ai/live: Fix race on publish close, advance trickle seq on empty writes #3824
+56 −8
The change in [1] introduced a race condition and uncovered a separate issue, which together led to dozens of rapid-fire GET calls from the orchestrator's local subscriber for the same nonexistent segment.
The race condition was this: on deleting a channel, the publisher first closes the preconnect to clean up its own state, which triggers a segment close on the server with zero bytes written. Then the publisher DELETEs the channel itself.
However, on closing a zero-byte segment, the server did not increment the sequence number for the next expected segment. This caused two problems:

1. Subscribers that set the seq to the "next" segment (-1) would keep getting the same zero-byte segment back until the channel was deleted. This is what happened to us: the orchestrator runs a local trickle subscriber that continuously fetches the leading-edge segment, but the fetch would immediately return with zero bytes just before the channel was deleted. Because this is a local subscriber, it repeated this dozens of times until the DELETE got through.
2. Subscribers that handle their own sequence numbering (e.g., incrementing it after a successful read; there is nothing inherently wrong with a zero-byte segment) would see an error when fetching the next segment in the sequence, since the server does not allow preconnects more than one segment ahead.
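To make the first failure mode concrete, here is a toy model of the server's expected-sequence tracking. The names (`segmentServer`, `closeSegment`, `resolve`) are illustrative, not the actual go-livepeer trickle API: if a zero-byte close does not advance the sequence, a subscriber requesting the leading edge (-1) resolves to the same empty segment on every retry.

```go
package main

import "fmt"

// segmentServer is a toy model of the trickle server's expected-sequence
// tracking; all names are illustrative, not the real go-livepeer API.
type segmentServer struct {
	nextSeq        int  // next segment the server expects from the publisher
	advanceOnEmpty bool // the fix: treat zero-byte segments as valid
}

// closeSegment runs when a publisher finishes (or abandons) a segment.
func (s *segmentServer) closeSegment(seq, bytesWritten int) {
	if seq != s.nextSeq {
		return
	}
	if bytesWritten > 0 || s.advanceOnEmpty {
		s.nextSeq++
	}
}

// resolve maps a subscriber's requested seq to a concrete one;
// -1 means "the leading edge".
func (s *segmentServer) resolve(seq int) int {
	if seq < 0 {
		return s.nextSeq
	}
	return seq
}

func main() {
	// Buggy behavior: a zero-byte close leaves the leading edge stuck,
	// so every retry of seq -1 fetches the same empty segment.
	buggy := &segmentServer{}
	buggy.closeSegment(0, 0)
	fmt.Println(buggy.resolve(-1)) // prints 0: stuck on the empty segment

	// Fixed behavior: the zero-byte close advances the sequence.
	fixed := &segmentServer{advanceOnEmpty: true}
	fixed.closeSegment(0, 0)
	fmt.Println(fixed.resolve(-1)) // prints 1: moved past it
}
```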
Address this in two ways:

1. Have the publisher delete the channel, then close its own preconnect, rather than the other way around. This addresses the immediate issue of repeated retries: because the channel is marked as deleted first, any later retries see a nonexistent channel.
2. Treat zero-byte segments as valid on the server and increment the expected sequence number once a zero-byte segment closes. This alone would have prevented the issue even without the publisher fix (at the expense of one more preconnect), and it lets us gracefully handle non-updated publishers or other scenarios that produce similar behavior.
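The publisher-side ordering fix can be sketched as follows. This is a minimal model, not the actual go-livepeer publisher: `publisher`, `deleteChannel`, and `closePreconnect` are hypothetical names standing in for the real teardown steps.

```go
package main

import "fmt"

// publisher is a toy model of the trickle publisher's teardown path;
// the names are illustrative, not the real go-livepeer API.
type publisher struct {
	deleteChannel   func() error // DELETE the channel on the server
	closePreconnect func() error // close the local preconnect state
}

// Close tears down the channel. Deleting the channel before closing the
// preconnect means any in-flight subscriber retry observes a nonexistent
// channel instead of repeatedly fetching the same zero-byte segment.
func (p *publisher) Close() error {
	if err := p.deleteChannel(); err != nil {
		return err
	}
	return p.closePreconnect()
}

func main() {
	var order []string
	p := &publisher{
		deleteChannel:   func() error { order = append(order, "delete"); return nil },
		closePreconnect: func() error { order = append(order, "close"); return nil },
	}
	_ = p.Close()
	fmt.Println(order) // prints [delete close]
}
```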
[1] #3802