
Conversation


@j0sh j0sh commented Dec 6, 2025

The change in [1] introduced a bit of a race condition and uncovered a separate issue that would lead to dozens of rapid-fire GET calls from the orchestrator's local subscriber to the same nonexistent segment.

The race condition was this: on deleting a channel, the publisher first closes the preconnect to clean up its own state, which triggers a segment close on the server with zero bytes written. Then the publisher DELETEs the channel itself.
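
For illustration, here is a minimal Go sketch of that teardown ordering, using toy types that stand in for the publisher (the names are assumptions, not the actual trickle code):

```go
package main

import "fmt"

// Toy stand-ins for the publisher's state; names are illustrative only.
type segment struct{ bytesWritten int }

type publisher struct {
	channel    string
	preconnect *segment // pending preconnected segment
}

// closePreconnect abandons the pending segment; on the server side this
// shows up as a segment closing with zero bytes written.
func (p *publisher) closePreconnect() {
	if p.preconnect != nil {
		fmt.Printf("segment closed, %d bytes written\n", p.preconnect.bytesWritten)
		p.preconnect = nil
	}
}

// deleteChannel issues the DELETE for the channel itself.
func (p *publisher) deleteChannel() {
	fmt.Println("DELETE", p.channel)
}

func main() {
	p := &publisher{channel: "demo", preconnect: &segment{}}

	// Old ordering: local cleanup first, DELETE second. In the window between
	// the two calls the channel still exists on the server, but its leading-edge
	// segment is already closed with zero bytes written.
	p.closePreconnect()
	p.deleteChannel()
}
```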

However, when a zero-byte segment closed, the server did not increment the sequence number for the next expected segment. This caused two problems:

  • Subscribers that set the seq to the "next" segment (-1) would keep getting the same zero-byte segment back until the channel was deleted (see the sketch after this list).

    This is what happened to us: the orch runs a trickle local subscriber that continuously fetches the leading-edge segment, but the fetch would immediately return with zero bytes just before the channel was deleted. Because this is a local subscriber, it repeated this dozens of times until the DELETE went through.

  • Subscribers that handle their own sequence numbering (e.g., incrementing it after a successful read; there is nothing inherently wrong with a zero-byte segment) would see an error when they fetched the next segment in the sequence, since the server does not allow preconnects more than one segment ahead.
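
Here is a toy model of the leading-edge lookup, with assumed names rather than the real server code, showing why a stalled sequence number turns the local subscriber into a tight GET loop:

```go
package main

import "fmt"

// Toy server state; names are assumptions for illustration only.
type server struct {
	expectedSeq int         // next segment the server expects to be written
	closed      map[int]int // seq -> bytes written, for closed segments
}

// resolve maps the special seq of -1 ("give me the leading edge") to a
// concrete segment number.
func (s *server) resolve(seq int) int {
	if seq == -1 {
		return s.expectedSeq
	}
	return seq
}

func main() {
	// Segment 7 closed with zero bytes, but expectedSeq was never advanced.
	s := &server{expectedSeq: 7, closed: map[int]int{7: 0}}

	// The orchestrator's local subscriber keeps asking for the leading edge.
	// Every request resolves to the same already-closed segment and returns
	// immediately with zero bytes, so it spins until the DELETE lands.
	for i := 0; i < 3; i++ {
		seg := s.resolve(-1)
		fmt.Printf("GET seq=-1 -> segment %d (%d bytes, already closed)\n", seg, s.closed[seg])
	}
}
```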

Address this in two ways:

  • Have the publisher delete the channel and then close its own preconnect, rather than the other way around. This addresses the immediate issue of repeated retries: because the channel is marked as deleted first, any later retries see a nonexistent channel.

  • Treat zero-byte segments as valid on the server and increment the expected sequence number once a zero-byte segment closes. This would also have prevented the issue even without the publisher fix (at the expense of one more preconnect) and lets us gracefully handle non-updated publishers or scenarios that exhibit similar behavior. Both changes are sketched below.
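
A sketch of both changes, again with toy types and assumed names rather than the actual implementation:

```go
package main

import "fmt"

// Toy types again; names are illustrative, not the actual trickle code.
type segment struct{}

type publisher struct {
	channel    string
	preconnect *segment
}

type server struct {
	expectedSeq int
	closed      map[int]int
}

func (p *publisher) deleteChannel()   { fmt.Println("DELETE", p.channel) }
func (p *publisher) closePreconnect() { p.preconnect = nil; fmt.Println("preconnect closed") }

// Fix 1: delete the channel first, then tear down local state, so any
// in-flight subscriber retry sees a nonexistent channel instead of the
// zero-byte segment.
func (p *publisher) close() {
	p.deleteChannel()
	p.closePreconnect()
}

// Fix 2: a zero-byte segment is still a valid segment, so advance the
// expected sequence number when it closes. A leading-edge (-1) request then
// resolves to a fresh segment rather than re-reading the closed one.
func (s *server) onSegmentClose(seq, bytesWritten int) {
	s.closed[seq] = bytesWritten
	if seq == s.expectedSeq {
		s.expectedSeq++ // advance even when bytesWritten == 0
	}
}

func main() {
	p := &publisher{channel: "demo", preconnect: &segment{}}
	p.close()

	s := &server{expectedSeq: 7, closed: map[int]int{}}
	s.onSegmentClose(7, 0)
	fmt.Println("expected seq is now", s.expectedSeq) // 8
}
```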

[1] #3802

@j0sh j0sh requested review from leszko, mjh1 and victorges December 6, 2025 01:34
@github-actions github-actions bot added the go Pull requests that update Go code label Dec 6, 2025
