ai/live: Fix race on publish close, advance trickle seq on empty writes #3824
+56 −8
The change in [1] introduced a race condition and uncovered a separate issue, which together led to dozens of rapid-fire GET calls from the orchestrator's local subscriber for the same nonexistent segment.
The race condition was this: on deleting a channel, the publisher first closes the preconnect to clean up its own state, which triggers a segment close on the server with zero bytes written. Then the publisher DELETEs the channel itself.
However, on closing a zero-byte segment, the server did not increment the sequence number for the next expected segment. This caused two problems:

1. Subscribers that set the seq to the "next" segment (-1) would keep getting the same zero-byte segment back until the channel was deleted. This is what happened to us: the orchestrator runs a local trickle subscriber that continuously fetches the leading-edge segment, but the fetch would immediately return with zero bytes just before the channel was deleted. Because this is a local subscriber, it repeated this dozens of times until the DELETE got through.
2. Subscribers that handle their own sequence numbering (e.g., incrementing it after a successful read; there is nothing inherently wrong with a zero-byte segment) would see an error when fetching the next segment in the sequence, since the server does not allow preconnects more than one segment ahead.
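To make the first failure mode concrete, here is a toy model of the server's expected-sequence tracking. The names (`segmentServer`, `closeSegment`, `resolve`) are illustrative, not the actual go-livepeer trickle API: if a zero-byte close does not advance the sequence, a subscriber requesting the leading edge (-1) resolves to the same empty segment on every retry.

```go
package main

import "fmt"

// segmentServer is a toy model of the trickle server's expected-sequence
// tracking; all names are illustrative, not the real go-livepeer API.
type segmentServer struct {
	nextSeq        int  // next segment the server expects from the publisher
	advanceOnEmpty bool // the fix: treat zero-byte segments as valid
}

// closeSegment runs when a publisher finishes (or abandons) a segment.
func (s *segmentServer) closeSegment(seq, bytesWritten int) {
	if seq != s.nextSeq {
		return
	}
	if bytesWritten > 0 || s.advanceOnEmpty {
		s.nextSeq++
	}
}

// resolve maps a subscriber's requested seq to a concrete one;
// -1 means "the leading edge".
func (s *segmentServer) resolve(seq int) int {
	if seq < 0 {
		return s.nextSeq
	}
	return seq
}

func main() {
	// Buggy behavior: a zero-byte close leaves the leading edge stuck,
	// so every retry of seq -1 fetches the same empty segment.
	buggy := &segmentServer{}
	buggy.closeSegment(0, 0)
	fmt.Println(buggy.resolve(-1)) // prints 0: stuck on the empty segment

	// Fixed behavior: the zero-byte close advances the sequence.
	fixed := &segmentServer{advanceOnEmpty: true}
	fixed.closeSegment(0, 0)
	fmt.Println(fixed.resolve(-1)) // prints 1: moved past it
}
```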
Address this in two ways:

1. Have the publisher delete the channel, then close its own preconnect, rather than the other way around. This addresses the immediate issue of repeated retries: because the channel is marked as deleted first, any later retries see a nonexistent channel.
2. Treat zero-byte segments as valid on the server and increment the expected sequence number once a zero-byte segment closes. This alone would have prevented the issue even without the publisher fix (at the expense of one more preconnect), and it lets us gracefully handle non-updated publishers or other scenarios that produce similar behavior.
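The publisher-side ordering fix can be sketched as follows. This is a minimal model, not the actual go-livepeer publisher: `publisher`, `deleteChannel`, and `closePreconnect` are hypothetical names standing in for the real teardown steps.

```go
package main

import "fmt"

// publisher is a toy model of the trickle publisher's teardown path;
// the names are illustrative, not the real go-livepeer API.
type publisher struct {
	deleteChannel   func() error // DELETE the channel on the server
	closePreconnect func() error // close the local preconnect state
}

// Close tears down the channel. Deleting the channel before closing the
// preconnect means any in-flight subscriber retry observes a nonexistent
// channel instead of repeatedly fetching the same zero-byte segment.
func (p *publisher) Close() error {
	if err := p.deleteChannel(); err != nil {
		return err
	}
	return p.closePreconnect()
}

func main() {
	var order []string
	p := &publisher{
		deleteChannel:   func() error { order = append(order, "delete"); return nil },
		closePreconnect: func() error { order = append(order, "close"); return nil },
	}
	_ = p.Close()
	fmt.Println(order) // prints [delete close]
}
```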
[1] #3802