@aavasthy
Contributor

Pull Request Template

Description

During partition-level failover and failback under session consistency, a timing gap can cause read requests to fail with 404/1002 errors. When a partition temporarily fails over to a secondary region and later begins failing back to the primary region, the SDK’s read circuit breaker (PPCB) may start routing reads back to the primary region before it has fully caught up with the writes from the failover region. As a result, reads using session tokens from the previous write region may fail because the primary region does not yet have the corresponding session state. Since the SDK currently does not perform cross-regional retries for 404/1002 responses, these reads continue to fail until the primary region is fully synchronized. The goal is to leverage the new backend header x-ms-cosmos-hub-region-processing-only to detect such conditions and route retry requests to the correct write (hub) region, ensuring successful session-consistent reads during the failback window.
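The flow described above can be sketched as a small two-step retry hook: remember that the previous attempt failed with 404/1002, then stamp the hub-region header on the next attempt. This is a minimal, self-contained illustration, not the SDK's actual retry policy. The class name `HubRegionRetrySketch`, the method names `OnResponse` and `OnBeforeSendRequest`, and the plain dictionary used for headers are assumptions made for the example; only the header name, the 404/1002 condition, and the set-then-reset flag come from the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical stand-in for a retry policy. Only the header name and the
// 404/1002 condition are taken from the PR; everything else is illustrative.
public sealed class HubRegionRetrySketch
{
    public const string HubRegionHeader = "x-ms-cosmos-hub-region-processing-only";

    // Sub-status 1002 (read session not available), as referenced in the PR.
    private const int ReadSessionNotAvailable = 1002;

    private bool addHubRegionProcessingOnlyHeader;

    // Called with the outcome of a failed attempt.
    public void OnResponse(HttpStatusCode statusCode, int subStatusCode)
    {
        if (statusCode == HttpStatusCode.NotFound
            && subStatusCode == ReadSessionNotAvailable)
        {
            this.addHubRegionProcessingOnlyHeader = true;
        }
    }

    // Called just before the next attempt is sent.
    public void OnBeforeSendRequest(IDictionary<string, string> headers)
    {
        if (this.addHubRegionProcessingOnlyHeader)
        {
            headers[HubRegionHeader] = bool.TrueString;
            this.addHubRegionProcessingOnlyHeader = false; // reset after applying
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var policy = new HubRegionRetrySketch();
        var retryHeaders = new Dictionary<string, string>();

        // First attempt failed with 404/1002, so the retry carries the header.
        policy.OnResponse(HttpStatusCode.NotFound, 1002);
        policy.OnBeforeSendRequest(retryHeaders);

        Console.WriteLine(retryHeaders.ContainsKey(HubRegionRetrySketch.HubRegionHeader));
    }
}
```

Because the flag is cleared as soon as the header is applied, a later attempt only carries the header again if it is preceded by another 404/1002 response.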

Type of change

Please delete options that are not relevant.

  • [] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [] This change requires a documentation update

Closing issues

To automatically close an issue: closes #5440

@aavasthy aavasthy self-assigned this Oct 14, 2025
@github-actions github-actions bot left a comment

All good!

@aavasthy aavasthy marked this pull request as ready for review October 14, 2025 17:18
@aavasthy aavasthy changed the title from "[Per Partition Automatic Failover] Use Hub Region Processing Only While Routing Requests Failed with 404/1002." to "Per Partition Automatic Failover: Adds Hub Region Processing Only While Routing Requests Failed with 404/1002." Oct 14, 2025
    && subStatusCode == SubStatusCodes.ReadSessionNotAvailable)
{
    this.addHubRegionProcessingOnlyHeader = true;
Contributor

The addHubRegionProcessingOnlyHeader flag is set for every 404/1002 (read session not available) response, so the header would be added to every 404/1002 retry. Is this expected?

Contributor Author

yes

ananth7592 previously approved these changes Oct 14, 2025
Contributor

@ananth7592 ananth7592 left a comment

Approved with a comment

}

if (statusCode == HttpStatusCode.NotFound
    && subStatusCode == SubStatusCodes.ReadSessionNotAvailable)
Member

@xinlian12 xinlian12 Oct 23, 2025

Also, please double-check: is this change targeted at multi-master (MM) accounts as well? With MM, writes can happen in any region, so enabling this for MM might cause a regression.

Contributor Author

Confirmed with the backend team: this change is not intended to be used in multi-master.
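One way the multi-master restriction could be expressed is a guard evaluated before the header flag is set. The sketch below is an assumption, not code from this PR: the thread does not show whether or where the SDK performs such a check, and `HubRegionHeaderGate`, `ShouldRequestHubRegion`, and the `isMultiMasterAccount` parameter are hypothetical names.

```csharp
using System;
using System.Net;

// Hypothetical helper: decides whether a retry should ask for hub-region
// processing. The multi-master check is an assumed guard, not SDK code.
public static class HubRegionHeaderGate
{
    // Sub-status 1002 (read session not available), as referenced in the PR.
    private const int ReadSessionNotAvailable = 1002;

    public static bool ShouldRequestHubRegion(
        HttpStatusCode statusCode,
        int subStatusCode,
        bool isMultiMasterAccount)
    {
        // With multi-master, writes can land in any region, so there is no
        // single hub region to pin the retry to.
        if (isMultiMasterAccount)
        {
            return false;
        }

        return statusCode == HttpStatusCode.NotFound
            && subStatusCode == ReadSessionNotAvailable;
    }
}

public static class Program
{
    public static void Main()
    {
        // Single-write-region account with a 404/1002 response.
        Console.WriteLine(
            HubRegionHeaderGate.ShouldRequestHubRegion(HttpStatusCode.NotFound, 1002, false));

        // Same response on a multi-master account.
        Console.WriteLine(
            HubRegionHeaderGate.ShouldRequestHubRegion(HttpStatusCode.NotFound, 1002, true));
    }
}
```

The first call evaluates to true and the second to false, which keeps the header off multi-master retries.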

if (this.addHubRegionProcessingOnlyHeader)
{
    request.Headers[HubRegionHeader] = bool.TrueString;
    this.addHubRegionProcessingOnlyHeader = false; // reset after applying
Member

What errors would be returned if the SDK tries to read from a non-hub region?

Member

Also falling back to new hub for that partition.

Member

@kirankumarkolli kirankumarkolli left a comment

Waiting for design document

Development

Successfully merging this pull request may close these issues.

[Per Partition Automatic Failover] Use Hub Region Processing Only While Routing Requests Failed with 404/1002

5 participants