@abrarsheikh
Contributor

@abrarsheikh abrarsheikh commented Nov 1, 2025

Update the deployment state to poll replicas for information about their outbound deployments. Polling uses exponential backoff, starting at 1s and capped at 10 minutes.
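
As a rough illustration of the schedule described above (not the code in this PR), the delay sequence can be sketched as follows. `backoff_delays` and its parameter names are hypothetical; only the 1s start and 10-minute cap come from the description.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial_s: float = 1.0, max_s: float = 600.0) -> Iterator[float]:
    """Yield poll delays that double after every poll, capped at max_s."""
    delay = initial_s
    while True:
        yield delay
        delay = min(delay * 2, max_s)


# The first twelve delays double from 1s up to 512s, then stay at the 600s cap.
print(list(islice(backoff_delays(), 12)))
```

With these values the cap is reached on the eleventh poll, roughly 17 minutes after the first one.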

We currently poll all replicas. This gives the most accurate result, but it can be expensive for deployments with thousands of replicas. Polling only one replica can be suboptimal. Should we query a fixed-size subset, such as 10 replicas, for efficiency?
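
One possible shape for the fixed-subset option, as a sketch only: `pick_replicas_to_poll` and `max_polled` are hypothetical names, and uniform random sampling is an assumption rather than something this PR implements.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pick_replicas_to_poll(replicas: Sequence[T], max_polled: int = 10) -> List[T]:
    """Return at most max_polled replicas, chosen uniformly at random."""
    if len(replicas) <= max_polled:
        # Small deployments: poll every replica.
        return list(replicas)
    # Large deployments: poll a random subset without repeats.
    return random.sample(replicas, max_polled)


subset = pick_replicas_to_poll([f"replica-{i}" for i in range(1000)])
print(len(subset))
```

Re-sampling on every poll cycle would let dynamic handles that only some replicas have created surface over time, at the cost of a less stable result between polls.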

Next PR -> #58355

@abrarsheikh abrarsheikh changed the base branch from master to dag-of-deployments November 1, 2025 01:23
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Nov 1, 2025
abrarsheikh added a commit that referenced this pull request Nov 6, 2025
## Summary
Adds a new method to expose all downstream deployments that a replica
calls into, enabling dependency graph construction.

## Motivation
Deployments call downstream deployments via handles in two ways:
1. **Stored handles**: Passed to `__init__()` and stored as attributes →
`self.model.func.remote()`
2. **Dynamic handles**: Obtained at runtime via
`serve.get_deployment_handle()` → `model.func.remote()`

Previously, there was no way to programmatically discover these
dependencies from a running replica.

## Implementation

### Core Changes
- **`ReplicaActor.list_outbound_deployments()`**: Returns
`List[DeploymentID]` of all downstream deployments
  - Recursively inspects user callable attributes to find stored handles
(including nested in dicts/lists)
  - Tracks dynamic handles created via `get_deployment_handle()` at
runtime using a callback mechanism

- **Runtime tracking**: Modified `get_deployment_handle()` to register
handles when called from within a replica via
`ReplicaContext._handle_registration_callback`
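
The two discovery paths above can be sketched without Ray as follows. This is a simplified illustration, not the actual `ReplicaActor` code: `FakeHandle`, `FakeReplica`, `find_handles`, and `on_handle_created` are hypothetical stand-ins, and deployment names are plain strings rather than `DeploymentID` objects.

```python
from typing import Any, List, Set


class FakeHandle:
    """Hypothetical stand-in for a deployment handle."""

    def __init__(self, deployment_name: str) -> None:
        self.deployment_name = deployment_name


def find_handles(obj: Any, found: Set[str], seen: Set[int]) -> None:
    """Recursively collect handle names from attributes, dicts, and lists."""
    if id(obj) in seen:  # guard against reference cycles
        return
    seen.add(id(obj))
    if isinstance(obj, FakeHandle):
        found.add(obj.deployment_name)
    elif isinstance(obj, dict):
        for value in obj.values():
            find_handles(value, found, seen)
    elif isinstance(obj, (list, tuple, set)):
        for item in obj:
            find_handles(item, found, seen)
    elif hasattr(obj, "__dict__"):
        for value in vars(obj).values():
            find_handles(value, found, seen)


class FakeReplica:
    """Hypothetical replica that combines stored and dynamic handles."""

    def __init__(self, user_callable: Any) -> None:
        self._user_callable = user_callable
        self._dynamic: Set[str] = set()

    def on_handle_created(self, deployment_name: str) -> None:
        """Callback invoked whenever a handle is obtained at runtime."""
        self._dynamic.add(deployment_name)

    def list_outbound_deployments(self) -> List[str]:
        found: Set[str] = set()
        find_handles(self._user_callable, found, set())
        return sorted(found | self._dynamic)


class Ingress:
    """Example user callable with a direct handle and nested handles."""

    def __init__(self) -> None:
        self.model = FakeHandle("model")
        self.extras = {"rerankers": [FakeHandle("reranker_a"), FakeHandle("reranker_b")]}


replica = FakeReplica(Ingress())
replica.on_handle_created("tokenizer")  # simulates a runtime handle lookup
print(replica.list_outbound_deployments())
# → ['model', 'reranker_a', 'reranker_b', 'tokenizer']
```

The attribute scan is repeated on every call so that handles assigned after `__init__()` are still picked up, while the callback covers handles that are never stored on the callable at all.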


Next PR: #58350

---------

Signed-off-by: abrar <[email protected]>
Base automatically changed from dag-of-deployments to master November 6, 2025 21:33
@abrarsheikh abrarsheikh marked this pull request as ready for review November 6, 2025 22:03
@abrarsheikh abrarsheikh requested a review from a team as a code owner November 6, 2025 22:03
Signed-off-by: abrar <[email protected]>
if self.curr_status_info.status == DeploymentStatus.HEALTHY:
    self._outbound_poll_delay = min(
        self._outbound_poll_delay * 2, self._max_outbound_poll_delay
    )

Bug: Exponential Backoff Fails for Non-Healthy Polls

The exponential backoff for outbound deployments polling only increases the delay when the deployment status is HEALTHY. This means that during deployment updates, rollouts, or any non-HEALTHY states (UPDATING, UPSCALING, DOWNSCALING, etc.), the poll delay will never increase and replicas will be polled at the initial 1-second interval indefinitely. This defeats the purpose of exponential backoff and can cause excessive polling during long-running deployment operations. The backoff should increase based on successful polls, not deployment health status.
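
A minimal sketch of the suggested behavior, where the delay grows after each successful poll regardless of deployment status. `OutboundPoller` and `record_poll` are hypothetical names, and leaving the delay unchanged after a failed poll is an assumption, not something the PR specifies.

```python
class OutboundPoller:
    """Tracks the delay before the next outbound-deployments poll."""

    def __init__(self, initial_delay_s: float = 1.0, max_delay_s: float = 600.0) -> None:
        self._delay_s = initial_delay_s
        self._max_delay_s = max_delay_s

    def record_poll(self, succeeded: bool) -> float:
        """Update and return the delay to wait before the next poll."""
        if succeeded:
            # Back off after every successful poll, independent of health status.
            self._delay_s = min(self._delay_s * 2, self._max_delay_s)
        # Assumption: a failed poll keeps the current delay so it is retried
        # at the same cadence instead of backing off further.
        return self._delay_s


poller = OutboundPoller()
print(poller.record_poll(succeeded=True))   # → 2.0
print(poller.record_poll(succeeded=False))  # → 2.0
```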


@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Nov 7, 2025
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…58345)
