Conversation

@abrarsheikh abrarsheikh (Contributor) commented Oct 31, 2025

Summary

Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction.

Motivation

Deployments call downstream deployments via handles in two ways:

  1. Stored handles: Passed to __init__() and stored as attributes → self.model.func.remote()
  2. Dynamic handles: Obtained at runtime via serve.get_deployment_handle() → model.func.remote()

Previously, there was no way to programmatically discover these dependencies from a running replica.
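
For illustration, a minimal sketch of the two patterns (the Model/Ingress deployments, the func method, and the "default" app name are made up for this example; serve.get_deployment_handle is the existing Serve API):

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Model:
    def func(self, x: int) -> int:
        return x * 2


@serve.deployment
class Ingress:
    def __init__(self, model: DeploymentHandle):
        # 1. Stored handle: passed to __init__() and kept as an attribute.
        self.model = model

    async def __call__(self, x: int) -> int:
        stored = await self.model.func.remote(x)

        # 2. Dynamic handle: obtained at runtime from inside the replica.
        handle = serve.get_deployment_handle("Model", app_name="default")
        dynamic = await handle.func.remote(x)
        return stored + dynamic


app = Ingress.bind(Model.bind())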

Implementation

Core Changes

  • ReplicaActor.list_outbound_deployments(): Returns List[DeploymentID] of all downstream deployments

    • Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists)
    • Tracks dynamic handles created via get_deployment_handle() at runtime using a callback mechanism
  • Runtime tracking: Modified get_deployment_handle() to register handles when called from within a replica via ReplicaContext._handle_registration_callback (see the sketch below)
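
How the two sources combine can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: _scan_for_handles and OutboundTracker are made-up names, while the real change uses _PyObjScanner to walk the user callable and registers dynamic handles through ReplicaContext._handle_registration_callback.

from typing import Any, Set

from ray.serve.handle import DeploymentHandle


def _scan_for_handles(obj: Any, found: Set[Any], depth: int = 0) -> None:
    # Recursively look for DeploymentHandles nested in attributes, dicts, and lists.
    if depth > 5:
        # Guard against very deep or self-referential structures.
        return
    if isinstance(obj, DeploymentHandle):
        # Each handle knows which deployment it points at.
        found.add(obj.deployment_id)
    elif isinstance(obj, dict):
        for value in obj.values():
            _scan_for_handles(value, found, depth + 1)
    elif isinstance(obj, (list, tuple, set)):
        for value in obj:
            _scan_for_handles(value, found, depth + 1)
    elif hasattr(obj, "__dict__"):
        for value in vars(obj).values():
            _scan_for_handles(value, found, depth + 1)


class OutboundTracker:
    # Combines handles stored on the user callable with handles created at runtime.

    def __init__(self, user_callable: Any):
        self._user_callable = user_callable
        self._dynamic_ids: Set[Any] = set()

    def on_handle_created(self, deployment_id: Any) -> None:
        # Callback fired when serve.get_deployment_handle() is called from
        # inside the replica (wired through the replica context in this PR).
        self._dynamic_ids.add(deployment_id)

    def list_outbound_deployments(self) -> Set[Any]:
        found: Set[Any] = set()
        _scan_for_handles(self._user_callable, found)  # stored handles
        return found | self._dynamic_ids  # plus dynamically created handles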

Next PR: #58350

@abrarsheikh abrarsheikh added the go (add ONLY when ready to merge, run all tests) label Oct 31, 2025
@abrarsheikh abrarsheikh changed the title expose outbound deployment ids from replica actor [1/n] expose outbound deployment ids from replica actor Oct 31, 2025
@abrarsheikh abrarsheikh marked this pull request as ready for review October 31, 2025 22:11
@abrarsheikh abrarsheikh requested a review from a team as a code owner October 31, 2025 22:11
_deployment_config: DeploymentConfig,
rank: int,
world_size: int,
handle_registration_callback: Optional[Callable[[str, str], None]] = None,

Bug: Type mismatch in callback for replica context

The handle_registration_callback parameter in _set_internal_replica_context has a type annotation mismatch. It's currently Callable[[str, str], None], but the ReplicaContext field and its actual invocation expect Callable[[DeploymentID], None]. This difference could lead to a runtime type error.
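
A minimal illustration of the mismatch (the import path is an assumption; DeploymentID is an internal Serve type):

from typing import Callable, Optional

from ray.serve._private.common import DeploymentID  # assumed location of the internal type

# What the ReplicaContext field and the call site expect:
handle_registration_callback: Optional[Callable[[DeploymentID], None]] = None

# The annotation flagged in _set_internal_replica_context:
# handle_registration_callback: Optional[Callable[[str, str], None]] = None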

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Nov 1, 2025
@akyang-anyscale (Contributor)

in theory, could the recorded dynamic handles be different per replica process based on business logic? how would we compile deployment dag based on this?

@abrarsheikh (Contributor, Author)

> in theory, could the recorded dynamic handles be different per replica process based on business logic? how would we compile deployment dag based on this?

This is possible, but I would assume it's rare.

Since we enrich the DAG over a period of time, I expect the DAG to be a good representation of reality. But ultimately, this is best effort.

@abrarsheikh (Contributor, Author)

Generally speaking, you are right: if there are two replicas for a deployment and we keep switching between them, then the DAG will change over time. Not the best experience, but probably okay?

Another design choice we can make is to ensure that the DAG is the same across all replicas; if they differ, then don't construct the DAG. The only downside is that seeking information from all replicas can be expensive.

@akyang-anyscale (Contributor)

could it be unified at some higher level (like deployment level instead of replicas)?

@abrarsheikh (Contributor, Author)

Yeah, we can do that here: instead of poking one replica, we can query all of them and union the results.
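
A rough sketch of that union, assuming the caller already holds the replica actor handles (list_outbound_deployments is the method added in this PR; everything else here is illustrative):

import ray


def union_outbound_deployments(replica_actor_handles):
    # Ask every replica for its outbound deployments and union the results.
    refs = [
        handle.list_outbound_deployments.remote() for handle in replica_actor_handles
    ]
    merged = set()
    for deployment_ids in ray.get(refs):
        merged.update(deployment_ids)
    return merged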

Signed-off-by: abrar <[email protected]>
Comment on lines +1249 to +1257
scanner = _PyObjScanner(source_type=DeploymentHandle)
try:
handles = scanner.find_nodes((init_args, init_kwargs))

for handle in handles:
deployment_id = handle.deployment_id
seen_deployment_ids.add(deployment_id)
finally:
scanner.clear()
@abrarsheikh (Contributor, Author)

This can be cached, but it's not super important because list_outbound_deployments will be called infrequently.

Contributor

what's the frequency?

@abrarsheikh (Contributor, Author)

Exponential backoff starting from 1s, capped at 10 minutes.
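
For reference, a schedule of that shape could look like this (illustrative, not necessarily the PR's exact constants):

def next_backoff_s(attempt: int, base_s: float = 1.0, cap_s: float = 600.0) -> float:
    # Delay before the next poll: doubles each attempt, capped at 10 minutes.
    return min(base_s * (2 ** attempt), cap_s)


# attempt: 0 -> 1s, 1 -> 2s, 2 -> 4s, ..., 9 -> 512s, 10 and beyond -> 600s (the 10 minute cap)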

@abrarsheikh abrarsheikh merged commit b9ee3fe into master Nov 6, 2025
6 checks passed
@abrarsheikh abrarsheikh deleted the dag-of-deployments branch November 6, 2025 21:33