Conversation

@abrarsheikh abrarsheikh (Contributor) commented Oct 31, 2025

Summary

Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction.

Motivation

Deployments call downstream deployments via handles in two ways:

  1. Stored handles: Passed to __init__() and stored as attributes → self.model.func.remote()
  2. Dynamic handles: Obtained at runtime via serve.get_deployment_handle() → model.func.remote()

Previously, there was no way to programmatically discover these dependencies from a running replica.
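
For illustration, a minimal sketch of the two patterns (the Model/Ingress deployments, the func method, and the "default" app name are made up for this example; serve.get_deployment_handle is the existing Serve API):

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Model:
    def func(self, x: int) -> int:
        return x * 2


@serve.deployment
class Ingress:
    def __init__(self, model: DeploymentHandle):
        # 1. Stored handle: passed to __init__() and kept as an attribute.
        self.model = model

    async def __call__(self, x: int) -> int:
        stored = await self.model.func.remote(x)

        # 2. Dynamic handle: obtained at runtime from inside the replica.
        handle = serve.get_deployment_handle("Model", app_name="default")
        dynamic = await handle.func.remote(x)
        return stored + dynamic


app = Ingress.bind(Model.bind())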

Implementation

Core Changes

  • ReplicaActor.list_outbound_deployments(): Returns List[DeploymentID] of all downstream deployments

    • Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists)
    • Tracks dynamic handles created via get_deployment_handle() at runtime using a callback mechanism
  • Runtime tracking: Modified get_deployment_handle() to register handles when called from within a replica via ReplicaContext._handle_registration_callback (see the sketch below)
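
How the two sources combine can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: _scan_for_handles and OutboundTracker are made-up names, while the real change uses _PyObjScanner to walk the user callable and registers dynamic handles through ReplicaContext._handle_registration_callback.

from typing import Any, Set

from ray.serve.handle import DeploymentHandle


def _scan_for_handles(obj: Any, found: Set[Any], depth: int = 0) -> None:
    # Recursively look for DeploymentHandles nested in attributes, dicts, and lists.
    if depth > 5:
        # Guard against very deep or self-referential structures.
        return
    if isinstance(obj, DeploymentHandle):
        # Each handle knows which deployment it points at.
        found.add(obj.deployment_id)
    elif isinstance(obj, dict):
        for value in obj.values():
            _scan_for_handles(value, found, depth + 1)
    elif isinstance(obj, (list, tuple, set)):
        for value in obj:
            _scan_for_handles(value, found, depth + 1)
    elif hasattr(obj, "__dict__"):
        for value in vars(obj).values():
            _scan_for_handles(value, found, depth + 1)


class OutboundTracker:
    # Combines handles stored on the user callable with handles created at runtime.

    def __init__(self, user_callable: Any):
        self._user_callable = user_callable
        self._dynamic_ids: Set[Any] = set()

    def on_handle_created(self, deployment_id: Any) -> None:
        # Callback fired when serve.get_deployment_handle() is called from
        # inside the replica (wired through the replica context in this PR).
        self._dynamic_ids.add(deployment_id)

    def list_outbound_deployments(self) -> Set[Any]:
        found: Set[Any] = set()
        _scan_for_handles(self._user_callable, found)  # stored handles
        return found | self._dynamic_ids  # plus dynamically created handles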

Next PR: #58350

@abrarsheikh abrarsheikh added the go (add ONLY when ready to merge, run all tests) label Oct 31, 2025
@abrarsheikh abrarsheikh changed the title expose outbound deployment ids from replica actor [1/n] expose outbound deployment ids from replica actor Oct 31, 2025
@abrarsheikh abrarsheikh marked this pull request as ready for review October 31, 2025 22:11
@abrarsheikh abrarsheikh requested a review from a team as a code owner October 31, 2025 22:11
_deployment_config: DeploymentConfig,
rank: int,
world_size: int,
handle_registration_callback: Optional[Callable[[str, str], None]] = None,

Bug: Type mismatch in callback for replica context

The handle_registration_callback parameter in _set_internal_replica_context has a type annotation mismatch. It's currently Callable[[str, str], None], but the ReplicaContext field and its actual invocation expect Callable[[DeploymentID], None]. This difference could lead to a runtime type error.
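
A minimal illustration of the mismatch (the import path is an assumption; DeploymentID is an internal Serve type):

from typing import Callable, Optional

from ray.serve._private.common import DeploymentID  # assumed location of the internal type

# What the ReplicaContext field and the call site expect:
handle_registration_callback: Optional[Callable[[DeploymentID], None]] = None

# The annotation flagged in _set_internal_replica_context:
# handle_registration_callback: Optional[Callable[[str, str], None]] = None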

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Nov 1, 2025
@akyang-anyscale (Contributor)

in theory, could the recorded dynamic handles be different per replica process based on business logic? how would we compile deployment dag based on this?

@abrarsheikh (Contributor, Author)

> in theory, could the recorded dynamic handles be different per replica process based on business logic? how would we compile deployment dag based on this?

This is possible, but I would assume it's rare.

Since we enrich the DAG over a period of time, I expect the DAG to be a good representation of reality. But ultimately, this is best effort.

@abrarsheikh (Contributor, Author)

Generally speaking, you are right: if there are two replicas for a deployment and we keep switching between them, then the DAG will change over time. Not the best experience, but probably okay?

Another design choice we can make is to ensure that the DAG is the same across all replicas; if they differ, then don't construct the DAG. The only downside is that seeking information from all replicas can be expensive.

@akyang-anyscale (Contributor)

could it be unified at some higher level (like deployment level instead of replicas)?

@abrarsheikh (Contributor, Author)

Yeah, we can do that here: instead of poking one replica, we can query all of them and union the results.
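
A rough sketch of that union, assuming the caller already holds the replica actor handles (list_outbound_deployments is the method added in this PR; everything else here is illustrative):

import ray


def union_outbound_deployments(replica_actor_handles):
    # Ask every replica for its outbound deployments and union the results.
    refs = [
        handle.list_outbound_deployments.remote() for handle in replica_actor_handles
    ]
    merged = set()
    for deployment_ids in ray.get(refs):
        merged.update(deployment_ids)
    return merged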

Signed-off-by: abrar <[email protected]>
Comment on lines +1249 to +1257
scanner = _PyObjScanner(source_type=DeploymentHandle)
try:
handles = scanner.find_nodes((init_args, init_kwargs))

for handle in handles:
deployment_id = handle.deployment_id
seen_deployment_ids.add(deployment_id)
finally:
scanner.clear()
@abrarsheikh (Contributor, Author)

This can be cached, but it's not super important because list_outbound_deployments will be called infrequently.

Contributor

what's the frequency?

@abrarsheikh (Contributor, Author)

Exponential backoff starting from 1s, capped at 10 minutes.
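
For reference, a schedule of that shape could look like this (illustrative, not necessarily the PR's exact constants):

def next_backoff_s(attempt: int, base_s: float = 1.0, cap_s: float = 600.0) -> float:
    # Delay before the next poll: doubles each attempt, capped at 10 minutes.
    return min(base_s * (2 ** attempt), cap_s)


# attempt: 0 -> 1s, 1 -> 2s, 2 -> 4s, ..., 9 -> 512s, 10 and beyond -> 600s (the 10 minute cap)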

@abrarsheikh abrarsheikh merged commit b9ee3fe into master Nov 6, 2025
6 checks passed
@abrarsheikh abrarsheikh deleted the dag-of-deployments branch November 6, 2025 21:33