-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[1/n] expose outbound deployment ids from replica actor #58345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: abrar <[email protected]>
| _deployment_config: DeploymentConfig, | ||
| rank: int, | ||
| world_size: int, | ||
| handle_registration_callback: Optional[Callable[[str, str], None]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Type mismatch in callback for replica context
The handle_registration_callback parameter in _set_internal_replica_context has a type annotation mismatch. It's currently Callable[[str, str], None], but the ReplicaContext field and its actual invocation expect Callable[[DeploymentID], None]. This difference could lead to a runtime type error.
|
in theory, could the recorded dynamic handles be different per replica process based on business logic? how would we compile deployment dag based on this? |
This is possible, but I would assume it's rare. Since we enrich the DAG over a period of time, I expect the DAG to be a good representation of reality. But ultimately, this is the best effort |
|
Generally speaking, you are right, if there are two replicas for a deployment and if we keep switching between them, then the DAG will change over time. Not the best experience, but probably okay? Another design choice we can make is, ensure that DAG is the same across all replicas; if they are different, then don't construct the DAG. Only downside to this is seeking information from all replicas can be expensive. |
|
could it be unified at some higher level (like deployment level instead of replicas)? |
|
Yeah, we can do that here , instead of poking one replica, we can query all and union them. |
Signed-off-by: abrar <[email protected]>
| scanner = _PyObjScanner(source_type=DeploymentHandle) | ||
| try: | ||
| handles = scanner.find_nodes((init_args, init_kwargs)) | ||
|
|
||
| for handle in handles: | ||
| deployment_id = handle.deployment_id | ||
| seen_deployment_ids.add(deployment_id) | ||
| finally: | ||
| scanner.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be cached, but not super important because list_outbound_deployments will be called infrequently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the frequency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exponential backoff starting from 1s then capped at 10mins
| scanner = _PyObjScanner(source_type=DeploymentHandle) | ||
| try: | ||
| handles = scanner.find_nodes((init_args, init_kwargs)) | ||
|
|
||
| for handle in handles: | ||
| deployment_id = handle.deployment_id | ||
| seen_deployment_ids.add(deployment_id) | ||
| finally: | ||
| scanner.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the frequency?
Summary
Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction.
Motivation
Deployments call downstream deployments via handles in two ways:
__init__()and stored as attributes →self.model.func.remote()serve.get_deployment_handle()→model.func.remote()Previously, there was no way to programmatically discover these dependencies from a running replica.
Implementation
Core Changes
ReplicaActor.list_outbound_deployments(): ReturnsList[DeploymentID]of all downstream deploymentsget_deployment_handle()at runtime using a callback mechanismRuntime tracking: Modified
get_deployment_handle()to register handles when called from within a replica viaReplicaContext._handle_registration_callbackNext PR: #58350