-
Notifications
You must be signed in to change notification settings - Fork 26
fix - Metric logging work with new monarch API #451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix - Metric logging work with new monarch API #451
Conversation
…-metric-logging-reland
404ee1d
to
00ccf0c
Compare
can ignore the unit test failing. CI is using old monarch version. Error is harmless too. Tests pass locally. |
).as_service() | ||
generator = await GeneratorActor.options( | ||
**service_config, mesh_name="GeneratorActor" | ||
).as_service() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of mesh_name
here? Is it for the better display on wandb? Why is it not applied to all main files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | ||
await global_logger.register_fetcher.call_one(local_fetcher_actor, proc) | ||
# Generate a unique ID to map procmesh to fetcher | ||
proc._uid = str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mind approving it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little concerning about the broken CI. Wouldn't it cause all the subsequent commits to break as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think that errors are related to this PR. But let me confirm by opening a dummy PR
# Deregister local logger from global logger | ||
if hasattr(proc_mesh, "_local_fetcher"): | ||
# Deregister LocalFetcherActor from GlobalLoggingActor | ||
if hasattr(proc_mesh, "_local_fetcher") and hasattr(proc_mesh, "_uid"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for a proc_mesh that has _local_fetcher
but not _uid
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, they should always have both. I guess i was having extra safe here. Is it confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you write it like this to be safe. But I just worry it may hide some potential errors. How about raise an error if it has _local_fetcher
but not _uid
?
…-metric-logging-reland
Co-authored-by: Felipe Mello <[email protected]>
We landed #351, which adds better rank names to logging (later reverted in #429). This adds 351 back, but with fixes for monarch.
Goal of the original PR:

Monarch changed APIs, breaking the code in two ways:
1)
Situation: To get the
actor_name
andProcMesh
uid, i was usingmonarch.actor.context()
.Problem: Now actor_name returns
{actor_name}_{ProcMeshuid}
(orclient
if outside a PrcoMesh), and calling context.world_name errors if called outside of aProcMesh
context.Fix: Never call
.world_name
and get the uid from actor_name, if any.Situation: To allow
GlobalLoggingActor
to access theLocalFetcherActor
on eachProcMesh
, we would register the (key,value) as{ProcMesh: LocalFetcherActor}
Problem: The ProcMesh is not hashable anymore.
Solution: Instead of using ProcMesh as a key, we create a 'str' UUID for the proc.
Extra: updated docstrings to make clear whats going on