
@felipemello1 felipemello1 commented Oct 17, 2025

We landed #351, which adds better rank names to logging (later reverted in #429). This PR re-lands #351 with fixes for the new Monarch API.

Goal of the original PR:
image

Monarch changed APIs, breaking the code in two ways:
1)
Situation: To get the actor_name and the ProcMesh uid, I was using monarch.actor.context().
Problem: actor_name now returns {actor_name}_{ProcMeshuid} (or client if outside a ProcMesh), and context.world_name raises an error when called outside of a ProcMesh context.
Fix: Never call .world_name; extract the uid from actor_name, if present.

2)
Situation: To allow GlobalLoggingActor to access the LocalFetcherActor on each ProcMesh, we registered the (key, value) pair as {ProcMesh: LocalFetcherActor}.
Problem: ProcMesh is no longer hashable.
Fix: Instead of using the ProcMesh as a key, we create a str UUID for the proc.
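
A minimal sketch of the first fix (reading the uid out of actor_name rather than calling .world_name). It does not call Monarch at all: split_actor_name is a hypothetical helper, and it assumes the uid is whatever follows the last underscore and that a bare name such as client carries no uid. The real format of the uid may differ.

```python
from typing import Optional, Tuple


def split_actor_name(raw_name: str) -> Tuple[str, Optional[str]]:
    """Split '{actor_name}_{uid}' into (actor_name, uid).

    Assumption (not confirmed by the PR): the uid is the text after the
    last underscore. A name without a usable suffix, e.g. 'client',
    is returned unchanged with uid=None.
    """
    base, sep, uid = raw_name.rpartition("_")
    if not sep or not base or not uid:
        # No underscore, or nothing on one side of it: treat as "no uid".
        return raw_name, None
    return base, uid


if __name__ == "__main__":
    print(split_actor_name("TrainActor_abc123"))
    print(split_actor_name("client"))
```

Splitting on the last underscore keeps actor names that themselves contain underscores intact, at the cost of assuming the uid never contains one.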

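A rough sketch of the second fix (keying the fetcher registry by a str UUID instead of by the ProcMesh). FetcherRegistry, FakeProcMesh and attach_uid are hypothetical stand-ins written for illustration; only the proc._uid = str(uuid.uuid4()) line mirrors the diff in this PR.

```python
import uuid
from typing import Any, Dict


class FakeProcMesh:
    """Placeholder for a ProcMesh; made unhashable on purpose."""

    __hash__ = None  # instances cannot be used as dict keys


class FetcherRegistry:
    """Hypothetical stand-in for the mapping kept by GlobalLoggingActor."""

    def __init__(self) -> None:
        self._fetchers: Dict[str, Any] = {}

    def register(self, fetcher: Any, proc_id: str) -> None:
        # The key is a plain string, so hashability of the mesh is irrelevant.
        self._fetchers[proc_id] = fetcher

    def deregister(self, proc_id: str) -> None:
        self._fetchers.pop(proc_id, None)

    def __len__(self) -> int:
        return len(self._fetchers)


def attach_uid(proc: Any) -> str:
    """Tag the proc with a string UUID and return it."""
    proc._uid = str(uuid.uuid4())
    return proc._uid


if __name__ == "__main__":
    registry = FetcherRegistry()
    proc = FakeProcMesh()
    registry.register("local-fetcher", attach_uid(proc))
    registry.deregister(proc._uid)
```

Storing the id on the proc itself means the same string is available later at deregistration time without a reverse lookup.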

Extra: updated docstrings to make clear what's going on.

@meta-cla meta-cla bot added the CLA Signed label Oct 17, 2025
@felipemello1 felipemello1 changed the title fix - Metric logging work with new monarch API [draft] fix - Metric logging work with new monarch API Oct 17, 2025
@felipemello1 felipemello1 marked this pull request as draft October 17, 2025 15:35
@felipemello1 felipemello1 force-pushed the fix-metric-logging-reland branch from 404ee1d to 00ccf0c Compare October 17, 2025 17:33
@felipemello1 felipemello1 marked this pull request as ready for review October 17, 2025 17:47
@felipemello1 felipemello1 changed the title [draft] fix - Metric logging work with new monarch API fix - Metric logging work with new monarch API Oct 17, 2025
@felipemello1
Contributor Author

The failing unit test can be ignored: CI is using an old Monarch version. The error is harmless, and the tests pass locally.

).as_service()
generator = await GeneratorActor.options(
    **service_config, mesh_name="GeneratorActor"
).as_service()
Member

What's the purpose of mesh_name here? Is it for better display on wandb? Why is it not applied to all main files?

Contributor Author

@felipemello1 felipemello1 Oct 17, 2025

)
await global_logger.register_fetcher.call_one(local_fetcher_actor, proc)
# Generate a unique ID to map procmesh to fetcher
proc._uid = str(uuid.uuid4())
Member

Thanks for the fix! LGTM

Contributor Author

Do you mind approving it?

Member

I am a little concerned about the broken CI. Wouldn't it cause all subsequent commits to break as well?

Contributor Author

@felipemello1 felipemello1 Oct 17, 2025

I don't think the errors are related to this PR, but let me confirm by opening a dummy PR.

- # Deregister local logger from global logger
- if hasattr(proc_mesh, "_local_fetcher"):
+ # Deregister LocalFetcherActor from GlobalLoggingActor
+ if hasattr(proc_mesh, "_local_fetcher") and hasattr(proc_mesh, "_uid"):
Member

Is it possible for a proc_mesh to have _local_fetcher but not _uid?

Contributor Author

No, they should always have both. I guess I was being extra safe here. Is it confusing?

Member

I understand you wrote it like this to be safe, but I worry it may hide potential errors. How about raising an error if it has _local_fetcher but not _uid?
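
One way the suggestion above might look, as a sketch only and not the code that was merged: uid_for_deregistration is a hypothetical helper, and SimpleNamespace objects stand in for real ProcMesh instances. The idea is to skip meshes that never registered a fetcher, but to fail loudly when _local_fetcher is present without _uid.

```python
from types import SimpleNamespace
from typing import Any, Optional


def uid_for_deregistration(proc_mesh: Any) -> Optional[str]:
    """Return the uid to deregister, or None if no fetcher was attached.

    Raises RuntimeError for the inconsistent state (fetcher without uid)
    instead of silently skipping it.
    """
    if not hasattr(proc_mesh, "_local_fetcher"):
        return None  # nothing was registered for this mesh
    if not hasattr(proc_mesh, "_uid"):
        raise RuntimeError(
            "proc_mesh has _local_fetcher but no _uid; "
            "registration state is inconsistent"
        )
    return proc_mesh._uid


# Stand-ins for ProcMesh objects in the three possible states.
registered = SimpleNamespace(_local_fetcher=object(), _uid="1234")
never_registered = SimpleNamespace()
inconsistent = SimpleNamespace(_local_fetcher=object())
```

With this shape, the combined hasattr check is replaced by an explicit error path, so a missing _uid surfaces at shutdown instead of leaving a stale entry in the global logger.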

@felipemello1 felipemello1 merged commit 2810162 into meta-pytorch:main Oct 17, 2025
8 of 9 checks passed
DNXie pushed a commit to DNXie/forge that referenced this pull request Oct 20, 2025
