felipemello1
Contributor

@felipemello1 felipemello1 commented Oct 8, 2025

When logging per rank, we need good naming to make debugging easier. I was previously trying to fetch this name from monarch.actor.context, but MetricCollector.init_backends is called in the context of the LocalFetcherActor, so I would just get the name local_fetcher_actor.

To solve this, I use the call stack instead and inspect it while the actor is spawning: provisioner.py -> get_or_create_metric_logger -> detect_actor_name_from_call_stack(). The utility walks back up the stack until it finds a ForgeActor subclass and gets its name. This is saved as the process_name, which the metric logging backends use.

Users can also pass a process name as input. That's how we get the Controller name.

mlogger = await get_or_create_metric_logger(process_name="Controller")

The process_name then goes: LocalFetcherActor -> MetricCollector -> backend.init -> wandb.init(name)
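To make that hand-off concrete, here is a rough sketch of the chain. It is only an illustration under assumptions: the classes below are simplified stand-ins that share names with the real ones, their constructors and method signatures are invented for the example, and `RecordingBackend` is a hypothetical backend that records the name where a real one would pass it to wandb.

```python
from typing import Optional


class RecordingBackend:
    """Hypothetical backend; a real W&B backend would call wandb.init(name=...)."""

    def __init__(self) -> None:
        self.run_name: Optional[str] = None

    def init(self, process_name: str) -> None:
        # Stand-in for wandb.init(name=process_name).
        self.run_name = process_name


class MetricCollector:
    """Simplified stand-in: forwards the process name to its backend."""

    def __init__(self, backend: RecordingBackend) -> None:
        self.backend = backend

    def init_backends(self, process_name: str) -> None:
        self.backend.init(process_name)


class LocalFetcherActor:
    """Simplified stand-in: stores the name it was spawned with."""

    def __init__(self, process_name: str, collector: MetricCollector) -> None:
        self.process_name = process_name
        self.collector = collector

    def init_backends(self) -> None:
        # The name was resolved before spawning, so it is not
        # affected by the fact that we are now inside this actor.
        self.collector.init_backends(self.process_name)


backend = RecordingBackend()
actor = LocalFetcherActor("Trainer", MetricCollector(backend))
actor.init_backends()
```

The point of the sketch is that the name is resolved once, before the LocalFetcherActor exists, and then only passed along.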

It does feel a bit brittle to traverse the stack until finding a ForgeActor. But in the worst case:
a) the user can pass the process name
b) if we cannot find any ForgeActor, the name is just UnknownActor
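The detection and its two fallbacks can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `ForgeActor` here is a stand-in class, `Trainer` and `resolve_process_name` are hypothetical names, and looking for `self`/`cls` in each frame's locals is an assumption about how a ForgeActor could be found on the stack.

```python
import inspect
from typing import Optional


class ForgeActor:
    """Stand-in for the real ForgeActor base class (assumption)."""


class Trainer(ForgeActor):
    """Hypothetical actor used only for this illustration."""

    def setup_logging(self) -> str:
        # Called from inside an actor method, so `self` is on the stack.
        return resolve_process_name()


def detect_actor_name_from_call_stack() -> str:
    """Walk up the stack; return the first ForgeActor subclass name found."""
    frame = inspect.currentframe()
    try:
        while frame is not None:
            # Instance methods expose the actor as `self`.
            candidate = frame.f_locals.get("self")
            if isinstance(candidate, ForgeActor):
                return type(candidate).__name__
            # Classmethods expose the actor class as `cls`.
            candidate = frame.f_locals.get("cls")
            if inspect.isclass(candidate) and issubclass(candidate, ForgeActor):
                return candidate.__name__
            frame = frame.f_back
    finally:
        del frame  # drop the frame reference to avoid reference cycles
    # Fallback (b): nothing on the stack is a ForgeActor.
    return "UnknownActor"


def resolve_process_name(process_name: Optional[str] = None) -> str:
    """Fallback (a): an explicitly passed name always wins."""
    if process_name is None:
        process_name = detect_actor_name_from_call_stack()
    return process_name
```

With these stand-ins, `Trainer().setup_logging()` resolves to the actor's class name, `resolve_process_name("Controller")` keeps the explicit name, and a call from outside any actor falls through to `UnknownActor`.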

@meta-cla meta-cla bot added the CLA Signed label Oct 8, 2025
@allenwang28
Contributor

could you add some more comments in the description about what this PR is enabling?

@felipemello1
Contributor Author

> could you add some more comments in the description about what this PR is enabling?

Yes, sorry! I should have marked this as a draft. I am doing a 2.5/4.0 before I ask you to review it.

@felipemello1 felipemello1 marked this pull request as draft October 8, 2025 22:37
@felipemello1 felipemello1 changed the title Metric Logging updates 3/4 [wip] Metric Logging updates 3/4 Oct 8, 2025
@felipemello1 felipemello1 changed the title [wip] Metric Logging updates 3/4 Metric Logging updates 4/N Oct 9, 2025
self.timestamp = datetime.now(pytz.UTC).timestamp()


def get_actor_name_with_rank() -> str:
Contributor Author

moved to observability/utils.py

@felipemello1 felipemello1 marked this pull request as ready for review October 9, 2025 03:27
Comment on lines +81 to +82
if process_name is None:
    process_name = detect_actor_name_from_call_stack()
Contributor Author

Get the name here and pass it to:

local_fetcher_actor = proc.spawn(
    "local_fetcher_actor", LocalFetcherActor, global_logger, process_name
)

This function is called in provisioner.py, and that's how we get the process_name for every wandb run.

logger = logging.getLogger(__name__)


def detect_actor_name_from_call_stack() -> str:
Contributor Author

main file to review

@felipemello1
Contributor Author

felipemello1 commented Oct 10, 2025

In the near future the mesh might hold a name. When that happens, we can delete the call-stack function and just get the name from the mesh. The rest of the PR stands.
