Metric Logging updates 4/N #351

felipemello1 · 2025-10-08T19:16:44Z

When logging per rank, good process names make debugging easier. I previously tried to fetch the name from monarch.actor.context, but MetricCollector.init_backends is called in the context of the LocalFetcherActor, so I would just get the name local_fetcher_actor.

To solve this, I use the call stack instead and resolve the name while the actor is spawning: provisioner.py -> get_or_create_metric_logger -> detect_actor_name_from_call_stack(). The utility walks back up the stack until it finds a ForgeActor subclass and takes its name. This is saved as the process_name, which the metric logging backends use.

Users can also pass a process name as input. That's how we get the Controller name.

mlogger = await get_or_create_metric_logger(process_name="Controller")

The process_name then flows: LocalFetcherActor -> MetricCollector -> backend.init -> wandb.init(name)

It does feel a bit brittle to traverse the stack until finding a ForgeActor. But in the worst case:
a) the user can pass the process name
b) if we cannot find any, then it's just UnknownActor
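
A minimal sketch of the call-stack lookup described above, not the code in this PR. The `ForgeActor` and `TrainerActor` classes here are local stand-ins, and checking the `self` / `cls` locals of each frame is an assumption about how an actor could be recognized; the real utility in observability/utils.py may do this differently.

```python
import inspect


class ForgeActor:
    """Stand-in for the real ForgeActor base class (assumption for this sketch)."""


def detect_actor_name_from_call_stack(default: str = "UnknownActor") -> str:
    """Walk up the call stack and return the first ForgeActor subclass name found."""
    frame = inspect.currentframe()
    try:
        # Skip this function's own frame and start at the caller.
        frame = frame.f_back if frame is not None else None
        while frame is not None:
            # Instance methods expose the instance as `self`; classmethods expose `cls`.
            candidate = frame.f_locals.get("self")
            cls = type(candidate) if candidate is not None else frame.f_locals.get("cls")
            if isinstance(cls, type) and issubclass(cls, ForgeActor) and cls is not ForgeActor:
                return cls.__name__
            frame = frame.f_back
        # No actor frame found: fall back to the placeholder name.
        return default
    finally:
        # Drop the frame reference to avoid keeping a reference cycle alive.
        del frame


class TrainerActor(ForgeActor):
    """Hypothetical actor used only to exercise the lookup."""

    def spawn_logger(self) -> str:
        return detect_actor_name_from_call_stack()
```

Called from inside `TrainerActor.spawn_logger`, the lookup finds the `self` local of that frame and returns the class name; called from a plain function with no actor on the stack, it returns the `UnknownActor` fallback from option b).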

allenwang28 · 2025-10-08T22:35:46Z

could you add some more comments in the description about what this PR is enabling?

felipemello1 · 2025-10-08T22:37:35Z

could you add some more comments in the description about what this PR is enabling?

Yes, sorry! I should have marked this as a draft. I am doing a 2.5/4.0 before I ask you to review it.

apps/grpo/qwen3_1_7b.yaml

felipemello1 · 2025-10-09T03:22:41Z

src/forge/observability/metrics.py

            self.timestamp = datetime.now(pytz.UTC).timestamp()


-def get_actor_name_with_rank() -> str:


moved to observability/utils.py

felipemello1 · 2025-10-09T20:16:10Z

src/forge/observability/metric_actors.py

+    if process_name is None:
+        process_name = detect_actor_name_from_call_stack()


Get the name here and pass it to:

local_fetcher_actor = proc.spawn("local_fetcher_actor", LocalFetcherActor, global_logger, process_name)

This function is called in provisioner.py, and that's how we get the process_name for every wandb run.
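
The hand-off in this comment can be sketched roughly as follows. Everything here is a simplified, synchronous stand-in: the real `proc.spawn`, `LocalFetcherActor`, and `MetricCollector` live in Monarch and Forge, while `FakeBackend` and `resolve_process_name` are hypothetical names used only to show how the resolved name could reach a backend.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


def resolve_process_name(
    process_name: Optional[str],
    detector: Callable[[], str],
) -> str:
    """Prefer the caller-supplied name; otherwise fall back to detection."""
    return process_name if process_name is not None else detector()


@dataclass
class FakeBackend:
    """Hypothetical backend; a real one might call wandb.init(name=...) here."""

    run_name: Optional[str] = None

    def init(self, process_name: str) -> None:
        self.run_name = process_name


@dataclass
class MetricCollector:
    """Stand-in collector that forwards the process name to each backend."""

    backends: list = field(default_factory=list)

    def init_backends(self, process_name: str) -> None:
        for backend in self.backends:
            backend.init(process_name)


@dataclass
class LocalFetcherActor:
    """Stand-in for the per-process actor that owns the collector."""

    process_name: str
    collector: MetricCollector

    def init_backends(self) -> None:
        self.collector.init_backends(self.process_name)


# Resolve the name once at spawn time, then let it flow down to the backend.
backend = FakeBackend()
name = resolve_process_name(None, detector=lambda: "TrainerActor")
actor = LocalFetcherActor(name, MetricCollector([backend]))
actor.init_backends()
```

Resolving the name once at spawn time and passing it down explicitly means the backends never need to inspect the actor context themselves, which is what returned local_fetcher_actor before.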

felipemello1 · 2025-10-09T20:16:26Z

src/forge/observability/utils.py

+logger = logging.getLogger(__name__)
+
+
+def detect_actor_name_from_call_stack() -> str:


main file to review

felipemello1 · 2025-10-10T22:54:36Z

In the near future the mesh might hold a name. When this happens, we can delete the call-stack function and just get the name from the mesh. The rest of the PR stands.

Felipe Mello added 11 commits October 8, 2025 08:38

commit

77488cf

commit

feb4771

update backend role typehints and enum

41ceaa4

update where we check FORGE_DISABLE_METRICS

8a24e71

remove protected import

3f3bc51

Merge branch 'timestamp_logging_diff1' into timestamp_logging_diff2

d82c354

protect import

4fe2611

Merge branch 'timestamp_logging_diff1' into timestamp_logging_diff2

8759bc8

Merge branch 'main' of https://github.com/meta-pytorch/forge into tim…

fbb4a9e

…estamp_logging_diff2

record_metric uses dataclass Metric

d81a4ed

commit

1e2255d

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 8, 2025

Merge branch 'main' of https://github.com/meta-pytorch/forge into tim…

a94c612

…estamp_logging_diff3

felipemello1 marked this pull request as draft October 8, 2025 22:37

felipemello1 changed the title ~~Metric Logging updates 3/4~~ [wip] Metric Logging updates 3/4 Oct 8, 2025

Felipe Mello added 4 commits October 8, 2025 19:03

commit

5b477e8

commit

f2b3eed

revert

471b88a

Merge branch 'timestamp_logging_diff2_5' into timestamp_logging_diff3

1a02784

felipemello1 changed the title ~~[wip] Metric Logging updates 3/4~~ Metric Logging updates 4/N Oct 9, 2025

felipemello1 commented Oct 9, 2025

apps/grpo/qwen3_1_7b.yaml (resolved)

remove unnecessary code

fa4895f

felipemello1 commented Oct 9, 2025

better logging

7bb1fe7

felipemello1 marked this pull request as ready for review October 9, 2025 03:27

Felipe Mello added 3 commits October 9, 2025 07:23

docs/names

43d5d27

Merge branch 'timestamp_logging_diff2_5' into timestamp_logging_diff3

c97eb98

Merge branch 'main' of https://github.com/meta-pytorch/forge into tim…

70e9c67

…estamp_logging_diff3

felipemello1 requested a review from allenwang28 October 9, 2025 19:52

felipemello1 commented Oct 9, 2025

update cfg back to true

1186aec

felipemello1 assigned joecummings and ebsmothers Oct 10, 2025
