Metric Logging updates 4/N #351
…estamp_logging_diff2
…estamp_logging_diff3
Could you add some more comments in the description about what this PR is enabling?
Yes, sorry! I should have marked this as a draft. I am doing a 2.5/4.0 before I ask you to review it.
self.timestamp = datetime.now(pytz.UTC).timestamp()


def get_actor_name_with_rank() -> str:
moved to observability/utils.py
if process_name is None:
    process_name = detect_actor_name_from_call_stack()
Get the name here and pass it to:
local_fetcher_actor = proc.spawn(
    "local_fetcher_actor", LocalFetcherActor, global_logger, process_name
)
This function is called in provisioner.py, and that's how we get the process_name for every wandb run.
logger = logging.getLogger(__name__)


def detect_actor_name_from_call_stack() -> str:
main file to review
In the near future the mesh might hold a name. When this happens, we can delete the function that uses the call stack and just get the name from the mesh. The rest of the PR stands.
When logging per rank, we need good naming to make debugging easier. I was previously trying to fetch this name from `monarch.actor.context`, but since `MetricCollector.init_backends` is called in the context of the `LocalFetcherActor`, I would just get the name `local_fetcher_actor`.

To solve this, I use the call stack instead and call it when the actor is spawning: `provisioner.py` -> `get_or_create_metric_logger` -> `detect_actor_name_from_call_stack()`. The utility then walks back up the stack until it finds a `ForgeActor` subclass and gets its name. This gets saved as the `process_name`, which is used by the metric logging backends.

Users can also pass a process name as input. That's how we get the `Controller` name.

The `process_name` then goes: `LocalFetcherActor` -> `MetricCollector` -> `backend.init` -> `wandb.init(name)`

It does feel a bit brittle to traverse until finding a `ForgeActor`. But worst case scenario:
a) the user can pass the process name
b) if we cannot find any, then it's just `UnknownActor`
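
For reviewers who want the idea without opening the diff, here is a rough sketch of the call-stack lookup described above. It is not the code in this PR: the `ForgeActor` class below is a stand-in, `resolve_process_name` is a hypothetical helper name, and looking for a `self` local in each frame is an assumption about how the traversal could work.

```python
import inspect
from typing import Optional


class ForgeActor:
    """Stand-in for the real ForgeActor base class (assumption for this sketch)."""


def detect_actor_name_from_call_stack(default: str = "UnknownActor") -> str:
    """Walk up the call stack and return the class name of the first
    ForgeActor instance found as `self` in a caller's frame."""
    frame = inspect.currentframe()
    try:
        # Skip this function's own frame and start from the caller.
        frame = frame.f_back if frame is not None else None
        while frame is not None:
            candidate = frame.f_locals.get("self")
            if isinstance(candidate, ForgeActor):
                return type(candidate).__name__
            frame = frame.f_back
        # Fallback (b): no ForgeActor anywhere in the stack.
        return default
    finally:
        # Drop the frame reference to avoid reference cycles.
        del frame


def resolve_process_name(process_name: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit name wins (a), else use the call stack."""
    if process_name is None:
        process_name = detect_actor_name_from_call_stack()
    return process_name


class Trainer(ForgeActor):
    """Example actor; calling from inside a method puts `self` on the stack."""

    def setup(self) -> str:
        return resolve_process_name()
```

Calling `Trainer().setup()` resolves the name from the stack, `resolve_process_name("Controller")` shows the explicit override, and a call made outside any actor falls back to `UnknownActor`.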