@DNXie DNXie commented Oct 9, 2025

Context: #360

This PR adds automatic allocation tracking for all ForgeActor and ServiceInterface instances via the global Provisioner.
Actors and services are now automatically registered with the provisioner when spawned through .as_actor() or .as_service(), enabling unified lifecycle management and eliminating the need for manual shutdown calls.

Changes

  • ForgeActor.as_service()
    Registers the top-level ServiceInterface proxy (instead of raw ActorMesh) after initialization.
  • ForgeActor.as_actor()
    Registers the spawned ForgeActor proxy after successful setup.
  • Provisioner.shutdown_all_allocations()
    Handles both ForgeActor and ServiceInterface cleanly; internal ActorMesh handles are no longer registered.
  • Improved logging for spawn and teardown events.

Result

All actors and services are now automatically tracked and can be shut down gracefully with a single call.

Before

await policy.shutdown()
await ReplayBuffer.shutdown(...)
...
await shutdown()

After

await shutdown()

Test:

python -m tests.sandbox.toy_rl.sumdigits --config tests/sandbox/toy_rl/sumdigits.yaml

The log after Ctrl+C (cleaned up):

...
loss/training_step: -0.6666666865348816 at training step 1
^CShutting down...
Shutting down actor DatasetActor
Shutting down service Policy
Shutting down actor Trainer
Shutting down actor ReplayBuffer
Shutting down service RewardActor
Shutting down service RefModel
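One way a script could guarantee that the single shutdown call runs on Ctrl+C is a `try`/`finally` around the training loop. The sketch below is illustrative only: `main` and this `shutdown` are hypothetical stand-ins, and the actual sumdigits script may wire this differently.

```python
import asyncio


async def shutdown(log: list) -> None:
    # Stand-in for the centralized shutdown; the real one tears down
    # every tracked actor and service.
    log.append("Shutting down...")


async def main(training_steps: int, log: list) -> None:
    try:
        for step in range(training_steps):
            await asyncio.sleep(0)  # yield to the event loop, like real async work
            log.append(f"step {step}")
    finally:
        # Runs on normal exit and when the task is cancelled
        # (asyncio.run cancels the main task on Ctrl+C in Python 3.11+).
        await shutdown(log)
```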

TODO:

Merge

await mlogger.shutdown.call_one()

into the centralized shutdown.

This will be done once the Monarch logger error (#360 (3)) is fixed; fixing it is out of scope for this PR.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 9, 2025
@DNXie DNXie changed the title Auto-track and globally shut down all Forge actors and services [WIP] Auto-track and globally shut down all Forge actors and services Oct 9, 2025
This method is used by `Service` to teardown a replica.
"""
if not quiet:
    logger.info(f"Shutting down actor {getattr(actor, 'name', cls.__name__)}")
Member Author

I added this quiet check because otherwise shutting down a service would also call actor.shutdown and print the log twice.
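A minimal sketch of the double-logging problem and how a `quiet` flag avoids it. `ToyActor` and `ToyService` are hypothetical; only the `quiet` parameter and the log message format come from the diff above.

```python
import asyncio
import logging

logger = logging.getLogger("toy_forge")


class ToyActor:
    def __init__(self, name: str) -> None:
        self.name = name
        self.stopped = False

    @classmethod
    async def shutdown(cls, actor: "ToyActor", quiet: bool = False) -> None:
        if not quiet:
            logger.info(f"Shutting down actor {getattr(actor, 'name', cls.__name__)}")
        actor.stopped = True


class ToyService:
    def __init__(self, name: str, replicas: list) -> None:
        self.name = name
        self.replicas = replicas

    async def shutdown(self) -> None:
        # The service logs once for itself...
        logger.info(f"Shutting down service {self.name}")
        for replica in self.replicas:
            # ...and silences the per-replica actor message.
            await ToyActor.shutdown(replica, quiet=True)
```

Without `quiet=True`, each replica teardown would emit its own "Shutting down actor ..." line in addition to the service-level message.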

@DNXie DNXie changed the title [WIP] Auto-track and globally shut down all Forge actors and services Auto-track and globally shut down all Forge actors and services Oct 9, 2025
@DNXie DNXie marked this pull request as ready for review October 9, 2025 19:18

async def track_allocation(self, alloc: Any):
    """Tracks an allocation for cleanup."""
    self._allocations.append(alloc)
Contributor

hmmm, I think an even simpler approach is to just track the proc meshes right? We can just do await proc_mesh.stop() and I think everything inside of it should shut down neatly. Let me know if that doesn't work though

Member Author

I think it would work for actors, since shutting down an actor is essentially stopping its proc_mesh.
But for a service, shutdown involves other operations, such as stopping the replicas and the health loop.
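To illustrate the distinction, here is a rough sketch of why a service needs more than a proc mesh stop. All classes are hypothetical toys; the real Service internals are not shown in this thread, so the background task and replica layout here are assumptions.

```python
import asyncio


class ToyProcMesh:
    """Hypothetical proc mesh with a stop() coroutine."""

    def __init__(self) -> None:
        self.stopped = False

    async def stop(self) -> None:
        self.stopped = True


class ToyService:
    def __init__(self, num_replicas: int) -> None:
        self.replicas = [ToyProcMesh() for _ in range(num_replicas)]
        self._health_task = None

    async def start(self) -> None:
        # Background task that lives outside any proc mesh.
        self._health_task = asyncio.create_task(self._health_loop())

    async def _health_loop(self) -> None:
        while True:
            await asyncio.sleep(0.01)  # placeholder for replica health checks

    async def shutdown(self) -> None:
        # Step 1: cancel the health loop so it stops touching replicas.
        if self._health_task is not None:
            self._health_task.cancel()
            try:
                await self._health_task
            except asyncio.CancelledError:
                pass
        # Step 2: stop each replica's proc mesh.
        for mesh in self.replicas:
            await mesh.stop()


async def demo() -> ToyService:
    svc = ToyService(2)
    await svc.start()
    await asyncio.sleep(0.02)
    await svc.shutdown()
    return svc
```

In this toy model, stopping only the proc meshes would leave the health loop task running in the controller process, which is why tracking the service handle itself gives a cleaner teardown.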

@DNXie DNXie requested a review from allenwang28 October 10, 2025 23:45