refactor(archon): extract utility functions and simplify engine code by rchardx · Pull Request #954 · inclusionAI/AReaL

rchardx · 2026-03-02T08:54:22Z

Description

Extract six utility functions from ArchonEngine into a new archon_utils.py module for improved reuse and testability. Cache TP/CP parallel groups on initialization to eliminate repeated lookups in hot paths. Add context manager protocol to DistributedLock and convert manual acquire/release patterns to with statements for safer lock handling.

Related Issue

N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code refactoring (no functional changes)
Performance improvement
Test coverage improvement

Checklist

I have read the Contributing Guide
I have run formatting tools (pre-commit or manual)
I have run relevant unit tests and they pass
I have added tests for new functionality
I have updated documentation if needed
My branch is up to date with main
This PR introduces breaking changes (if yes, fill out details below)
If this PR changes documentation, I have built and previewed it locally with jb build docs
No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

Key changes:

archon_utils.py (new): create_optimizer, create_lr_scheduler, build_ac_config, validate_zero_bubble_compatibility, setup_deterministic_mode, validate_and_force_pad_to_maximum
archon_engine.py: Delegates to extracted utilities; caches _tp_group and _cp_group to avoid repeated parallel_dims.get_group() calls
archon_weight_sync.py: Replaces manual acquire()/release() with with engine.engine_lock: context manager
archon_checkpoint.py / archon_runner.py: Minor expression inlining
lock.py: Adds __enter__/__exit__ to DistributedLock
installation.md: Documents venv activation requirement

All changes are strictly behavioral no-ops — no logic or control flow modifications.
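
The lock change listed above can be sketched as follows. This is a minimal illustration, not the code in lock.py: `SketchLock` and `update_weights` are hypothetical names, and a local `threading.Lock` stands in for whatever distributed primitive DistributedLock actually wraps.

```python
import threading


class SketchLock:
    """Hypothetical stand-in for DistributedLock; wraps a local threading.Lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.held = False

    def acquire(self):
        self._lock.acquire()
        self.held = True

    def release(self):
        self.held = False
        self._lock.release()

    def __enter__(self):
        # Entering the with block acquires the lock.
        self.acquire()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the with block releases the lock, whether or not the body raised.
        self.release()
        return False  # do not suppress exceptions from the body


def update_weights(lock, fail=False):
    """Hypothetical critical section guarded by the lock."""
    with lock:
        if fail:
            raise RuntimeError("simulated failure inside the critical section")
        return "updated"
```

With this shape, a caller writes `with engine.engine_lock:` instead of pairing `acquire()` and `release()` by hand, and the release no longer depends on every exit path remembering to call it.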

gemini-code-assist · 2026-03-02T08:54:42Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the maintainability and robustness of the Archon engine through significant code refactoring. It extracts complex logic into dedicated utility functions, streamlines the handling of parallel process groups by caching them, and adopts a safer, more idiomatic approach to distributed lock management using Python's context manager protocol. These changes are purely structural, ensuring no alteration to the existing behavior or functionality of the system.

Highlights

Utility Function Extraction: Six core utility functions related to optimizer creation, LR scheduling, activation checkpoint configuration, zero-bubble compatibility validation, deterministic mode setup, and pad_to_maximum validation have been extracted from ArchonEngine into a new archon_utils.py module. This significantly simplifies the engine's codebase and improves modularity and testability.
Parallel Group Caching: Tensor Parallel (TP) and Context Parallel (CP) process groups are now cached upon ArchonEngine initialization. This eliminates redundant lookups in performance-critical paths, leading to minor efficiency gains.
Safer Distributed Lock Handling: The DistributedLock class now implements the context manager protocol (__enter__ and __exit__ methods). This allows it to be used in with statements, so the lock is acquired on entry and released on exit, including when the body raises an exception, which prevents a lock from being left held after an error.
Code Simplification and Inlining: Minor refactorings were performed across archon_checkpoint.py and archon_runner.py to inline temporary variables and simplify expressions, enhancing readability without altering functionality.
Documentation Update: The installation guide has been updated to explicitly mention the requirement and method for activating the virtual environment after installation, clarifying usage for pre-commit and git commit.
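
The parallel group caching highlight can be pictured with a small sketch. Everything here is an assumption made for illustration: `SketchParallelDims`, `SketchEngine`, `train_step`, and the string argument passed to `get_group()` are hypothetical, and the sketch presumes the groups do not change after initialization.

```python
class SketchParallelDims:
    """Hypothetical stand-in that counts how often get_group() is called."""

    def __init__(self):
        self.lookups = 0
        self._groups = {"tp": object(), "cp": object()}

    def get_group(self, name):
        self.lookups += 1
        return self._groups[name]


class SketchEngine:
    """Hypothetical engine that resolves its groups once and reuses them."""

    def __init__(self, parallel_dims):
        self.parallel_dims = parallel_dims
        # Resolve once at initialization; later steps reuse these attributes.
        self._tp_group = parallel_dims.get_group("tp")
        self._cp_group = parallel_dims.get_group("cp")

    def train_step(self):
        # Hot path: reads the cached attributes instead of calling get_group().
        return self._tp_group, self._cp_group
```

In this sketch the lookup count stays at two no matter how many times `train_step()` runs, which is the effect the highlight describes.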

Changelog

areal/experimental/engine/archon_checkpoint.py
- Inlined cpu_offload variable into StateDictOptions constructor.
- Inlined output_tensor variable into torch.empty_like call.
- Inlined hf_reader variable directly into dcp.load call.
areal/experimental/engine/archon_engine.py
- Removed unused imports related to pipeline schedules and LR schedulers.
- Inlined model_type variable directly into get_model_spec call.
- Added caching for _tp_group and _cp_group process groups during initialization.
- Replaced extensive inline logic with calls to new utility functions for zero-bubble compatibility, deterministic mode setup, and pad_to_maximum validation.
- Delegated optimizer and LR scheduler creation to new utility functions.
- Updated fsdp2_clip_grad_norm and ulysses_gather_output calls to use cached _tp_group and _cp_group.
- Inlined pp_group and schedule_class variables in _apply_pipeline_parallelism.
- Delegated activation checkpoint configuration to a new utility function.
- Inlined model variable in _create_model_structure.
areal/experimental/engine/archon_runner.py
- Inlined ctx_dict variable directly into process_output_fn calls in run, pp_loss_fn, and _process_outputs.
areal/experimental/engine/archon_utils.py
- Added new file containing create_optimizer function.
- Added create_lr_scheduler function.
- Added build_ac_config function.
- Added validate_zero_bubble_compatibility function.
- Added setup_deterministic_mode function.
- Added validate_and_force_pad_to_maximum function.
areal/experimental/engine/archon_weight_sync.py
- Refactored init_weight_update_group to use with engine.engine_lock: for safer lock handling.
- Refactored _update_bucket_weights to use with engine_lock: for safer lock handling.
- Inlined temp_dtensor variable into DTensor.from_local call.
areal/utils/lock.py
- Implemented __enter__ method for DistributedLock to acquire the lock.
- Implemented __exit__ method for DistributedLock to release the lock.
docs/tutorial/installation.md
- Added a new section detailing virtual environment activation requirements and methods.
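
The extraction pattern behind the archon_utils.py entries can be illustrated with a generic sketch. It does not reproduce any of the six functions named in the changelog: `resolve_warmup_steps`, `SketchTrainer`, and their parameters are hypothetical and were chosen only to show how moving logic into a module-level function lets it be tested without constructing the engine.

```python
def resolve_warmup_steps(total_steps: int, warmup_proportion: float) -> int:
    """Hypothetical helper: validate the inputs and return the warmup step count."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= warmup_proportion <= 1.0:
        raise ValueError("warmup_proportion must be in [0, 1]")
    return int(total_steps * warmup_proportion)


class SketchTrainer:
    """Hypothetical engine-like class that delegates instead of embedding the logic."""

    def __init__(self, total_steps: int, warmup_proportion: float):
        # The class keeps only the call; validation lives in the free function.
        self.warmup_steps = resolve_warmup_steps(total_steps, warmup_proportion)
```

A unit test can call `resolve_warmup_steps(1000, 0.25)` directly, with no model, process group, or engine instance involved.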

Activity

No human activity has been recorded on this pull request yet.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a significant and well-executed refactoring of the Archon engine. Key improvements include extracting complex setup logic from ArchonEngine into a new archon_utils.py module, which greatly enhances modularity and readability. The caching of Tensor Parallel and Context Parallel groups is a good performance optimization that avoids repeated lookups in hot paths. Furthermore, the adoption of context managers for DistributedLock makes resource handling safer and more idiomatic. The minor inlining and documentation updates are also positive changes. Overall, this is a high-quality refactoring with no issues found.

Move optimizer/scheduler creation, activation checkpoint config, zero-bubble validation, deterministic mode setup, and pad_to_maximum validation into archon_utils.py for reuse and testability. Cache tp/cp parallel groups to avoid repeated lookups, and use context managers for DistributedLock.

Key changes:
- Extract 6 utility functions into new archon_utils.py module
- Cache _tp_group and _cp_group on engine initialization
- Add __enter__/__exit__ to DistributedLock for context manager usage
- Replace manual lock acquire/release with `with` statements
- Add venv activation note to installation docs

garrett4wade

LGTM

gemini-code-assist Bot reviewed Mar 2, 2026

rchardx added the safe-to-test Ready to run unit-tests in a PR. label Mar 2, 2026

rchardx requested review from fishcrap, garrett4wade and nuzant March 2, 2026 08:59

rchardx had a problem deploying to AReaL-unittests March 2, 2026 09:07 — with GitHub Actions Error

rchardx force-pushed the rchardx/cleanup branch from 615e426 to 5e2e602 Compare March 2, 2026 09:27

rchardx added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Mar 2, 2026

rchardx had a problem deploying to AReaL-unittests March 2, 2026 09:33 — with GitHub Actions Failure

rchardx force-pushed the rchardx/cleanup branch from 5e2e602 to 45725b9 Compare March 2, 2026 10:42

rchardx force-pushed the rchardx/cleanup branch from 45725b9 to e548ebb Compare March 2, 2026 11:36

rchardx added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Mar 2, 2026

rchardx temporarily deployed to AReaL-unittests March 2, 2026 11:43 — with GitHub Actions Inactive

garrett4wade approved these changes Mar 2, 2026

garrett4wade merged commit c26bea9 into main Mar 2, 2026
8 checks passed

garrett4wade deleted the rchardx/cleanup branch March 2, 2026 14:44
