refactor(archon): extract utility functions and simplify engine code #954

Merged: garrett4wade merged 1 commit into main from rchardx/cleanup on Mar 2, 2026

@rchardx (Collaborator) commented Mar 2, 2026

Description

Extract six utility functions from ArchonEngine into a new archon_utils.py module for improved reuse and testability. Cache TP/CP parallel groups on initialization to eliminate repeated lookups in hot paths. Add context manager protocol to DistributedLock and convert manual acquire/release patterns to with statements for safer lock handling.

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

Key changes:

  • archon_utils.py (new): create_optimizer, create_lr_scheduler, build_ac_config, validate_zero_bubble_compatibility, setup_deterministic_mode, validate_and_force_pad_to_maximum
  • archon_engine.py: Delegates to extracted utilities; caches _tp_group and _cp_group to avoid repeated parallel_dims.get_group() calls
  • archon_weight_sync.py: Replaces manual acquire()/release() with with engine.engine_lock: context manager
  • archon_checkpoint.py / archon_runner.py: Minor expression inlining
  • lock.py: Adds __enter__/__exit__ to DistributedLock
  • installation.md: Documents venv activation requirement

All changes are strictly behavioral no-ops — no logic or control flow modifications.
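
For readers unfamiliar with the pattern, the lock change can be sketched roughly as follows. This is a minimal illustration, not the code in lock.py: `SketchLock`, `update_manual`, and `update_with` are hypothetical names, and the sketch is backed by `threading.Lock`, whereas the real DistributedLock coordinates across processes.

```python
import threading


class SketchLock:
    """Hypothetical stand-in for a lock exposing acquire()/release()."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def locked(self):
        return self._lock.locked()

    def __enter__(self):
        # Entering the `with` block acquires the lock.
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the block releases it, even if the body raised.
        self.release()
        return False  # never suppress the exception


def update_manual(lock, work):
    # Before: acquire/release paired by hand.
    lock.acquire()
    try:
        return work()
    finally:
        lock.release()


def update_with(lock, work):
    # After: the context manager handles the pairing.
    with lock:
        return work()
```

Both helpers behave the same when `work()` succeeds or raises; the `with` form simply removes the chance of forgetting the release on an error path.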

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the maintainability and robustness of the Archon engine through significant code refactoring. It extracts complex logic into dedicated utility functions, streamlines the handling of parallel process groups by caching them, and adopts a safer, more idiomatic approach to distributed lock management using Python's context manager protocol. These changes are purely structural, ensuring no alteration to the existing behavior or functionality of the system.

Highlights

  • Utility Function Extraction: Six core utility functions related to optimizer creation, LR scheduling, activation checkpoint configuration, zero-bubble compatibility validation, deterministic mode setup, and pad_to_maximum validation have been extracted from ArchonEngine into a new archon_utils.py module. This significantly simplifies the engine's codebase and improves modularity and testability.
  • Parallel Group Caching: Tensor Parallel (TP) and Context Parallel (CP) process groups are now cached upon ArchonEngine initialization. This eliminates redundant lookups in performance-critical paths, leading to minor efficiency gains.
  • Safer Distributed Lock Handling: The DistributedLock class now implements the context manager protocol (__enter__ and __exit__ methods). This allows for its use with with statements, ensuring that locks are automatically acquired and released, preventing potential deadlocks or resource leaks.
  • Code Simplification and Inlining: Minor refactorings were performed across archon_checkpoint.py and archon_runner.py to inline temporary variables and simplify expressions, enhancing readability without altering functionality.
  • Documentation Update: The installation guide has been updated to explicitly mention the requirement and method for activating the virtual environment after installation, clarifying usage for pre-commit and git commit.
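
The group-caching highlight can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SketchParallelDims` and `SketchEngine` are hypothetical stand-ins, the string keys "tp" and "cp" are invented, and plain objects take the place of real process groups. Only the idea of resolving `_tp_group` and `_cp_group` once at initialization comes from the PR description.

```python
class SketchParallelDims:
    """Hypothetical registry that counts how often get_group() is called."""

    def __init__(self):
        self.lookups = 0
        self._groups = {"tp": object(), "cp": object()}

    def get_group(self, name):
        self.lookups += 1
        return self._groups[name]


class SketchEngine:
    def __init__(self, parallel_dims):
        self.parallel_dims = parallel_dims
        # Resolve each group once at initialization and keep the handle.
        self._tp_group = parallel_dims.get_group("tp")
        self._cp_group = parallel_dims.get_group("cp")

    def step(self):
        # Hot path: read cached attributes instead of calling get_group().
        return self._tp_group, self._cp_group
```

Caching like this is only safe if the groups do not change after initialization, which the sketch simply assumes.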

Changelog
  • areal/experimental/engine/archon_checkpoint.py
    • Inlined cpu_offload variable into StateDictOptions constructor.
    • Inlined output_tensor variable into torch.empty_like call.
    • Inlined hf_reader variable directly into dcp.load call.
  • areal/experimental/engine/archon_engine.py
    • Removed unused imports related to pipeline schedules and LR schedulers.
    • Inlined model_type variable directly into get_model_spec call.
    • Added caching for _tp_group and _cp_group process groups during initialization.
    • Replaced extensive inline logic with calls to new utility functions for zero-bubble compatibility, deterministic mode setup, and pad_to_maximum validation.
    • Delegated optimizer and LR scheduler creation to new utility functions.
    • Updated fsdp2_clip_grad_norm and ulysses_gather_output calls to use cached _tp_group and _cp_group.
    • Inlined pp_group and schedule_class variables in _apply_pipeline_parallelism.
    • Delegated activation checkpoint configuration to a new utility function.
    • Inlined model variable in _create_model_structure.
  • areal/experimental/engine/archon_runner.py
    • Inlined ctx_dict variable directly into process_output_fn calls in run, pp_loss_fn, and _process_outputs.
  • areal/experimental/engine/archon_utils.py
    • Added new file containing create_optimizer function.
    • Added create_lr_scheduler function.
    • Added build_ac_config function.
    • Added validate_zero_bubble_compatibility function.
    • Added setup_deterministic_mode function.
    • Added validate_and_force_pad_to_maximum function.
  • areal/experimental/engine/archon_weight_sync.py
    • Refactored init_weight_update_group to use with engine.engine_lock: for safer lock handling.
    • Refactored _update_bucket_weights to use with engine_lock: for safer lock handling.
    • Inlined temp_dtensor variable into DTensor.from_local call.
  • areal/utils/lock.py
    • Implemented __enter__ method for DistributedLock to acquire the lock.
    • Implemented __exit__ method for DistributedLock to release the lock.
  • docs/tutorial/installation.md
    • Added a new section detailing virtual environment activation requirements and methods.
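
As a rough illustration of why moving logic out of the engine helps testability, the sketch below extracts a small piece of scheduling logic into a module-level function. The names `SketchConfig`, `warmup_lr`, and `SketchEngine` are hypothetical, and the linear-warmup rule is an invented example; none of this reflects the actual signatures or behavior of the functions in archon_utils.py.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    base_lr: float
    warmup_steps: int


def warmup_lr(config: SketchConfig, step: int) -> float:
    """Module-level helper: linear warmup, then constant base_lr."""
    if config.warmup_steps <= 0 or step >= config.warmup_steps:
        return config.base_lr
    return config.base_lr * (step + 1) / config.warmup_steps


class SketchEngine:
    def __init__(self, config: SketchConfig):
        self.config = config

    def current_lr(self, step: int) -> float:
        # The engine only delegates; the logic lives in the helper.
        return warmup_lr(self.config, step)
```

Because `warmup_lr` takes only a config and a step, a unit test can call it directly without constructing an engine.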
Activity
  • No human activity has been recorded on this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces a significant and well-executed refactoring of the Archon engine. Key improvements include extracting complex setup logic from ArchonEngine into a new archon_utils.py module, which greatly enhances modularity and readability. The caching of Tensor Parallel and Context Parallel groups is a good performance optimization that avoids repeated lookups in hot paths. Furthermore, the adoption of context managers for DistributedLock makes resource handling safer and more idiomatic. The minor inlining and documentation updates are also positive changes. Overall, this is a high-quality refactoring with no issues found.

@rchardx added the safe-to-test label (Ready to run unit-tests in a PR.) Mar 2, 2026
@rchardx added and removed the safe-to-test label (Ready to run unit-tests in a PR.) Mar 2, 2026
Move optimizer/scheduler creation, activation checkpoint config,
zero-bubble validation, deterministic mode setup, and pad_to_maximum
validation into archon_utils.py for reuse and testability. Cache
tp/cp parallel groups to avoid repeated lookups, and use context
managers for DistributedLock.

Key changes:
- Extract 6 utility functions into new archon_utils.py module
- Cache _tp_group and _cp_group on engine initialization
- Add __enter__/__exit__ to DistributedLock for context manager usage
- Replace manual lock acquire/release with `with` statements
- Add venv activation note to installation docs
@rchardx added and removed the safe-to-test label (Ready to run unit-tests in a PR.) Mar 2, 2026
@rchardx rchardx temporarily deployed to AReaL-unittests March 2, 2026 11:43 — with GitHub Actions Inactive
@garrett4wade (Collaborator) left a comment

LGTM

@garrett4wade garrett4wade merged commit c26bea9 into main Mar 2, 2026
8 checks passed
@garrett4wade garrett4wade deleted the rchardx/cleanup branch March 2, 2026 14:44
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
…nclusionAI#954)

SathyaGnanakumar pushed a commit to danielkiely/AReaL that referenced this pull request Apr 29, 2026
…nclusionAI#954)

Labels

safe-to-test Ready to run unit-tests in a PR.

2 participants