Conversation

@jlamypoirier (Collaborator) commented Sep 17, 2025

✨ Description

Major feature: Revisit model configuration.

  • Revisit config hierarchy (see the config sketch after this list). The main components are now:
    • mixer: Dynamic configuration for a mixer. Available: attention (default), mamba, mamba_2, discrete_mamba_2.
    • mlp: Dynamic configuration for an MLP. Available: mlp (default), moe.
    • block: Configuration for a transformer (or other) block. Available: decoder (default, a standard transformer-decoder-style block with a mixer and an MLP).
    • block_sequence: Configuration for a sequence of blocks, e.g. a transformer decoder. Available: fixed (default), pattern.
    • embeddings_layer / output_layer: New modular configuration for the language model embeddings and head, extracted from the language model config into modular (but not yet dynamic) components.
    • base_model: For the GPT model, this consists of an embeddings_layer, a dynamic block_sequence, and a model_head.
  • Standardize and modularize parameter configuration with ParameterConfig. Parameter (meta) creation now goes through ParameterConfig.get_parameter. (Block interface: parameter and linear config, separate SSM config. #358)
  • Add modular configuration for linear layers. (Block interface: parameter and linear config, separate SSM config. #358)
  • Standardize and improve flexibility for initialization by adding a dynamic initialization config to ParameterConfig, with defaults set by the parent layer. Remove most pre-existing initialization config fields as they are no longer needed. (Block interface: rework LM config, fine-grained initialization, lr_scale, peft #360)
  • Standardize lr scale configuration with a lr_scale field in ParameterConfig and most layer configs. The resulting lr scale for a given parameter is the product of its own and those of all its parent layers (see the lr-scale sketch after this list). (Block interface: rework LM config, fine-grained initialization, lr_scale, peft #360)
  • Rework PEFT. Remove the specialization for transformers, which was made to select which layers to enable by name. Instead, use the apply_peft field of linear configs, with the default set by the parent layer (see the defaulting sketch after this list). (Block interface: rework LM config, fine-grained initialization, lr_scale, peft #360)
  • Rework bias enabling configuration. Each (affine) linear layer config has its own enabling option, with the default set by the parent layer. For mixers and MLPs, this default is configurable with add_linear_biases. (Mixers and MLPs now use a separate field, as in HF transformers.)
  • Adjust many field names to the new scheme.
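
To make the hierarchy concrete, here is a rough sketch of how a GPT base model configuration might compose under the new scheme, written as a plain Python dict. The nesting mirrors the components listed above, but the exact field names and defaults are assumptions, not the actual schema:

```python
# Illustrative only: the nesting follows the component list above; field names are guesses.
base_model_config = {
    "embeddings_layer": {"vocab_size": 32000, "hidden_size": 4096},
    "block_sequence": {
        "type": "fixed",  # or "pattern" for alternating block types
        "num_blocks": 32,
        "block": {
            "type": "decoder",
            "mixer": {"type": "attention"},  # or "mamba", "mamba_2", "discrete_mamba_2"
            "mlp": {"type": "mlp"},          # or "moe"
        },
    },
    "output_layer": {"tied_weight": True},
}
```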
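
A minimal Python sketch of the lr-scale rule described above: the effective scale of a parameter is the product of its own lr_scale and those of all enclosing layers, with None meaning "no scaling at this level". ParamConfig and resolve_lr_scale are hypothetical stand-ins, not the real ParameterConfig API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParamConfig:
    # Hypothetical stand-in for ParameterConfig; None means "no scaling at this level".
    lr_scale: Optional[float] = None


def resolve_lr_scale(param: ParamConfig, parent_scales: list) -> float:
    # Effective scale = product of the parameter's own scale and all parent layers' scales.
    scale = 1.0
    for s in (*parent_scales, param.lr_scale):
        if s is not None:
            scale *= s
    return scale


# block_sequence lr_scale=0.5, block lr_scale=None, parameter lr_scale=2.0 -> 0.5 * 2.0 = 1.0
assert resolve_lr_scale(ParamConfig(lr_scale=2.0), [0.5, None]) == 1.0
```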
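
Similarly, a sketch of the "default set by the parent layer" pattern used for both bias enabling and apply_peft (class and field names other than add_linear_biases and apply_peft are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LinearConfig:
    # Hypothetical linear-layer config: None defers to the parent layer's default.
    bias: Optional[bool] = None
    apply_peft: Optional[bool] = None


@dataclass
class MixerConfig:
    # Parent-level defaults, analogous to add_linear_biases and a parent PEFT default.
    add_linear_biases: bool = True
    apply_peft_default: bool = False
    query: LinearConfig = field(default_factory=LinearConfig)

    def bias_enabled(self, linear: LinearConfig) -> bool:
        return self.add_linear_biases if linear.bias is None else linear.bias

    def peft_enabled(self, linear: LinearConfig) -> bool:
        return self.apply_peft_default if linear.apply_peft is None else linear.apply_peft


mixer = MixerConfig(add_linear_biases=False, query=LinearConfig(bias=True))
assert mixer.bias_enabled(mixer.query) is True   # an explicit per-layer setting wins
assert mixer.peft_enabled(mixer.query) is False  # unset, so it falls back to the parent default
```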

Side feature: Modularize model components.

Since many components are now dynamic, the GPT base model is no longer able to keep track of all the internal details, so several hard-coded parts had to be replaced by modular methods that delegate to the appropriate components (see the delegation sketch after this list). This includes:

  • Preprocessor creation.
  • Loss definitions.
  • Block creation.
  • Compute usage estimation.
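
A minimal sketch of that delegation pattern, assuming a hypothetical component interface (these are not the actual Fast-LLM class or method names):

```python
import abc


class Component(abc.ABC):
    # Hypothetical interface each dynamic component (mixer, MLP, block, ...) would expose,
    # so the base model no longer needs hard-coded knowledge of its internals.
    @abc.abstractmethod
    def get_preprocessors(self) -> list: ...

    @abc.abstractmethod
    def get_loss_definitions(self) -> list: ...

    @abc.abstractmethod
    def get_compute_usage(self, tokens: int) -> int: ...


class GPTBaseModelSketch:
    def __init__(self, components: list):
        self.components = components

    def get_preprocessors(self) -> list:
        # Aggregate whatever each component declares instead of hard-coding the list.
        return [p for c in self.components for p in c.get_preprocessors()]

    def get_loss_definitions(self) -> list:
        return [d for c in self.components for d in c.get_loss_definitions()]

    def get_compute_usage(self, tokens: int) -> int:
        return sum(c.get_compute_usage(tokens) for c in self.components)
```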

Side feature: Revisit model conversion. (#362)

  • Modularize converters to mimic the modularity of configs and modules. Define a full set of converters for Llama, and use it as a basis for other models which override/extend the relevant components.
  • Convert configs (dicts) directly instead of going through complicated converter objects (see the sketch after this list).
  • Drop converters for starcoder2 and most SSMs, keeping only apriel_hybrid_ssm (previously apriel 15b), which supports both mamba 2 and discrete mamba 2.
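
As an illustration of the dict-to-dict style, a config converter now looks roughly like a plain mapping function. Neither the Fast-LLM keys nor the selection of Hugging Face keys below is taken from the actual code; this is a sketch of the shape only:

```python
def convert_llama_config_sketch(fast_llm_config: dict) -> dict:
    # Illustrative only: read fields from a (hypothetical) Fast-LLM config dict and
    # emit a Hugging Face-style dict directly, with no intermediate converter objects.
    block_sequence = fast_llm_config["base_model"]["block_sequence"]
    mixer = block_sequence["block"]["mixer"]
    return {
        "model_type": "llama",
        "num_hidden_layers": block_sequence["num_blocks"],
        "num_attention_heads": mixer["heads"],
        "intermediate_size": block_sequence["block"]["mlp"]["intermediate_size"],
    }
```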

Known issues / todo / notes

Critical:

  • These changes break backward compatibility for both models and configs. This should be manageable since HF conversion still enables some form of backward compatibility. However, we still need to explicitly prevent loading of old models, which could otherwise cause trouble. I suggest bumping the version to 0.3.0 so we have a clear distinction between formats.
  • test_huggingface_model fails for SSMs because of bugs in the external model (lack of support for DynamicCache?). This is a huge problem because this test is our main correctness test for SSMs.
  • Check whether we need the removed SSM converters (and associated external models) or any other removed feature.

Important but could postpone:

  • Update documentation and tutorial.
  • test_huggingface_model fails for Mixtral, probably a conversion issue.
  • Concatenated weights (e.g. key_value, in_proj) have separate configs but only use one of them.
  • Preprocessors are unsafe, as different blocks (e.g. with different rotary configs) may have different preprocessors writing different values to the same kwargs. This possibly applies to loss definitions too.

Minor:

  • Pipeline-parallel checkpoint conversion not working for tied weights. (not new, but found recently)
  • Configure experts independently? (The option to vary lr scale by expert was removed.)
  • Parallel tests are sometimes flaky due to crashes in distributed tests (NCCL?).

@tscholak (Collaborator) commented:

First impression is very good!

Base automatically changed from block_interface to main September 18, 2025 00:25
@jlamypoirier marked this pull request as ready for review September 18, 2025 00:59
@tscholak (Collaborator) left a comment:

LGTM!!!
