Conversation

@jlamypoirier (Collaborator) commented Sep 17, 2025

✨ Description

Major feature: Revisit model configuration.

  • Revisit config hierarchy (see the config sketch after this list). The main components are now:
    • mixer: Dynamic configuration for a mixer. Available: attention (default), mamba, mamba_2, discrete_mamba_2.
    • mlp: Dynamic configuration for an MLP. Available: mlp (default), moe.
    • block: Configuration for a transformer (or other) block. Available: decoder (default, a standard transformer-decoder-style block with a mixer and an MLP).
    • block_sequence: Configuration for a sequence of blocks, e.g. a transformer decoder. Available: fixed (default), pattern.
    • embeddings_layer / output_layer: New modular configuration for the language model embeddings and head, extracted from the language model config into modular (but not yet dynamic) components.
    • base_model: For the GPT model, this consists of an embeddings_layer, a dynamic block_sequence, and a model_head.
  • Standardize and modularize parameter configuration with ParameterConfig. Parameter (meta) creation now goes through ParameterConfig.get_parameter. (Block interface: parameter and linear config, separate SSM config. #358)
  • Add modular configuration for linear layers. (Block interface: parameter and linear config, separate SSM config. #358)
  • Standardize and improve flexibility for initialization by adding a dynamic initialization config to ParameterConfig, with defaults set by the parent layer. Remove most pre-existing initialization config fields as they are no longer needed. (Block interface: rework LM config, fine-grained initialization, lr_scale, peft #360)
  • Standardize lr scale configuration with a lr_scale field in ParameterConfig and most layer configs. The resulting lr scale for a given parameter is the product of its own and those of all its parent layers (see the lr-scale sketch after this list). (Block interface: rework LM config, fine-grained initialization, lr_scale, peft #360)
  • Rework PEFT. Remove the specialization for transformers, which was made to select which layers to enable by name. Instead, use the apply_peft field of linear configs, with the default set by the parent layer (see the defaulting sketch after this list). (Block interface: rework LM config, fine-grained initialization, lr_scale, peft #360)
  • Rework bias enabling configuration. Each (affine) linear layer config has its own enabling option, with the default set by the parent layer. For mixers and MLPs, this default is configurable with add_linear_biases. (Mixers and MLPs now use a separate field, as in HF transformers.)
  • Adjust many field names to the new scheme.
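
To make the hierarchy concrete, here is a rough sketch of how a GPT base model configuration might compose under the new scheme, written as a plain Python dict. The nesting mirrors the components listed above, but the exact field names and defaults are assumptions, not the actual schema:

```python
# Illustrative only: the nesting follows the component list above; field names are guesses.
base_model_config = {
    "embeddings_layer": {"vocab_size": 32000, "hidden_size": 4096},
    "block_sequence": {
        "type": "fixed",  # or "pattern" for alternating block types
        "num_blocks": 32,
        "block": {
            "type": "decoder",
            "mixer": {"type": "attention"},  # or "mamba", "mamba_2", "discrete_mamba_2"
            "mlp": {"type": "mlp"},          # or "moe"
        },
    },
    "output_layer": {"tied_weight": True},
}
```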
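
A minimal Python sketch of the lr-scale rule described above: the effective scale of a parameter is the product of its own lr_scale and those of all enclosing layers, with None meaning "no scaling at this level". ParamConfig and resolve_lr_scale are hypothetical stand-ins, not the real ParameterConfig API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParamConfig:
    # Hypothetical stand-in for ParameterConfig; None means "no scaling at this level".
    lr_scale: Optional[float] = None


def resolve_lr_scale(param: ParamConfig, parent_scales: list) -> float:
    # Effective scale = product of the parameter's own scale and all parent layers' scales.
    scale = 1.0
    for s in (*parent_scales, param.lr_scale):
        if s is not None:
            scale *= s
    return scale


# block_sequence lr_scale=0.5, block lr_scale=None, parameter lr_scale=2.0 -> 0.5 * 2.0 = 1.0
assert resolve_lr_scale(ParamConfig(lr_scale=2.0), [0.5, None]) == 1.0
```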
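
Similarly, a sketch of the "default set by the parent layer" pattern used for both bias enabling and apply_peft (class and field names other than add_linear_biases and apply_peft are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LinearConfig:
    # Hypothetical linear-layer config: None defers to the parent layer's default.
    bias: Optional[bool] = None
    apply_peft: Optional[bool] = None


@dataclass
class MixerConfig:
    # Parent-level defaults, analogous to add_linear_biases and a parent PEFT default.
    add_linear_biases: bool = True
    apply_peft_default: bool = False
    query: LinearConfig = field(default_factory=LinearConfig)

    def bias_enabled(self, linear: LinearConfig) -> bool:
        return self.add_linear_biases if linear.bias is None else linear.bias

    def peft_enabled(self, linear: LinearConfig) -> bool:
        return self.apply_peft_default if linear.apply_peft is None else linear.apply_peft


mixer = MixerConfig(add_linear_biases=False, query=LinearConfig(bias=True))
assert mixer.bias_enabled(mixer.query) is True   # an explicit per-layer setting wins
assert mixer.peft_enabled(mixer.query) is False  # unset, so it falls back to the parent default
```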

Side feature: Modularize model components.

Since many components are now dynamic, the GPT base model is no longer able to keep track of all the internal details, so several hard-coded parts had to be replaced by modular methods that delegate to the appropriate components (see the delegation sketch after this list). This includes:

  • Preprocessor creation.
  • Loss definitions.
  • Block creation.
  • Compute usage estimation.
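
A minimal sketch of that delegation pattern, assuming a hypothetical component interface (these are not the actual Fast-LLM class or method names):

```python
import abc


class Component(abc.ABC):
    # Hypothetical interface each dynamic component (mixer, MLP, block, ...) would expose,
    # so the base model no longer needs hard-coded knowledge of its internals.
    @abc.abstractmethod
    def get_preprocessors(self) -> list: ...

    @abc.abstractmethod
    def get_loss_definitions(self) -> list: ...

    @abc.abstractmethod
    def get_compute_usage(self, tokens: int) -> int: ...


class GPTBaseModelSketch:
    def __init__(self, components: list):
        self.components = components

    def get_preprocessors(self) -> list:
        # Aggregate whatever each component declares instead of hard-coding the list.
        return [p for c in self.components for p in c.get_preprocessors()]

    def get_loss_definitions(self) -> list:
        return [d for c in self.components for d in c.get_loss_definitions()]

    def get_compute_usage(self, tokens: int) -> int:
        return sum(c.get_compute_usage(tokens) for c in self.components)
```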

Side feature: Revisit model conversion. (#362)

  • Modularize converters to mimic the modularity of configs and modules. Define a full set of converters for Llama, and use it as a basis for other models which override/extend the relevant components.
  • Convert configs (dicts) directly instead of going through complicated converter objects (see the sketch after this list).
  • Drop converters for starcoder2 and most SSMs, keeping only apriel_hybrid_ssm (previously apriel 15b), which supports both mamba 2 and discrete mamba 2.
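
As an illustration of the dict-to-dict style, a config converter now looks roughly like a plain mapping function. Neither the Fast-LLM keys nor the selection of Hugging Face keys below is taken from the actual code; this is a sketch of the shape only:

```python
def convert_llama_config_sketch(fast_llm_config: dict) -> dict:
    # Illustrative only: read fields from a (hypothetical) Fast-LLM config dict and
    # emit a Hugging Face-style dict directly, with no intermediate converter objects.
    block_sequence = fast_llm_config["base_model"]["block_sequence"]
    mixer = block_sequence["block"]["mixer"]
    return {
        "model_type": "llama",
        "num_hidden_layers": block_sequence["num_blocks"],
        "num_attention_heads": mixer["heads"],
        "intermediate_size": block_sequence["block"]["mlp"]["intermediate_size"],
    }
```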

Known issues / todo / notes

Critical:

  • These changes break backward compatibility for both models and configs. This should be manageable since HF conversion still enables some form of backward compatibility. However, we still need to explicitly prevent loading of old models, which could otherwise cause trouble. I suggest bumping the version to 0.3.0 so we have a clear distinction between formats.
  • test_huggingface_model fails for SSMs because of bugs in the external model (lack of support for DynamicCache?). This is a huge problem because this test is our main correctness test for SSMs.
  • Check whether we need the removed SSM converters (and associated external models) or any other removed feature.

Important but could postpone:

  • Update documentation and tutorial.
  • test_huggingface_model fails for Mixtral, probably a conversion issue.
  • Concatenated weights (e.g. key_value, in_proj) have separate configs but only use one of them.
  • Preprocessors are unsafe, as different blocks (e.g. with different rotary configs) may have different preprocessors writing different values to the same kwargs. This possibly applies to loss definitions too.

Minor:

  • Pipeline-parallel checkpoint conversion not working for tied weights. (not new, but found recently)
  • Configure experts independently? (The option to vary lr scale by expert was removed.)
  • Parallel tests are sometimes flaky due to crashes in distributed tests (NCCL?).

@tscholak (Collaborator) commented:

First impression is very good!

Base automatically changed from block_interface to main September 18, 2025 00:25
@jlamypoirier marked this pull request as ready for review September 18, 2025 00:59
@tscholak (Collaborator) left a comment:

LGTM!!!
