Support block-modular architecture #242

Open · 2 of 4 tasks
tscholak opened this issue Apr 24, 2025 · 5 comments
Labels: enhancement (New feature or request)

tscholak commented Apr 24, 2025

🎯 Goal (What & Why)

Enable fully modular, per-block configuration in Fast-LLM, following up on the hybrid-architecture support introduced in #194.

Currently, hybrid models (e.g., interleaving Mamba 1, Mamba 2, and Transformer blocks) are limited by global block-type configurations: all transformer blocks share one config, and all SSM blocks another. This is too rigid.

We want to:

  • Allow different configurations per block, even for the same type.
  • Support named blocks with configurable weight sharing.
  • Enable expressive, fine-grained architectures, useful for:
    • Exploring different attention mechanisms in a single model.
    • Tying weights across repeated block instances.
    • Designing sparse, pruned, or ablation-based stacks.
    • Preparing for model export pipelines with heterogeneous block stacks.

This would eliminate the current one-size-fits-all limitation and make model stacks in Fast-LLM truly composable and expressive.

🚀 Execution Plan

This is a config and model-construction feature. The main change is replacing the global transformer and ssm sections with a new per-block format.

Key Ideas

  • Add model.blocks: a dict of named block configs (e.g., alice, bob, claire, potato — the names are arbitrary; see the examples below).
  • Add block_pattern: a list specifying the block sequence by name.
  • Add num_layers: total depth of the model. The pattern repeats to reach this.
  • Allow block-level options like:
    • kind: transformer | ssm | ...
    • attention_type, sliding_window, dt_rank, etc.
    • shared_weights: true for parameter sharing
    • lora: ...
  • Blocks reused by name will share configuration; if shared_weights: true, they’ll also reuse parameters.

Minimal Implementation Path

  1. Define new schema and validate it (e.g., every pattern entry must resolve to a block).
  2. Update model construction to instantiate blocks from model.blocks, repeating block_pattern until num_layers is reached.
  3. Add weight-sharing logic: instantiate shared blocks once and reuse their parameters across layers (see the sketch after this list).
  4. Add support for block-level LoRA injection.
  5. Maintain backwards compatibility: for existing models, fall back to the current global transformer/ssm layout if model.blocks is absent. Save new checkpoints in the new format.
  6. Extend test coverage:
    • Stacks with different transformer configs
    • Mixed MQA/GQA/sliding-window blocks
    • Interleaved SSM and transformer blocks
    • Shared and unshared weights
  7. Update documentation with examples and migration guide.
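
To make steps 2 and 3 concrete, here is a minimal, hypothetical sketch of pattern resolution and weight sharing. None of these names (BlockConfig, resolve_layers, build_stack, build_block) are Fast-LLM's actual API, and the sketch assumes num_layers is a multiple of the pattern length.

from dataclasses import dataclass, field


@dataclass
class BlockConfig:
    kind: str                                     # "transformer" | "ssm" | ...
    shared_weights: bool = False                  # reuse parameters across repeated occurrences
    options: dict = field(default_factory=dict)   # attention_type, dt_rank, ...


def build_block(config: BlockConfig) -> object:
    # Placeholder for the real per-kind factory (transformer, ssm, ...).
    return object()


def resolve_layers(blocks: dict[str, BlockConfig], block_pattern: list[str], num_layers: int) -> list[str]:
    # Every pattern entry must resolve to a defined block (schema validation, step 1).
    for name in block_pattern:
        if name not in blocks:
            raise ValueError(f"Unknown block '{name}' in block_pattern")
    # Assumption: num_layers is a multiple of the pattern length (truncation would be another option).
    if num_layers % len(block_pattern) != 0:
        raise ValueError("num_layers must be a multiple of the pattern length")
    return block_pattern * (num_layers // len(block_pattern))


def build_stack(blocks: dict[str, BlockConfig], block_pattern: list[str], num_layers: int) -> list:
    layer_names = resolve_layers(blocks, block_pattern, num_layers)
    shared_instances: dict[str, object] = {}
    stack = []
    for name in layer_names:
        config = blocks[name]
        if config.shared_weights:
            # Instantiate once and reuse the same module, and thus its parameters, for every occurrence.
            if name not in shared_instances:
                shared_instances[name] = build_block(config)
            stack.append(shared_instances[name])
        else:
            stack.append(build_block(config))
    return stack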

Example Config: One block

model:
  blocks:
    default_transformer:
      kind: transformer
      attention_type: mqa
      use_flash_attention: true
      num_heads: 16
      hidden_size: 4096

  block_pattern: ["default_transformer"]
  num_layers: 48

Example Config: Many blocks

model:
  blocks:
    alice:
      kind: transformer
      attention_type: mqa
      sliding_window: false

    bob:
      kind: transformer
      attention_type: gqa
      sliding_window: true
      shared_weights: true

    claire:
      kind: ssm
      variant: mamba1
      dt_rank: auto

    dave:
      kind: ssm
      variant: discrete_mamba2
      state_size: 16

  block_pattern: ["alice", "bob", "claire", "dave", "bob"]
  num_layers: 15

Here:

  • The 5-block pattern repeats 3 times to fill the 15 layers of the model (the fully resolved stack is listed below).
  • bob appears 6 times but defines its weights only once (shared).
  • Each block can be configured independently.
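
For reference, the fully resolved 15-layer stack for this example would be (illustrative only):

  alice, bob, claire, dave, bob    # layers 0-4
  alice, bob, claire, dave, bob    # layers 5-9
  alice, bob, claire, dave, bob    # layers 10-14

with all six bob layers pointing at a single set of shared parameters.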

📌 Acceptance Criteria

  • model.blocks is supported with flexible per-block config.
  • block_pattern resolves correctly and builds a full stack of layers.
  • Shared weights reduce parameter count where shared_weights: true is set.
  • Legacy config format (transformer, ssm) remains supported for the time being, with a deprecation warning.
  • Unit tests validate (see the test sketch below):
    • Per-block config behaviour
    • Mixed block types
    • Shared vs non-shared blocks
  • Documentation updated with clear example configs and usage patterns.
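
A sketch of the kind of unit test the shared-weights criterion implies, continuing the hypothetical BlockConfig/build_stack sketch above (pytest-style, illustrative only):

def test_shared_weights_reduce_parameter_count():
    blocks = {
        "alice": BlockConfig(kind="transformer"),
        "bob": BlockConfig(kind="transformer", shared_weights=True),
    }
    stack = build_stack(blocks, block_pattern=["alice", "bob"], num_layers=6)
    # All three "bob" layers should be the very same module instance.
    bob_layers = [stack[i] for i in (1, 3, 5)]
    assert all(layer is bob_layers[0] for layer in bob_layers)
    # "alice" layers should be distinct instances with their own parameters.
    alice_layers = [stack[i] for i in (0, 2, 4)]
    assert len({id(layer) for layer in alice_layers}) == 3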

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days).
  • Use the Size field to categorize the PR size (Large).
  • Assign an owner when opening the issue.
tscholak added the enhancement (New feature or request) label on Apr 24, 2025
jlamypoirier commented:

This is extremely similar to #155 and #168, so we can try something similar, the main noticeable differences being:

  • Layer usage specified through a block_pattern. This is simpler and more user-friendly, but only when we want repeated block patterns. Eventually we'll want to think about more generic block configurations, e.g. doing something different in the first few layers.
  • We want to mix different block types. Concretely this means dynamic config classes. We can either do [feat] Generalize dynamic config classes #126 first or make a one-off hack like the one for datasets.
  • No default. The solution from [Prototype] Option to configure layers independently #168 clearly won't work because of dynamic classes, but a lack of default/override could make configs extremely verbose for complex cases. We do have hydra for setting up the initial configuration, but hydra's simplicity goes away after resolution, so printed and saved configs will still be extremely hard to read. For future work we might want to add a default field to blocks to simplify things.

jlamypoirier commented:

Also note that some of the issues from #168 will go away because of recent work, the lack of a default, and the simpler pattern specification, but others will remain:

  • Adds complexity when not using the feature, and needs backward compatibility
  • Some parameters need to match between blocks, e.g. hidden_size
  • Need to rethink TensorSpace
  • Conversion is difficult (if we want it)

tscholak (Author) commented:

Thanks a lot for the comments, @jlamypoirier. I want to clarify that this proposal is not a variation of #155 or #168, and it is not about override machinery.

The goal here is to move away from mutation-based configuration with implicit semantics, and toward a declarative, per-block architecture spec that:

  1. Makes the model stack explicit and easy to reason about.
  2. Supports named block reuse and weight sharing.
  3. Represents both homogeneous and heterogeneous stacks cleanly.
  4. Aligns naturally with future use cases like export, distillation, and stacking.

Actually, config verbosity is not a problem, and in my opinion we should never have attempted to fix it. Instead, we should be optimizing for clarity, expressiveness, and compatibility with tooling. Our configs are either machine-generated or visualized in tools like WandB, where search and filtering solve the discoverability problem. In fact, verbosity is a feature: it ensures that configs are fully resolved and auditable. That's exactly what we want when debugging or reproducing results.

By contrast, I find that the override-based systems we were looking at before (as in #155) introduce unnecessary indirection and cognitive overhead. They solve the wrong problem and create new ones, like ambiguity, non-local reasoning, and fragile conversion logic.

This proposal solves the architecture representation problem. It introduces a clean, composable structure that can express things override-based systems can't: tied weights, named block templates, block-level LoRA injection, and more.

If we want to layer conveniences like defaults later, we can, but we can also just solve this with hydra. In any case, the core should remain simple, explicit, and composable.

jlamypoirier commented:

This is about varying configuration across layers, so conceptually it is a different implementation of the exact same thing as #155 and #168.

Fully resolved configs are sometimes what we want, but readability and understandability are a major concern. The default config is already 350 lines long (and growing fast) when printed in its entirety. The proposed changes can easily bring resolved configs to 1000+ lines, 10000+ in the more complex cases. This is extremely difficult for humans to read and debug without access to the original hydra config, which is not available once the experiment is launched. Such verbosity is only needed a tiny fraction of the time (and we have it either way), but otherwise causes major harm; a summary is much more appropriate.

tscholak (Author) commented:

Ok, let me clarify where we're coming from:

  • This proposal isn't trying to solve all future model configuration problems. It has a deliberately limited scope: making heterogeneous block stacks easy to express, audit, and work with in practice, specifically addressing immediate needs for current architectural experiments.
  • We intentionally lean toward explicit, verbose, fully-resolved configuration because that's what's most valuable to the key people training these models right now. Explicitness and verbosity are assets for debugging, inspection, searching configs in WandB, and reproducibility. In short, it's a practical win.
  • You're correct that there's some overlap with goals from Option to vary configuration parameters across layers #155, [Prototype] Option to configure layers independently #168, and related issues. But having reviewed those, my perspective is that they target a different problem entirely: how to compactly express or override configuration across multiple layers. That's not the challenge we're addressing here. We're specifically focused on the representation problem: clearly and explicitly handling heterogeneous block types and their distinct settings, without needing overrides, inference logic, or complicated merging semantics.

Let me be blunt about the override-based approach: after reviewing it, I'm not convinced. It introduces complexity, edge cases, implicit behaviour, and challenging-to-audit logic, and would ultimately not address the practical needs we're prioritizing. Compact serialized configs just aren't a pressing concern for the team. In fact, the team actively prefers fully explicit configurations, regardless of verbosity, because it accurately reflects model reality and simplifies operational workflows.

Given all this, I think we should get started with this ticket as defined here ASAP.
