Support block-modular architecture #242
Comments
This is extremely similar to #155, #168, so we can try something similar, the main noticeable difference being:
Also note that some of the issues from #168 will go away because of recent work (the lack of defaults, and the simpler pattern specification), but others will remain:
Thanks a lot for the comments, @jlamypoirier. I want to clarify that this proposal is not a variation of #155 or #168, and it is not about override machinery. The goal here is to move away from mutation-based configuration with implicit semantics, and toward a declarative, per-block architecture spec that:
Actually, config verbosity is not a problem, and in my opinion we should never have attempted to fix it. Instead, we should be optimizing for clarity, expressiveness, and compatibility with tooling. Our configs are either machine-generated or visualized in tools like WandB, where search and filtering solve the discoverability problem. In fact, verbosity is a feature: it ensures that configs are fully resolved and auditable. That's exactly what we want when debugging or reproducing results. By contrast, I find that the override-based systems we were looking at before (as in #155) introduce unnecessary indirection and cognitive overhead. They solve the wrong problem and create new ones, like ambiguity, non-local reasoning, and fragile conversion logic. This proposal solves the architecture representation problem. It introduces a clean, composable structure that can express things override-based systems can't: tied weights, named block templates, block-level LoRA injection, and more. If we want to layer conveniences like defaults later, we can, but we can also just solve this with hydra. In any case, the core should remain simple, explicit, and composable.
This is about varying layer configuration across layers, so conceptually it is a different implementation of the exact same thing as #155, #168. Fully resolved configs are sometimes what we want, but readability and understandability are a major concern. The default config is already 350 lines long (and growing fast) when printed in its entirety. The proposed changes could easily bring resolved configs to 1000+ lines, 10000+ in the more complex cases. This is extremely difficult for humans to read and debug without access to the original hydra config, which is not available once the experiment is launched. Such verbosity is only needed a tiny fraction of the time (and we do have it either way), but otherwise causes major harm; a summary is much more appropriate.
Ok, let me clarify where we're coming from:
Let me be blunt about the override-based approach: after reviewing it, I'm not convinced. It introduces complexity, edge cases, implicit behaviour, and challenging-to-audit logic, and would ultimately not address the practical needs we're prioritizing. Compact serialized configs just aren't a pressing concern for the team. In fact, the team actively prefers fully explicit configurations, regardless of verbosity, because it accurately reflects model reality and simplifies operational workflows. Given all this, I think we should get started with this ticket as defined here ASAP.
🎯 Goal (What & Why)
Enable fully modular, per-block configuration in Fast-LLM to follow up on hybrid architecture support introduced in #194.
Currently, hybrid models (e.g., interleaving Mamba 1, Mamba 2, and Transformer blocks) are limited by global block-type configurations: all transformer blocks share one config, and all SSM blocks another. This is too rigid.
We want to:
This would eliminate the current one-size-fits-all limitation and make model stacks in Fast-LLM truly composable and expressive.
🚀 Execution Plan
This is a config and model-construction feature. The main change is replacing the global `transformer` and `ssm` sections with a new per-block format.

Key Ideas
- `model.blocks`: a dict of named block configs (e.g., `alice`, `bob`, `claire`, `potato`, etc., it doesn't matter what they are called, see example below).
- `block_pattern`: a list specifying the block sequence by name.
- `num_layers`: total depth of the model. The pattern repeats to reach this.
- Each block config declares its `kind: transformer | ssm | ...` plus block-specific options (`attention_type`, `sliding_window`, `dt_rank`, etc.), optionally `shared_weights: true` for parameter sharing, and optionally `lora: ...`.
- When the same named block appears multiple times in the pattern with `shared_weights: true`, they'll also reuse parameters.

Minimal Implementation Path
- Parse `model.blocks` and repeat the pattern to reach `num_layers`.
- Fall back to the legacy `transformer`/`ssm` layout if `model.blocks` is absent. Save new checkpoints using the new format.

Example Config: One block
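The config body for this example did not survive extraction. As a sketch only, a single-block config under this proposal might look like the following; the field names come from the Key Ideas above, but the exact schema and values are hypothetical:

```yaml
model:
  num_layers: 12
  blocks:
    alice:                 # block names are arbitrary
      kind: transformer
      attention_type: full
  block_pattern: [alice]   # repeats until num_layers is reached
```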
Example Config: Many blocks
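This example's body was also lost in extraction. A hypothetical sketch consistent with the description below (where `bob` appears 6 times in the pattern but defines its weights once via `shared_weights: true`); all field values are illustrative assumptions:

```yaml
model:
  num_layers: 9
  blocks:
    alice:
      kind: transformer
      attention_type: full
    bob:
      kind: ssm
      dt_rank: 16
      shared_weights: true   # all bob instances reuse one parameter set
    claire:
      kind: transformer
      sliding_window: 4096
      lora: {rank: 8}        # hypothetical block-level LoRA injection
  block_pattern: [alice, bob, bob, claire, bob, bob, bob, bob, alice]
```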
Here: `bob` appears 6 times, but defines weights once (shared).

📌 Acceptance Criteria
- `model.blocks` is supported with flexible per-block config.
- `block_pattern` resolves correctly and builds a full stack of layers.
- Weights are shared when `shared_weights: true` is set.
- The legacy format (`transformer`, `ssm`) remains supported with a deprecation warning for the time being.

🛠️ Project Management
- `Estimate` field (in days).
- `Size` field to categorize the PR size (Large).
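For illustration, the pattern-resolution step from the Minimal Implementation Path above can be sketched as follows. Function and field names here are hypothetical, not the actual Fast-LLM API:

```python
def resolve_block_stack(blocks, block_pattern, num_layers):
    """Expand block_pattern cyclically to num_layers entries and
    validate that every referenced block name is defined."""
    unknown = set(block_pattern) - set(blocks)
    if unknown:
        raise ValueError(f"Pattern references undefined blocks: {sorted(unknown)}")
    # Repeat the pattern until the stack reaches the requested depth.
    return [block_pattern[i % len(block_pattern)] for i in range(num_layers)]


blocks = {"alice": {"kind": "transformer"}, "bob": {"kind": "ssm"}}
stack = resolve_block_stack(blocks, ["alice", "bob", "bob"], num_layers=7)
print(stack)  # ['alice', 'bob', 'bob', 'alice', 'bob', 'bob', 'alice']
```

Weight sharing would then be a second pass over this resolved list, instantiating one parameter set per block name marked `shared_weights: true`.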