[ModernBERT] Add CausalLM functionality to ModernBERT #35946
Conversation
Thanks for assigning @gante! Also cc'ing the others who may be interested again @Rocketknight1 @ArthurZucker. Let me know if I need to do anything else; I see I need to update the branch with main since it's been a while.
Hey! Thanks for the PR 🤗
As you already anticipated, my main concern is that this is a different "architecture" per se! With modular, writing ModernBertDecoder + ModernBertForCausalLM should be easier, but we usually never change the codebase of existing models like this!
Other than this, the tokenizer is not needed either! The settings you added can be controlled directly via the template processor, which needs to be different for each model! 🤗
Thanks @ArthurZucker! I am not 100% sure whether you are saying having them combined is a no-go or potentially still on the table, so I will give you my last attempt at describing why I think it's a good idea to have them combined.
What are your thoughts? You're the expert here, so if you still think it's best to separate them I will move the implementation to a
@orionw a question: this would be the equivalent of what we have in the original
If so, I'd be inclined to include your PR, given that it is a pattern present in such a staple model. One modeling file would be a near copy of the other. But @ArthurZucker has the final word here :)
This PR adds CausalLM functionality to ModernBERT. This means:
- an `is_causal` option, since otherwise we'd need to pad/repad too frequently. We lose a bit of speed, but the code is much cleaner.

This is for an upcoming release that trained both options for a comparison of encoders and decoders.
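For readers unfamiliar with the distinction: the `is_causal` option amounts to restricting attention so that each position can only attend to itself and earlier positions. A minimal dependency-free sketch of that mask (for illustration only; this is not the PR's actual implementation, which builds the mask inside the attention kernels):

```python
def causal_mask(seq_len):
    """Lower-triangular boolean mask: entry [i][j] is True when query
    position i is allowed to attend to key position j (i.e. j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Position 0 sees only itself; the last position sees the whole prefix.
mask = causal_mask(4)
```

With `is_causal=False` the mask is all-True and the model behaves as the usual bidirectional encoder.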
Anticipated FAQs:
Q: Why not just make a new model class?
A: Models are trained in both settings and need to be able to operate back and forth with the same weights. This also reduces redundant code since all the base functionality is the same, other than kv caching and attention masking.
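To make the shared-weights point concrete: the same projection weights can serve both modes, because only the masking step differs. A hypothetical sketch (plain Python, illustrative names; not the PR's code):

```python
import math

def attention_scores(q, k, is_causal):
    """Scaled dot-product scores for one head.

    q, k: lists of equal-length vectors produced by the SAME learned
    projections in both modes; is_causal only changes the masking,
    which is why one checkpoint can run as encoder or decoder.
    """
    n, d = len(q), len(q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q[i], k[j])) / math.sqrt(d)
         for j in range(n)]
        for i in range(n)
    ]
    if is_causal:
        # Block attention to future tokens before the softmax.
        for i in range(n):
            for j in range(i + 1, n):
                scores[i][j] = float("-inf")
    return scores
```

Bidirectional and causal calls share every weight; only the upper-triangular entries differ.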
Q: Is there a test model?
A: Currently I have a test model at https://huggingface.co/blab-jhu/test-32m-dec that has been trained with CLM and works with this PR.
Q: Why are there two methods of CausalLM?
A: One works for the encoder-only models, which perform better than I anticipated on decoder-only tasks. The other is standard decoder-only modeling, which is the default.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Tagging those who have reviewed ModernBERT PRs: @Rocketknight1 @ArthurZucker