model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules #16367
Conversation
model: add support for EmbeddingGemma SentenceTransformers dense linear projections

Adding support for the Dense modules used in EmbeddingGemma models. EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone. See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
My understanding of how SentenceTransformer works is that these modules are applied after the base model has produced its output: SentenceTransformer scans for numbered module directories (in this case 1_Pooling, 2_Dense, and 3_Dense) and applies them sequentially as post-processing steps.

This came up during development, and it was decided not to include any of these modules in the llama.cpp conversion. The model should output the base transformer embeddings directly. Pooling can be optionally configured in llama.cpp and can be done in several different ways (mean, CLS, last token, etc.) or not at all, depending on the user's needs. Including the Dense layers would bake in a specific post-processing pipeline that assumes mean pooling is always used, which reduces the flexibility that the pooling options provide. Additionally, users may want access to the raw token embeddings from the base model for their own use cases, rather than having the SentenceTransformer post-processing baked in. Keeping these separate allows users to choose whether they want the SentenceTransformer behavior or the raw model outputs.

That's at least my take on this matter, but if others disagree I'm open to these changes. I just wanted to provide some background on the reasoning.
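To illustrate what that post-processing looks like, here is a rough sketch in plain NumPy (illustrative only, not llama.cpp code); the mean-pooling step and the two linear projections just mirror the module layout described above, and any activation_function from the Dense configs is left out:

```python
# Rough sketch (illustrative only, not llama.cpp code) of the sequential
# SentenceTransformers post-processing described above: 1_Pooling pools the
# token embeddings, then 2_Dense and 3_Dense apply linear projections.
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, hidden), attention_mask: (seq_len,)
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def dense(x, weight, bias=None):
    # sentence-transformers' Dense module is a plain linear layer: y = x @ W.T + b
    # (any activation_function configured for the module is omitted here)
    y = x @ weight.T
    return y if bias is None else y + bias

def st_post_process(token_embeddings, attention_mask, dense_modules):
    # dense_modules: [(weight, bias), ...] loaded from 2_Dense/ and 3_Dense/
    x = mean_pool(token_embeddings, attention_mask)    # 1_Pooling
    for weight, bias in dense_modules:                 # 2_Dense, then 3_Dense
        x = dense(x, weight, bias)
    return x
```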
@danbev I had already anticipated the reasons why dense layers were not included in the first place. Let's take the example of my own project, where I'm using this model. Please see the example below.

Base Model:
ST Model:
The results become even more different (not to say worse) when MRL-reduced dimensions are used. As a user, I would have preferred the following:
I could also imagine accommodating or implementing ST modules in a more generic way, similar to how LoRA adapters are handled.

Sorry for making this so long, but this model is an important one for users like me.
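For reference, a rough sketch of the kind of comparison I mean (the model id, the mean pooling for the base model, and the cosine check are assumptions for illustration; the two vectors are only comparable because the ST output dimension matches the hidden size):

```python
# Rough sketch of comparing the base transformer output with the full
# SentenceTransformers pipeline. Model id and metric are illustrative.
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

model_id = "google/embeddinggemma-300m"  # assumed model id
text = "What is the capital of France?"

# Full SentenceTransformers pipeline: pooling + dense projections applied.
st_emb = SentenceTransformer(model_id).encode(text, normalize_embeddings=True)

# Base transformer only: mean-pooled token embeddings, no dense modules
# (roughly what a GGUF without the dense layers gives you).
tok = AutoTokenizer.from_pretrained(model_id)
base = AutoModel.from_pretrained(model_id)
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = base(**inputs).last_hidden_state[0]       # (seq_len, hidden)
mask = inputs["attention_mask"][0].unsqueeze(-1)
base_emb = (hidden * mask).sum(dim=0) / mask.sum()
base_emb = torch.nn.functional.normalize(base_emb, dim=0).numpy()

# Only meaningful if both vectors have the same dimension (768 for the default,
# non-MRL-truncated EmbeddingGemma output).
print("cosine:", float(np.dot(st_emb, base_emb)))
```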
@sfallah Thanks for the detailed description - this is quite helpful. The main reason support for Dense embedding modules has not been implemented is that until recently (i.e. until our work with @danbev on EmbeddingGemma) I had no idea what their purpose was and how they are used. But now it is clearer. We should add some way to support this. It seems it would involve generalizing/extending the pooling logic/API as well as (optionally) incorporating the modules (i.e. tensors) into the GGUFs during conversion.
On first thought, the configuration of the dense modules would have to be done at the context level.

Since you have some first steps towards adding support for dense modules with this PR, we can continue with designing and implementing support for dense module configurations. Let me know if you are interested in putting some extra work into this, and I will try to provide steps for how to proceed.
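To make the conversion part concrete, something along these lines could work (the tensor names, metadata keys and arch string below are placeholders, not names that llama.cpp defines today):

```python
# Hypothetical sketch of baking the dense modules into a GGUF during conversion.
# All tensor names and metadata keys here are placeholders for illustration.
import numpy as np
import gguf

writer = gguf.GGUFWriter("embeddinggemma-dense.gguf", "gemma-embedding")  # placeholder arch

# Dense-module configuration (values as read from 2_Dense/config.json and
# 3_Dense/config.json of the SentenceTransformers checkpoint).
writer.add_uint32("gemma-embedding.dense.count", 2)                # placeholder key
writer.add_uint32("gemma-embedding.dense.2.out_features", 3072)    # example value
writer.add_uint32("gemma-embedding.dense.3.out_features", 768)     # example value

# The linear projection weights themselves (zeros stand in for the real tensors).
writer.add_tensor("dense_2.weight", np.zeros((3072, 768), dtype=np.float32))
writer.add_tensor("dense_3.weight", np.zeros((768, 3072), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```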
@ggerganov
@sfallah Thanks for the detailed explanation! This does seem very important for RAG use cases. I've added #16387 for updating the model-conversion example (tool), which we've used for a few models now. I've tried this out with your pull request and it seems to work. Hopefully we can update it as this work progresses and be prepared for future models that require the same type of features.
I wonder if there is a very simple solution that we can do:
This way a user can create a GGUF that either includes the dense modules or not, depending on what they need. This makes the implementation much simpler, as we don't have to extend the API. But it creates some burden for the user - they would have to be careful about which GGUF they are using. In any case, doing it like this is basically a first step towards the more general support later that would allow turning the dense modules on and off during context creation.
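Roughly, on the conversion side it could look like this (the flag name and the helper functions are hypothetical, not something convert_hf_to_gguf.py provides):

```python
# Sketch of the "two GGUF flavours" idea: a single conversion flag decides whether
# the SentenceTransformers dense modules are baked into the resulting GGUF.
# The flag name and helper functions are hypothetical placeholders.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--include-dense-modules", action="store_true",
                    help="also convert 2_Dense/3_Dense into the GGUF (hypothetical flag)")
args = parser.parse_args()

def convert_base_model():
    ...  # placeholder for the existing conversion path

def convert_dense_modules():
    ...  # placeholder for converting the 2_Dense and 3_Dense tensors

convert_base_model()
if args.include_dense_modules:
    # GGUF with dense modules baked in: output matches the SentenceTransformers pipeline.
    convert_dense_modules()
# Without the flag the GGUF contains only the base model, so the user keeps the
# raw embeddings and full control over pooling.
```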
@ggerganov In my opinion, it is not a burden for the user (at least not for me) to know whether I will be deploying a GGUF model that includes dense layers or not, in the same way that I need to know which quantization type my GGUF model has. The only issue is flexibility regarding pooling, which in practice I would not see as a problem, for the following reasons:
I know the second point is a bit oversimplified, but in practice it is generally true for
So I don't see any problem if, for example, in the case of the
model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting model with dense layers is optional
- introduced dense config params
Overview of Changes
About Module Configuration

By reading the dense-module configuration, we lay the groundwork for full linear projection support.
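For context, this is the per-module configuration being read (a sketch; the config.json fields are the ones sentence-transformers' Dense module stores, while the mapping onto GGUF metadata is left open here):

```python
# Sketch of reading the dense-module configuration from a SentenceTransformers
# checkpoint. How these fields map onto GGUF metadata is left open here.
import json
from pathlib import Path

def read_dense_config(module_dir):
    cfg = json.loads((Path(module_dir) / "config.json").read_text())
    return {
        "in_features": cfg["in_features"],
        "out_features": cfg["out_features"],
        "bias": cfg["bias"],
        "activation_function": cfg["activation_function"],  # e.g. "torch.nn.modules.linear.Identity"
    }

for module_dir in ("2_Dense", "3_Dense"):  # EmbeddingGemma's dense modules
    print(module_dir, read_dense_config(module_dir))
```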