
Conversation

sfallah

@sfallah sfallah commented Oct 1, 2025

model: add support for EmbeddingGemma SentenceTransformers dense linear projections


Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.

See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
@sfallah sfallah requested a review from CISC as a code owner October 1, 2025 11:12
@github-actions github-actions bot added the python python script changes label Oct 1, 2025
@CISC CISC requested a review from danbev October 1, 2025 11:22
@danbev
Member

danbev commented Oct 1, 2025

My understanding of how SentenceTransformer works is that these modules are applied after the base model has produced its output. SentenceTransformer scans for numbered module directories (in this case 1_Pooling, 2_Dense, and 3_Dense) and applies them sequentially as post-processing steps.
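
For reference, a minimal sketch of what that sequential pipeline looks like when inspected with the sentence-transformers package (the Hugging Face model id below is an assumption; adjust it to the checkpoint in use):

  # Sketch: list the SentenceTransformers modules applied after the base Transformer
  # (the numbered module directories mentioned above).
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("google/embeddinggemma-300m")  # model id is an assumption

  # SentenceTransformer is a torch.nn.Sequential, so the modules run in order,
  # e.g. Transformer -> Pooling -> Dense -> Dense
  for idx, module in enumerate(model):
      print(idx, type(module).__name__)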

This came up during development and it was decided not to include any of these modules in the llama.cpp conversion. The model should output the base transformer embeddings directly. Pooling can be optionally configured in llama.cpp and can be done in several different ways (mean, CLS, last token, etc.) or not at all, depending on the user's needs.

Including the Dense layers ties the output to a specific post-processing pipeline that assumes mean pooling will always be used, which reduces the flexibility of the pooling options provided. Additionally, users may want access to the raw token embeddings from the base model for their own use cases, rather than having the SentenceTransformer post-processing baked in. Keeping these separate allows users to choose whether they want the SentenceTransformer behavior or the raw model outputs.

That's at least my take on this matter, but if others disagree I'm open to these changes. I just wanted to provide some background on the reasoning.

@sfallah
Author

sfallah commented Oct 1, 2025

@danbev
Thanks for your rapid feedback.

I had already anticipated the reasons why dense layers were not included in the first place.
And I understand your arguments for not including the additional SentenceTransformers (ST) modules — but in practice, for the client/user, it’s very helpful if they are already included.

Let’s take the example of my own project, where I’m using embeddinggemma for RAG.
First and foremost, I would face a quality issue if dense layers are not applied.
Besides the fact that, as far as I know, the MTEB benchmarks for embeddinggemma are done using the ST model (not just the base model), the quality issue can even be demonstrated with a trivial test.

Please see the example below:

Base Model:

A man is playing guitar:
	The dog plays in the garden: 0.5552932620048523
	A woman watches TV: 0.4669498801231384
	Do you like pizza?: 0.4650737941265106
	The new movie is so great: 0.438459575176239
I love pasta:
	Do you like pizza?: 0.676074206829071
	The new movie is so great: 0.5771214962005615
	A woman watches TV: 0.4970381259918213
	The dog plays in the garden: 0.4736071228981018

ST Model:

A man is playing guitar:
	The dog plays in the garden: 0.5132929682731628
	A woman watches TV: 0.43405574560165405
	The new movie is so great: 0.3694726824760437
	Do you like pizza?: 0.31990543007850647
I love pasta:
	Do you like pizza?: 0.6048902273178101
	A woman watches TV: 0.38842126727104187
	The new movie is so great: 0.3778345584869385
	The dog plays in the garden: 0.33146393299102783
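
For context, a rough sketch of how the ST-side numbers above can be reproduced (the base-model column would instead come from embeddings of a GGUF converted without the dense layers; the model id and the omission of task-specific prompts are assumptions):

  # Sketch: pairwise cosine similarities with the full SentenceTransformers pipeline.
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("google/embeddinggemma-300m")  # model id is an assumption

  queries = ["A man is playing guitar", "I love pasta"]
  docs = ["The dog plays in the garden", "A woman watches TV",
          "Do you like pizza?", "The new movie is so great"]

  q_emb = model.encode(queries, normalize_embeddings=True)
  d_emb = model.encode(docs, normalize_embeddings=True)

  for q, row in zip(queries, q_emb @ d_emb.T):  # dot product of unit vectors = cosine
      print(f"{q}:")
      for d, score in sorted(zip(docs, row), key=lambda x: -x[1]):
          print(f"\t{d}: {score:.4f}")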

The results become even more different (not to say worse) when MRL-reduced dimensions are used.
So for me as a user — someone who wants to use this for RAG and similar applications — only the full ST model is truly useful.
And if I were to apply the dense layers myself on the client side, it would be quite impractical and most likely inefficient.

As a user, I would have preferred the following:

  • That llama.cpp supports both base-model-only and ST-model modes.
  • That convert supports either full ST conversion or base-model-only conversion.
  • That both ST GGUF models and base models can be loaded.
  • (Optional) That applying the dense layers can be toggled in server embedding requests.

I could also imagine accommodating or implementing ST modules in a more generic way, similar to how LoRA adapters are handled.

Sorry for making this so long, but this model is an important one for users like me.
It’s very efficient and has a high MTEB ranking — but for that to hold true, the dense layers are crucial.

@ggerganov
Member

@sfallah Thanks for the detailed description - this is quite helpful. The main reason support for Dense embedding modules was not implemented is that until recently (i.e. until our work with @danbev on EmbeddingGemma) I had no idea what their purpose is and how they are used. But now it is more clear.

We should add some way to support that. It seems it would involve generalizing/extending the pooling logic/API as well as (optionally) incorporating the modules (i.e. tensors) into the GGUFs during conversion.

(Optional) That applying the dense layers can be toggled in server embedding requests.

On first thought, the configuration of the dense modules would have to be done on the llama_context level, so dynamically switching the modules per request might not be possible to support. But if this is an important use case, we can think of ways to accommodate it, though it would require changes in both the server embedding API and the libllama API 🤔

Since you have some first steps towards adding support for dense modules with this PR, we can continue with designing and implementing support for dense module configurations. Let me know if you are interested in putting some extra work into this, and I will try to provide steps on how to proceed.

@sfallah
Author

sfallah commented Oct 2, 2025

@ggerganov
Thank you for taking the time to review this PR.
I’m very interested in collaborating with you and the team on this.

@danbev
Member

danbev commented Oct 2, 2025

@sfallah Thanks for the detailed explanation! This does seem very important for RAG use cases.

I've added #16387 for updating the model-conversion example (tool) which we've used for a few models now. I've tried this out with your pull request and it seems to work. Hopefully we can update it as this work progresses and be prepared for future models that require the same type of features.

@ggerganov
Member

I wonder if there is very simple solution that we can do:

  • Add options during the conversion to specify which dense modules to include
  • Update llama.cpp graphs to look for the dense modules as it is done in this PR and unconditionally apply the tensors to the graph if they are present

This way a user can create a GGUF that either includes the dense modules or not depending on what they need. This makes the implementation much simpler as we don't have to extend the API. But it creates some burden for the user - they would have to be careful which GGUF they are using.

In any case, doing it like this is basically a first step towards the more general support later that would allow to turn on and off the dense modules during context creation.
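
To make the conversion half of this concrete, a rough sketch of how the optional inclusion could look on the convert side (the directory layout follows the 2_Dense/3_Dense modules mentioned earlier; the file name, state-dict key, and output tensor names are assumptions, not necessarily what this PR uses):

  # Rough sketch: optionally collect the ST Dense weights so they can be written
  # into the GGUF alongside the base model tensors.
  from pathlib import Path
  from safetensors.torch import load_file

  def collect_dense_tensors(model_dir, include_dense_modules):
      tensors = {}
      if not include_dense_modules:
          return tensors  # base-model-only conversion
      for sub, out_name in (("2_Dense", "dense_2.weight"), ("3_Dense", "dense_3.weight")):
          path = Path(model_dir) / sub / "model.safetensors"  # older exports may use pytorch_model.bin
          if path.exists():
              state = load_file(str(path))
              tensors[out_name] = state["linear.weight"]  # key and output names are assumptions
      return tensors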

@sfallah
Author

sfallah commented Oct 2, 2025

@ggerganov
I totally agree; if I may say so, this is basically what I also meant with my suggestions.

In my opinion, it is not a burden for the user (at least not for me) to know whether the GGUF model being deployed includes dense layers or not, in the same way that I need to know which quantization type my GGUF model has.

The only issue is flexibility regarding pooling, which I would, in practice, not see as a problem for the following reasons:

  1. Which pooling type is practically applicable to a given model is model-dependent.
  2. The option is practically binary: the model's "default" pooling or none.

I know the second point is a bit oversimplified, but in practice, it is generally true for embedding and reranker models.

So I don't see any problem if, for example, in the case of the embeddinggemma model with dense layers included, the pooling set by the user is ignored, because dense layers require mean pooling.

model: add support for EmbeddingGemma SentenceTransformers dense linear projections

- converting model with dense-layers is optional
- introduced dense config params
@sfallah sfallah requested a review from ggerganov as a code owner October 4, 2025 06:58
@sfallah
Author

sfallah commented Oct 4, 2025

Overview of Changes

  • Added the --sentence-transformers-dense-modules conversion option to support including Sentence Transformers (ST) dense layers.
    • Currently, this option applies only to EmbeddingGemma.
  • Dense layers are now added to the graph when they are present in the GGUF file.
  • The configuration of dense modules is now read as the first step toward full, generic dense-module support.

About Module Configuration

By reading the dense-module configuration, we lay the groundwork for full linear projection support.
At the moment, EmbeddingGemma dense layers only include a weight (no bias), and the activation is the identity.
But ST dense layers can represent complete linear projections, including both biases and non-identity activations.
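
As an illustration of that generic case, a small sketch of the projection an ST Dense module defines; its config.json records in_features, out_features, bias and activation_function (for EmbeddingGemma the bias is absent and the activation is the identity):

  # Sketch of the generic ST Dense projection: y = activation(x @ W^T + b).
  import numpy as np

  def apply_dense(x, weight, bias=None, activation=None):
      # x: (n, in_features), weight: (out_features, in_features) as stored by a linear layer
      y = x @ weight.T
      if bias is not None:
          y = y + bias
      return activation(y) if activation is not None else y

  # EmbeddingGemma-style usage: no bias, identity activation (shapes are illustrative only)
  pooled = np.random.rand(1, 768).astype(np.float32)
  w = np.random.rand(3072, 768).astype(np.float32)
  projected = apply_dense(pooled, w)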
