
Conversation

sfallah

@sfallah sfallah commented Oct 1, 2025

model: add support for EmbeddingGemma SentenceTransformers dense linear projections


Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.

See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
@sfallah sfallah requested a review from CISC as a code owner October 1, 2025 11:12
@github-actions github-actions bot added the python python script changes label Oct 1, 2025
@CISC CISC requested a review from danbev October 1, 2025 11:22
@danbev
Member

danbev commented Oct 1, 2025

My understanding of how SentenceTransformer works is that these modules are applied after the base model has produced its output. SentenceTransformer scans for numbered module directories (in this case 1_Pooling, 2_Dense, and 3_Dense) and applies them sequentially as post-processing steps.
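
For reference, a minimal sketch of what that sequential pipeline looks like when inspected with the sentence-transformers package (the Hugging Face model id below is an assumption; adjust it to the checkpoint in use):

  # Sketch: list the SentenceTransformers modules applied after the base Transformer
  # (the numbered module directories mentioned above).
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("google/embeddinggemma-300m")  # model id is an assumption

  # SentenceTransformer is a torch.nn.Sequential, so the modules run in order,
  # e.g. Transformer -> Pooling -> Dense -> Dense
  for idx, module in enumerate(model):
      print(idx, type(module).__name__)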

This came up during development and it was decided not to include any of these modules in the llama.cpp conversion. The model should output the base transformer embeddings directly. Pooling can be optionally configured in llama.cpp and can be done in several different ways (mean, CLS, last token, etc.) or not at all, depending on the user's needs.

Including the Dense layers ties the output to a specific post-processing pipeline that assumes mean pooling will always be used, which reduces the flexibility of the pooling options provided. Additionally, users may want access to the raw token embeddings from the base model for their own use cases, rather than having the SentenceTransformer post-processing baked in. Keeping these separate allows users to choose whether they want the SentenceTransformer behavior or the raw model outputs.

That's at least my take on this matter, but if others disagree I'm open to these changes. I just wanted to provide some background on the reasoning.

@sfallah
Author

sfallah commented Oct 1, 2025

@danbev
Thanks for your rapid feedback.

I had already anticipated the reasons why dense layers were not included in the first place.
And I understand your arguments for not including the additional SentenceTransformers (ST) modules — but in practice, for the client/user, it’s very helpful if they are already included.

Let’s take the example of my own project, where I’m using embeddinggemma for RAG.
First and foremost, I would face a quality issue if dense layers are not applied.
Besides the fact that, as far as I know, the MTEB benchmarks for embeddinggemma are done using the ST model (not just the base model), the quality issue can even be demonstrated with a trivial test.

Please see the example below:

Base Model:

A man is playing guitar:
	The dog plays in the garden: 0.5552932620048523
	A woman watches TV: 0.4669498801231384
	Do you like pizza?: 0.4650737941265106
	The new movie is so great: 0.438459575176239
I love pasta:
	Do you like pizza?: 0.676074206829071
	The new movie is so great: 0.5771214962005615
	A woman watches TV: 0.4970381259918213
	The dog plays in the garden: 0.4736071228981018

ST Model:

A man is playing guitar:
	The dog plays in the garden: 0.5132929682731628
	A woman watches TV: 0.43405574560165405
	The new movie is so great: 0.3694726824760437
	Do you like pizza?: 0.31990543007850647
I love pasta:
	Do you like pizza?: 0.6048902273178101
	A woman watches TV: 0.38842126727104187
	The new movie is so great: 0.3778345584869385
	The dog plays in the garden: 0.33146393299102783
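
For context, a rough sketch of how the ST-side numbers above can be reproduced (the base-model column would instead come from embeddings of a GGUF converted without the dense layers; the model id and the omission of task-specific prompts are assumptions):

  # Sketch: pairwise cosine similarities with the full SentenceTransformers pipeline.
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("google/embeddinggemma-300m")  # model id is an assumption

  queries = ["A man is playing guitar", "I love pasta"]
  docs = ["The dog plays in the garden", "A woman watches TV",
          "Do you like pizza?", "The new movie is so great"]

  q_emb = model.encode(queries, normalize_embeddings=True)
  d_emb = model.encode(docs, normalize_embeddings=True)

  for q, row in zip(queries, q_emb @ d_emb.T):  # dot product of unit vectors = cosine
      print(f"{q}:")
      for d, score in sorted(zip(docs, row), key=lambda x: -x[1]):
          print(f"\t{d}: {score:.4f}")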

The results become even more different (not to say worse) when MRL-reduced dimensions are used.
So for me as a user — someone who wants to use this for RAG and similar applications — only the full ST model is truly useful.
And if I were to apply the dense layers myself on the client side, it would be quite impractical and most likely inefficient.

As a user, I would have preferred the following:

  • That llama.cpp supports both base-model-only and ST-model modes.
  • That convert supports either full ST conversion or base-model-only conversion.
  • That both ST GGUF models and base models can be loaded.
  • (Optional) That applying the dense layers can be toggled in server embedding requests.

I could also imagine accommodating or implementing ST modules in a more generic way, similar to how LoRA adapters are handled.

Sorry for making this so long, but this model is an important one for users like me.
It’s very efficient and has a high MTEB ranking — but for that to hold true, the dense layers are crucial.

@ggerganov
Member

@sfallah Thanks for the detailed description - this is quite helpful. The main reason support for Dense embedding modules was not implemented is that until recently (i.e. until our work with @danbev on EmbeddingGemma) I had no idea what their purpose is and how they are used. But now it is more clear.

We should add some way to support that. It seems it would involve generalizing/extending the pooling logic/API as well as (optionally) incorporating the modules (i.e. tensors) into the GGUFs during conversion.

(Optional) That applying the dense layers can be toggled in server embedding requests.

On first thought, the configuration of the dense modules would have to be done on the llama_context level, so dynamically switching the modules per request might not be possible to support. But if this is an important use case, we can think of ways to accommodate it, though it would require changes in both the server embedding API and the libllama API 🤔

Since you have some first steps towards adding support for dense modules with this PR, we can continue with designing and implementing support for dense module configurations. Let me know if you are interested in putting some extra work into this, and I will try to provide steps on how to proceed.

@sfallah
Author

sfallah commented Oct 2, 2025

@ggerganov
Thank you for taking the time to review this PR.
I’m very interested in collaborating with you and the team on this.

@danbev
Member

danbev commented Oct 2, 2025

@sfallah Thanks for the detailed explanation! This does seem very important for RAG use cases.

I've added #16387 for updating the model-conversion example (tool) which we've used for a few models now. I've tried this out with your pull request and it seems to work. Hopefully we can update it as this work progresses and be prepared for future models that require the same type of features.

@ggerganov
Member

I wonder if there is very simple solution that we can do:

  • Add options during the conversion to specify which dense modules to include
  • Update llama.cpp graphs to look for the dense modules as it is done in this PR and unconditionally apply the tensors to the graph if they are present

This way a user can create a GGUF that either includes the dense modules or not depending on what they need. This makes the implementation much simpler as we don't have to extend the API. But it creates some burden for the user - they would have to be careful which GGUF they are using.

In any case, doing it like this is basically a first step towards the more general support later that would allow to turn on and off the dense modules during context creation.
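
To make the conversion half of this concrete, a rough sketch of how the optional inclusion could look on the convert side (the directory layout follows the 2_Dense/3_Dense modules mentioned earlier; the file name, state-dict key, and output tensor names are assumptions, not necessarily what this PR uses):

  # Rough sketch: optionally collect the ST Dense weights so they can be written
  # into the GGUF alongside the base model tensors.
  from pathlib import Path
  from safetensors.torch import load_file

  def collect_dense_tensors(model_dir, include_dense_modules):
      tensors = {}
      if not include_dense_modules:
          return tensors  # base-model-only conversion
      for sub, out_name in (("2_Dense", "dense_2.weight"), ("3_Dense", "dense_3.weight")):
          path = Path(model_dir) / sub / "model.safetensors"  # older exports may use pytorch_model.bin
          if path.exists():
              state = load_file(str(path))
              tensors[out_name] = state["linear.weight"]  # key and output names are assumptions
      return tensors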

@sfallah
Author

sfallah commented Oct 2, 2025

@ggerganov
I totally agree; if I may say so, this is basically what I also meant with my suggestions.

In my opinion, it is not a burden for the user (at least not for me) to know whether the GGUF model being deployed includes dense layers or not, in the same way that I need to know which quantization type my GGUF model has.

The only issue is flexibility regarding pooling, which I would, in practice, not see as a problem for the following reasons:

  1. Which pooling type is practically applicable to a given model is model-dependent.
  2. The option is practically binary: the model's "default" pooling or none.

I know the second point is a bit oversimplified, but in practice, it is generally true for embedding and reranker models.

So I don't see any problem if, for example, in the case of the embeddinggemma model with dense layers included, the pooling set by the user is ignored, because dense layers require mean pooling.

model: add support for EmbeddingGemma SentenceTransformers dense linear projections

- converting model with dense-layers is optional
- introduced dense config params
@sfallah sfallah requested a review from ggerganov as a code owner October 4, 2025 06:58
@sfallah
Author

sfallah commented Oct 4, 2025

Overview of Changes

  • Added the --sentence-transformers-dense-modules conversion option to support including Sentence Transformers (ST) dense layers.
    • Currently, this option applies only to EmbeddingGemma.
  • Dense layers are now added to the graph when they are present in the GGUF file.
  • The configuration of dense modules is now read as the first step toward full, generic dense-module support.

About Module Configuration

By reading the dense-module configuration, we lay the groundwork for full linear projection support.
At the moment, EmbeddingGemma dense layers only include a weight (no bias), and the activation is the identity.
But ST dense layers can represent complete linear projections, including both biases and non-identity activations.
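
As an illustration of that generic case, a small sketch of the projection an ST Dense module defines; its config.json records in_features, out_features, bias and activation_function (for EmbeddingGemma the bias is absent and the activation is the identity):

  # Sketch of the generic ST Dense projection: y = activation(x @ W^T + b).
  import numpy as np

  def apply_dense(x, weight, bias=None, activation=None):
      # x: (n, in_features), weight: (out_features, in_features) as stored by a linear layer
      y = x @ weight.T
      if bias is not None:
          y = y + bias
      return activation(y) if activation is not None else y

  # EmbeddingGemma-style usage: no bias, identity activation (shapes are illustrative only)
  pooled = np.random.rand(1, 768).astype(np.float32)
  w = np.random.rand(3072, 768).astype(np.float32)
  projected = apply_dense(pooled, w)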
