
Conversation

@aaronspring

@aaronspring aaronspring commented Oct 18, 2025

Summary

  • Add support for mixedbread-ai/deepset-mxbai-embed-de-large-v1 German/English bilingual embedding model
  • Implement automatic query/passage prefix support for mxbai models
  • Add model description to fastembed/text/onnx_embedding.py
  • Add canonical test vectors to tests/test_text_onnx_embeddings.py

Model Details

Key Features

Automatic Prefix Support

Both mxbai models now automatically apply the required prefixes when using the query_embed() and passage_embed() methods:

  • mxbai-embed-large-v1:

    • Query: "Represent this sentence for searching relevant passages: "
    • Passage: "" (empty)
  • deepset-mxbai-embed-de-large-v1:

    • Query: "query: "
    • Passage: "passage: "

Prefix values are documented with direct links to HuggingFace model cards.

Usage Example

from fastembed import TextEmbedding

# German model
model = TextEmbedding("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# Prefixes are added automatically!
query_embeddings = list(model.query_embed(["Was ist FastEmbed?"]))
doc_embeddings = list(model.passage_embed(["FastEmbed ist eine Python-Bibliothek"]))

# English model
model = TextEmbedding("mixedbread-ai/mxbai-embed-large-v1")
query_embeddings = list(model.query_embed(["What is FastEmbed?"]))

Implementation Details

  1. Prefix Configuration: Added tasks field to model descriptions with query_prefix and passage_prefix
  2. Automatic Application: Overrode query_embed() and passage_embed() in OnnxTextEmbedding to automatically prepend prefixes
  3. Backward Compatible: Falls back gracefully for models without prefix configuration
  4. Extensible: Pattern can be applied to other models requiring prefixes
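The prefix-application and graceful-fallback behavior described above can be sketched as follows (a minimal illustration; `apply_prefix` and the flat dict shape for `tasks` are hypothetical stand-ins, not the actual fastembed internals):

```python
from typing import Iterable, Optional


def apply_prefix(texts: Iterable[str], prefix: Optional[str]) -> list[str]:
    # Graceful fallback: no prefix configured -> texts pass through unchanged
    if not prefix:
        return list(texts)
    return [prefix + text for text in texts]


# Example: the German model's query/passage prefixes (from the model card)
tasks = {"query_prefix": "query: ", "passage_prefix": "passage: "}
queries = apply_prefix(["Was ist FastEmbed?"], tasks.get("query_prefix"))
print(queries)  # ['query: Was ist FastEmbed?']
```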

Test Plan

  • Generated canonical vectors using the reference implementation
  • Verified embeddings match canonical values within tolerance (atol=1e-3)
  • Model loads successfully
  • Embedding generation works for test documents
  • Verified automatic prefixes match manual prefixes exactly
  • Tested both mxbai models with automatic prefix functionality
  • CI tests will validate across all platforms
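The "automatic prefixes match manual prefixes" check reduces to comparing the strings handed to the model; a minimal sketch (the prefix string is from the mxbai-embed-large-v1 model card, and `auto_prefixed` is a hypothetical stand-in for the input handling inside query_embed()):

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def auto_prefixed(texts: list[str]) -> list[str]:
    # what query_embed() now feeds to the model automatically
    return [QUERY_PREFIX + t for t in texts]


# a user manually prepending the prefix produces identical model input
manual = [QUERY_PREFIX + "What is FastEmbed?"]
assert auto_prefixed(["What is FastEmbed?"]) == manual
```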

Related

Follows the same pattern as #270

🤖 Generated with Claude Code

…g model

Add the mixedbread-ai/deepset-mxbai-embed-de-large-v1 model to the supported models list. This is a German/English bilingual embedding model with 1024 dimensions, optimized for semantic search tasks in the German language.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@coderabbitai

coderabbitai bot commented Oct 18, 2025

📝 Walkthrough

Walkthrough

This PR adds two DenseModelDescription entries to supported_onnx_models for mixedbread-ai/mxbai-embed-large-v1 and mixedbread-ai/deepset-mxbai-embed-de-large-v1 (both 1024-dim, ONNX model files, token truncation, license and size metadata, and optional query/passage prefixes). It introduces OnnxTextEmbedding.query_embed and .passage_embed (and matching methods on OnnxTextEmbeddingWorker) which apply model-specific query_prefix or passage_prefix from model_description.tasks to inputs before delegating to the existing embedding flow. Tests were updated to include a canonical embedding vector for the deepset-mxbai model.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Changes are additive: metadata entries, two new public methods plus worker variants, and a test vector.
  • Review should focus on correctness of model metadata, prefix application logic, input normalization, and test alignment.

Suggested reviewers

  • hh-space-invader

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed — The title "feat: Add support for deepset-mxbai-embed-de-large-v1 German embedding model" is fully related to the main objective of the changeset, which is adding support for a new German/English bilingual embedding model. The title is concise, specific, and clearly communicates the primary change. While it doesn't capture all modifications (such as the new query_embed/passage_embed methods or the mxbai-embed-large-v1 addition), the criteria acknowledge that titles don't need comprehensive coverage of every detail.
  • Description Check ✅ Passed — The pull request description is clearly related to the changeset, providing specific details about what is being added (model support, model descriptions, test vectors), where they are being added (specific file paths), and relevant model details (dimensions, languages, license). The description is substantive and not vague, with concrete references to files, a test plan with verification steps, and a related PR for context. All elements of the description correspond directly to modifications shown in the raw_summary.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)

111-122: Model entry looks correct; consider adding explicit prefixes if required.

Fields (dim=1024, license, size, model_file) look consistent. If the model expects query/doc prefixes, add them to tasks for clarity and future-proofing.

If applicable, add:

 DenseModelDescription(
     model="mixedbread-ai/deepset-mxbai-embed-de-large-v1",
     dim=1024,
+    tasks={
+        "embedding": {
+            # Update if the model’s recommended prompts differ
+            "query_prefix": "search_query: ",
+            "document_prefix": "search_document: ",
+        }
+    },
     description=(
         "Text embeddings, Unimodal (text), German/English, 512 input tokens truncation, "
         "Prefixes for queries/documents: necessary, 2024 year."
     ),
     license="apache-2.0",
     size_in_GB=1.94,
     sources=ModelSource(hf="mixedbread-ai/deepset-mxbai-embed-de-large-v1"),
     model_file="onnx/model.onnx",
 ),

Also sanity-check that the ONNX output is 2D [batch, 1024]; if 3D token-wise, our post-process takes [:, 0], which should align with the model’s recommended pooling.
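The shape sanity check suggested above can be sketched with numpy alone (no ONNX runtime needed); `pool` is an illustrative helper, not the repo's actual post-processing function:

```python
import numpy as np


def pool(model_output: np.ndarray) -> np.ndarray:
    # 2D [batch, dim]: already pooled, pass through unchanged
    if model_output.ndim == 2:
        return model_output
    # 3D [batch, tokens, dim]: take the CLS token at position 0
    if model_output.ndim == 3:
        return model_output[:, 0]
    raise ValueError(f"unexpected output rank: {model_output.ndim}")


assert pool(np.zeros((4, 1024))).shape == (4, 1024)
assert pool(np.zeros((4, 512, 1024))).shape == (4, 1024)
```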

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba1f605 and 5512d96.

📒 Files selected for processing (2)
  • fastembed/text/onnx_embedding.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (1)
fastembed/common/model_description.py (2)
  • DenseModelDescription (35-40)
  • ModelSource (7-20)
🔇 Additional comments (1)
tests/test_text_onnx_embeddings.py (1)

60-62: Canonical vector added — LGTM.

Entry matches existing pattern (5 dims, used with np.allclose at atol=1e-3). Please confirm the vector was generated via this repo’s ONNX code path and normalization to avoid drift vs. PyTorch reference.
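A sketch of that canonical-vector comparison pattern (the numeric values below are placeholders for illustration, not the real test vectors):

```python
import numpy as np

# Placeholder 5-dim canonical slice; the real values come from the
# reference implementation's ONNX code path
canonical = np.array([0.01, -0.02, 0.03, -0.04, 0.05])
computed = np.array([0.0101, -0.0199, 0.0301, -0.0401, 0.0499])

# compare only the leading dims of the 1024-dim embedding, as the suite does
assert np.allclose(computed, canonical, atol=1e-3)
```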

@aaronspring
Author

Thanks for the review!

Regarding the suggestion to add explicit query/doc prefixes in the tasks field:

After reviewing the codebase, I found that the tasks field is currently only used for multi-task models (like jinaai/jina-embeddings-v3) where different task IDs are mapped to specific embedding objectives. For regular embedding models, prefix information is documented in the description field rather than implemented as a feature.

The current implementation follows the same pattern as the existing mixedbread-ai/mxbai-embed-large-v1 model (lines 99-110), which also has "Prefixes for queries/documents: necessary" in the description but doesn't use the tasks field.

According to the HuggingFace model card, this model expects:

  • Query prefix: "query: {query}"
  • Document prefix: "passage: {doc}"

However, implementing these prefixes automatically would require extending the base functionality beyond the scope of this PR. For now, users need to add the prefixes manually to their input text, which is consistent with how other models like mxbai-embed-large-v1 are currently handled.

The ONNX output is verified to be 2D [batch, 1024] with CLS token pooling and normalization, matching the expected behavior.
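The CLS-token pooling plus normalization described above can be sketched as (an illustrative numpy version, not the repo's actual post-processing code):

```python
import numpy as np


def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    # token_embeddings: [batch, tokens, dim]; take the CLS token,
    # then L2-normalize each row to unit length
    cls = token_embeddings[:, 0]
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / norms


out = cls_pool_and_normalize(np.random.rand(2, 16, 1024))
assert out.shape == (2, 1024)
assert np.allclose(np.linalg.norm(out, axis=1), 1.0)
```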

Implement automatic prefix functionality for mixedbread-ai models:
- Add query_prefix and passage_prefix configuration in model descriptions
- Override query_embed() and passage_embed() in OnnxTextEmbedding to automatically prepend prefixes
- mxbai-embed-large-v1: Uses "Represent this sentence for searching relevant passages: " for queries
- deepset-mxbai-embed-de-large-v1: Uses "query: " for queries and "passage: " for documents

This eliminates the need for users to manually add prefixes when using query_embed() and passage_embed() methods.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@aaronspring
Author

Update: Implemented Automatic Prefix Support

I've implemented the automatic query/passage prefix functionality as suggested in the review! 🎉

Changes Made

  1. Added prefix configuration to model descriptions using the tasks field:

    • mxbai-embed-large-v1:
      • Query prefix: "Represent this sentence for searching relevant passages: "
      • Passage prefix: "" (empty)
    • deepset-mxbai-embed-de-large-v1:
      • Query prefix: "query: "
      • Passage prefix: "passage: "
  2. Implemented automatic prefix logic by overriding query_embed() and passage_embed() methods in OnnxTextEmbedding

    • The methods automatically prepend the configured prefixes to input texts
    • Falls back gracefully for models without prefix configuration

Usage

Users can now use these models without manually adding prefixes:

from fastembed import TextEmbedding

# German model
model = TextEmbedding("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# Prefixes are added automatically!
query_embeddings = list(model.query_embed(["Was ist FastEmbed?"]))
doc_embeddings = list(model.passage_embed(["FastEmbed ist eine Python-Bibliothek"]))

# English model
model = TextEmbedding("mixedbread-ai/mxbai-embed-large-v1")
query_embeddings = list(model.query_embed(["What is FastEmbed?"]))

Testing

  • ✅ Verified automatic prefixes match manual prefixes exactly
  • ✅ Canonical vectors still pass (embed() method unchanged)
  • ✅ Works for both models

This approach is extensible and can be applied to other models in the future!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)

317-339: LGTM! Query prefix handling is correctly implemented.

The method properly applies model-specific query prefixes when present in model_description.tasks and safely delegates to the base embed() method. The type handling for both string and iterable inputs is correct, and the None-check for tasks provides good defensive programming.

Note: When a prefix is present, iterables are materialized into lists (line 335). This is acceptable for typical batch processing workflows, though it differs slightly from the base class behavior which preserves iterable types.
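That behavioral difference can be illustrated with a tiny sketch (a hypothetical `with_prefix` helper standing in for the override's input handling):

```python
from typing import Iterable, Optional, Union


def with_prefix(
    texts: Iterable[str], prefix: Optional[str]
) -> Union[Iterable[str], list[str]]:
    # prefix configured: the iterable is materialized into a list
    if prefix:
        return [prefix + t for t in texts]
    # no prefix: the original iterable is handed through unchanged (stays lazy)
    return texts


gen = (t for t in ["a", "b"])
assert with_prefix(gen, None) is gen                 # lazy iterable preserved
assert with_prefix(iter(["a"]), "q: ") == ["q: a"]   # materialized list
```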

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5512d96 and a05e0de.

📒 Files selected for processing (1)
  • fastembed/text/onnx_embedding.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (2)
fastembed/common/model_description.py (1)
  • DenseModelDescription (35-40)
fastembed/text/text_embedding_base.py (3)
  • query_embed (46-61)
  • embed (22-29)
  • passage_embed (31-44)
🔇 Additional comments (3)
fastembed/text/onnx_embedding.py (3)

110-114: LGTM! Tasks field enables automatic prefix handling.

The addition of the tasks field with explicit query/passage prefixes enables the new query_embed() and passage_embed() methods to automatically apply the correct prefixes for this model. The query prefix follows the model's recommended usage pattern.


115-129: LGTM! New German/English bilingual embedding model looks correct.

The model metadata aligns with the PR objectives and HuggingFace model card. The query/passage prefixes match the expected format ("query: " and "passage: "), and all required fields are properly populated.


341-359: LGTM! Passage prefix handling follows the same pattern as query_embed.

The implementation correctly applies model-specific passage prefixes and safely delegates to embed(). The None-check for tasks ensures robustness, and the logic is straightforward.

Add comment links to HuggingFace model cards documenting the source of the query and passage prefix strings for both mxbai models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
fastembed/text/onnx_embedding.py (2)

319-341: Enhance docstring to document automatic prefix behavior.

The method correctly implements automatic prefix application, but the docstring doesn't inform users that prefixes are added automatically based on model_description.tasks. Users may not realize their queries are being modified.

Apply this diff to enhance the docstring:

     def query_embed(self, query: Union[str, Iterable[str]], **kwargs: Any) -> Iterable[NumpyArray]:
         """
-        Embeds queries with optional query prefix.
+        Embeds queries with optional automatic query prefix.
+        
+        If the model has a query_prefix defined in its tasks configuration,
+        it will be automatically prepended to each query.

         Args:
             query (Union[str, Iterable[str]]): The query to embed, or an iterable e.g. list of queries.
+            **kwargs: Additional keyword arguments to pass to the embed method.

         Returns:
             Iterable[NumpyArray]: The embeddings.
         """

343-361: Enhance docstring to document automatic prefix behavior.

Similar to query_embed, the docstring should inform users that prefixes are automatically applied when configured in the model's tasks.

Apply this diff to enhance the docstring:

     def passage_embed(self, texts: Iterable[str], **kwargs: Any) -> Iterable[NumpyArray]:
         """
-        Embeds passages with optional passage prefix.
+        Embeds passages with optional automatic passage prefix.
+        
+        If the model has a passage_prefix defined in its tasks configuration,
+        it will be automatically prepended to each passage.

         Args:
             texts (Iterable[str]): The list of texts to embed.
             **kwargs: Additional keyword arguments to pass to the embed method.

         Yields:
             Iterable[NumpyArray]: The embeddings.
         """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a05e0de and 851b902.

📒 Files selected for processing (1)
  • fastembed/text/onnx_embedding.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (3)
fastembed/common/model_description.py (1)
  • DenseModelDescription (35-40)
fastembed/text/text_embedding_base.py (3)
  • query_embed (46-61)
  • embed (22-29)
  • passage_embed (31-44)
fastembed/text/text_embedding.py (3)
  • query_embed (189-200)
  • embed (165-187)
  • passage_embed (202-214)
🔇 Additional comments (1)
fastembed/text/onnx_embedding.py (1)

100-132: Breaking change and model inconsistency concerns verified; code implementation is correct.

The method implementations for query_embed() and passage_embed() are correct and properly handle automatic prefix application with appropriate checks for empty strings. Docstrings document the prefix behavior.

However, the original concerns remain valid:

  1. Breaking change: The modification to mxbai-embed-large-v1 adds automatic prefix support where it didn't exist before. Users currently calling these methods with manually added prefixes will now receive double-prefixed inputs, degrading embedding quality.

  2. API inconsistency: Six other models explicitly marked "Prefixes for queries/documents: necessary" (BAAI/bge-base-en, BAAI/bge-small-en, snowflake-arctic-embed-m, snowflake-arctic-embed-m-long, GritLM/GritLM-7B, all-MiniLM models) lack the tasks field for automatic prefixing. Only the two models added/modified in this PR have automatic prefix support, creating inconsistent behavior across similar models.

Required actions:

  1. Document this as a breaking change in release notes with migration guidance
  2. Decide whether to extend automatic prefix support to other models marked "necessary" for API consistency, or document why only these two models have this feature
  3. Consider version compatibility: should users on existing versions get an automatic migration path?
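The double-prefix hazard in point 1 is easy to see with a tiny sketch (the prefix string is from the model card; `query_embed_input` is an illustrative stand-in for the new automatic behavior):

```python
PREFIX = "query: "


def query_embed_input(text: str) -> str:
    # new behavior: the prefix is prepended automatically
    return PREFIX + text


# A user who previously added the prefix manually now gets it twice:
legacy_input = PREFIX + "Was ist FastEmbed?"
print(query_embed_input(legacy_input))
# 'query: query: Was ist FastEmbed?'  <- degrades embedding quality
```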

aaronspring added a commit to aaronspring/fastembed that referenced this pull request Oct 20, 2025
Add support for jinaai/jina-embeddings-v3, a multilingual embedding model
with 1024 dimensions supporting 89+ languages and task-specific LoRA adapters.

Features:
- Task-specific embeddings via LoRA adapters (retrieval.query, retrieval.passage,
  classification, text-matching, separation)
- Automatic task_id handling for ONNX inference
- Default to text-matching task for general purpose use
- query_embed() and passage_embed() methods for retrieval tasks
- Matryoshka dimensions support (32-1024)
- 8,192 token context window

Model specs:
- 570M parameters
- 2.29 GB ONNX model
- Apache 2.0 license

Implementation:
- Added model configuration with additional_files for model.onnx_data
- Load lora_adaptations from config.json
- Preprocess ONNX input to add task_id parameter
- Override query_embed/passage_embed for automatic task selection
- Added comprehensive multi-task test with canonical vectors

Following the pattern from PR qdrant#561 but using task_id instead of text prefixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>