-
Notifications
You must be signed in to change notification settings - Fork 165
feat: Add support for deepset-mxbai-embed-de-large-v1 German embedding model #561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add support for deepset-mxbai-embed-de-large-v1 German embedding model #561
Conversation
…g model Add mixedbread-ai/deepset-mxbai-embed-de-large-v1 model to the supported models list. This is a German/English bilingual embedding model with 1024 dimensions, optimized for semantic search tasks in German language. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
📝 WalkthroughWalkthroughThis PR adds two DenseModelDescription entries to supported_onnx_models for mixedbread-ai/mxbai-embed-large-v1 and mixedbread-ai/deepset-mxbai-embed-de-large-v1 (both 1024-dim, ONNX model files, token truncation, license and size metadata, and optional query/passage prefixes). It introduces OnnxTextEmbedding.query_embed and .passage_embed (and matching methods on OnnxTextEmbeddingWorker) which apply model-specific query_prefix or passage_prefix from model_description.tasks to inputs before delegating to the existing embedding flow. Tests were updated to include a canonical embedding vector for the deepset-mxbai model. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)
111-122: Model entry looks correct; consider adding explicit prefixes if required.Fields (dim=1024, license, size, model_file) look consistent. If the model expects query/doc prefixes, add them to tasks for clarity and future-proofing.
If applicable, add:
DenseModelDescription( model="mixedbread-ai/deepset-mxbai-embed-de-large-v1", dim=1024, + tasks={ + "embedding": { + # Update if the model’s recommended prompts differ + "query_prefix": "search_query: ", + "document_prefix": "search_document: ", + } + }, description=( "Text embeddings, Unimodal (text), German/English, 512 input tokens truncation, " "Prefixes for queries/documents: necessary, 2024 year." ), license="apache-2.0", size_in_GB=1.94, sources=ModelSource(hf="mixedbread-ai/deepset-mxbai-embed-de-large-v1"), model_file="onnx/model.onnx", ),Also sanity-check that the ONNX output is 2D [batch, 1024]; if 3D token-wise, our post-process takes [:, 0], which should align with the model’s recommended pooling.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fastembed/text/onnx_embedding.py(1 hunks)tests/test_text_onnx_embeddings.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (1)
fastembed/common/model_description.py (2)
DenseModelDescription(35-40)ModelSource(7-20)
🔇 Additional comments (1)
tests/test_text_onnx_embeddings.py (1)
60-62: Canonical vector added — LGTM.Entry matches existing pattern (5 dims, used with np.allclose at atol=1e-3). Please confirm the vector was generated via this repo’s ONNX code path and normalization to avoid drift vs. PyTorch reference.
|
Thanks for the review! Regarding the suggestion to add explicit query/doc prefixes in the After reviewing the codebase, I found that the The current implementation follows the same pattern as the existing According to the HuggingFace model card, this model expects:
However, implementing these prefixes automatically would require extending the base functionality beyond the scope of this PR. For now, users need to add the prefixes manually to their input text, which is consistent with how other models like The ONNX output is verified to be 2D [batch, 1024] with CLS token pooling and normalization, matching the expected behavior. |
Implement automatic prefix functionality for mixedbread-ai models: - Add query_prefix and passage_prefix configuration in model descriptions - Override query_embed() and passage_embed() in OnnxTextEmbedding to automatically prepend prefixes - mxbai-embed-large-v1: Uses "Represent this sentence for searching relevant passages: " for queries - deepset-mxbai-embed-de-large-v1: Uses "query: " for queries and "passage: " for documents This eliminates the need for users to manually add prefixes when using query_embed() and passage_embed() methods. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Update: Implemented Automatic Prefix SupportI've implemented the automatic query/passage prefix functionality as suggested in the review! 🎉 Changes Made
UsageUsers can now use these models without manually adding prefixes: from fastembed import TextEmbedding
# German model
model = TextEmbedding("mixedbread-ai/deepset-mxbai-embed-de-large-v1")
# Prefixes are added automatically!
query_embeddings = list(model.query_embed(["Was ist FastEmbed?"]))
doc_embeddings = list(model.passage_embed(["FastEmbed ist eine Python-Bibliothek"]))
# English model
model = TextEmbedding("mixedbread-ai/mxbai-embed-large-v1")
query_embeddings = list(model.query_embed(["What is FastEmbed?"]))Testing
This approach is extensible and can be applied to other models in the future! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)
317-339: LGTM! Query prefix handling is correctly implemented.The method properly applies model-specific query prefixes when present in
model_description.tasksand safely delegates to the baseembed()method. The type handling for both string and iterable inputs is correct, and the None-check fortasksprovides good defensive programming.Note: When a prefix is present, iterables are materialized into lists (line 335). This is acceptable for typical batch processing workflows, though it differs slightly from the base class behavior which preserves iterable types.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/text/onnx_embedding.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (2)
fastembed/common/model_description.py (1)
DenseModelDescription(35-40)fastembed/text/text_embedding_base.py (3)
query_embed(46-61)embed(22-29)passage_embed(31-44)
🔇 Additional comments (3)
fastembed/text/onnx_embedding.py (3)
110-114: LGTM! Tasks field enables automatic prefix handling.The addition of the
tasksfield with explicit query/passage prefixes enables the newquery_embed()andpassage_embed()methods to automatically apply the correct prefixes for this model. The query prefix follows the model's recommended usage pattern.
115-129: LGTM! New German/English bilingual embedding model looks correct.The model metadata aligns with the PR objectives and HuggingFace model card. The query/passage prefixes match the expected format ("query: " and "passage: "), and all required fields are properly populated.
341-359: LGTM! Passage prefix handling follows the same pattern as query_embed.The implementation correctly applies model-specific passage prefixes and safely delegates to
embed(). The None-check fortasksensures robustness, and the logic is straightforward.
Add comment links to HuggingFace model cards documenting the source of the query and passage prefix strings for both mxbai models. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
fastembed/text/onnx_embedding.py (2)
319-341: Enhance docstring to document automatic prefix behavior.The method correctly implements automatic prefix application, but the docstring doesn't inform users that prefixes are added automatically based on
model_description.tasks. Users may not realize their queries are being modified.Apply this diff to enhance the docstring:
def query_embed(self, query: Union[str, Iterable[str]], **kwargs: Any) -> Iterable[NumpyArray]: """ - Embeds queries with optional query prefix. + Embeds queries with optional automatic query prefix. + + If the model has a query_prefix defined in its tasks configuration, + it will be automatically prepended to each query. Args: query (Union[str, Iterable[str]]): The query to embed, or an iterable e.g. list of queries. + **kwargs: Additional keyword arguments to pass to the embed method. Returns: Iterable[NumpyArray]: The embeddings. """
343-361: Enhance docstring to document automatic prefix behavior.Similar to
query_embed, the docstring should inform users that prefixes are automatically applied when configured in the model's tasks.Apply this diff to enhance the docstring:
def passage_embed(self, texts: Iterable[str], **kwargs: Any) -> Iterable[NumpyArray]: """ - Embeds passages with optional passage prefix. + Embeds passages with optional automatic passage prefix. + + If the model has a passage_prefix defined in its tasks configuration, + it will be automatically prepended to each passage. Args: texts (Iterable[str]): The list of texts to embed. **kwargs: Additional keyword arguments to pass to the embed method. Yields: Iterable[NumpyArray]: The embeddings. """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/text/onnx_embedding.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (3)
fastembed/common/model_description.py (1)
DenseModelDescription(35-40)fastembed/text/text_embedding_base.py (3)
query_embed(46-61)embed(22-29)passage_embed(31-44)fastembed/text/text_embedding.py (3)
query_embed(189-200)embed(165-187)passage_embed(202-214)
🔇 Additional comments (1)
fastembed/text/onnx_embedding.py (1)
100-132: Breaking change and model inconsistency concerns verified; code implementation is correct.The method implementations for
query_embed()andpassage_embed()are correct and properly handle automatic prefix application with appropriate checks for empty strings. Docstrings document the prefix behavior.However, the original concerns remain valid:
Breaking change: The modification to
mxbai-embed-large-v1adds automatic prefix support where it didn't exist before. Users currently calling these methods with manually added prefixes will now receive double-prefixed inputs, degrading embedding quality.API inconsistency: Six other models explicitly marked "Prefixes for queries/documents: necessary" (BAAI/bge-base-en, BAAI/bge-small-en, snowflake-arctic-embed-m, snowflake-arctic-embed-m-long, GritLM/GritLM-7B, all-MiniLM models) lack the
tasksfield for automatic prefixing. Only the two models added/modified in this PR have automatic prefix support, creating inconsistent behavior across similar models.Required actions:
- Document this as a breaking change in release notes with migration guidance
- Decide whether to extend automatic prefix support to other models marked "necessary" for API consistency, or document why only these two models have this feature
- Consider version compatibility: should users on existing versions get an automatic migration path?
Add support for jinaai/jina-embeddings-v3, a multilingual embedding model with 1024 dimensions supporting 89+ languages and task-specific LoRA adapters. Features: - Task-specific embeddings via LoRA adapters (retrieval.query, retrieval.passage, classification, text-matching, separation) - Automatic task_id handling for ONNX inference - Default to text-matching task for general purpose use - query_embed() and passage_embed() methods for retrieval tasks - Matryoshka dimensions support (32-1024) - 8,192 token context window Model specs: - 570M parameters - 2.29 GB ONNX model - Apache 2.0 license Implementation: - Added model configuration with additional_files for model.onnx_data - Load lora_adaptations from config.json - Preprocess ONNX input to add task_id parameter - Override query_embed/passage_embed for automatic task selection - Added comprehensive multi-task test with canonical vectors Following the pattern from PR qdrant#561 but using task_id instead of text prefixes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Summary
mixedbread-ai/deepset-mxbai-embed-de-large-v1German/English bilingual embedding modelfastembed/text/onnx_embedding.pytests/test_text_onnx_embeddings.pyModel Details
onnx/model.onnxKey Features
Automatic Prefix Support
Both mxbai models now automatically apply the required prefixes when using
query_embed()andpassage_embed()methods:mxbai-embed-large-v1:
"Represent this sentence for searching relevant passages: """(empty)deepset-mxbai-embed-de-large-v1:
"query: ""passage: "Prefix values are documented with direct links to HuggingFace model cards.
Usage Example
Implementation Details
tasksfield to model descriptions withquery_prefixandpassage_prefixquery_embed()andpassage_embed()inOnnxTextEmbeddingto automatically prepend prefixesTest Plan
Related
Follows the same pattern as #270
🤖 Generated with Claude Code