
Conversation

@aaronspring

@aaronspring aaronspring commented Oct 18, 2025

Summary

  • Add support for mixedbread-ai/deepset-mxbai-embed-de-large-v1 German/English bilingual embedding model
  • Implement automatic query/passage prefix support for mxbai models
  • Add model description to fastembed/text/onnx_embedding.py
  • Add canonical test vectors to tests/test_text_onnx_embeddings.py

Model Details

Key Features

Automatic Prefix Support

Both mxbai models now automatically apply the required prefixes when using the query_embed() and passage_embed() methods:

  • mxbai-embed-large-v1:

    • Query: "Represent this sentence for searching relevant passages: "
    • Passage: "" (empty)
  • deepset-mxbai-embed-de-large-v1:

    • Query: "query: "
    • Passage: "passage: "

Prefix values are documented with direct links to HuggingFace model cards.

Usage Example

from fastembed import TextEmbedding

# German model
model = TextEmbedding("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# Prefixes are added automatically!
query_embeddings = list(model.query_embed(["Was ist FastEmbed?"]))
doc_embeddings = list(model.passage_embed(["FastEmbed ist eine Python-Bibliothek"]))

# English model
model = TextEmbedding("mixedbread-ai/mxbai-embed-large-v1")
query_embeddings = list(model.query_embed(["What is FastEmbed?"]))

Implementation Details

  1. Prefix Configuration: Added tasks field to model descriptions with query_prefix and passage_prefix
  2. Automatic Application: Overrode query_embed() and passage_embed() in OnnxTextEmbedding to automatically prepend prefixes
  3. Backward Compatible: Falls back gracefully for models without prefix configuration
  4. Extensible: Pattern can be applied to other models requiring prefixes
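The prefix-application and graceful-fallback behavior described above can be sketched as follows (a minimal illustration; `apply_prefix` and the flat dict shape for `tasks` are hypothetical stand-ins, not the actual fastembed internals):

```python
from typing import Iterable, Optional


def apply_prefix(texts: Iterable[str], prefix: Optional[str]) -> list[str]:
    # Graceful fallback: no prefix configured -> texts pass through unchanged
    if not prefix:
        return list(texts)
    return [prefix + text for text in texts]


# Example: the German model's query/passage prefixes (from the model card)
tasks = {"query_prefix": "query: ", "passage_prefix": "passage: "}
queries = apply_prefix(["Was ist FastEmbed?"], tasks.get("query_prefix"))
print(queries)  # ['query: Was ist FastEmbed?']
```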

Test Plan

  • Generated canonical vectors using the reference implementation
  • Verified embeddings match canonical values within tolerance (atol=1e-3)
  • Model loads successfully
  • Embedding generation works for test documents
  • Verified automatic prefixes match manual prefixes exactly
  • Tested both mxbai models with automatic prefix functionality
  • CI tests will validate across all platforms
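The "automatic prefixes match manual prefixes" check reduces to comparing the strings handed to the model; a minimal sketch (the prefix string is from the mxbai-embed-large-v1 model card, and `auto_prefixed` is a hypothetical stand-in for the input handling inside query_embed()):

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def auto_prefixed(texts: list[str]) -> list[str]:
    # what query_embed() now feeds to the model automatically
    return [QUERY_PREFIX + t for t in texts]


# a user manually prepending the prefix produces identical model input
manual = [QUERY_PREFIX + "What is FastEmbed?"]
assert auto_prefixed(["What is FastEmbed?"]) == manual
```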

Related

Follows the same pattern as #270

🤖 Generated with Claude Code

…g model

Add the mixedbread-ai/deepset-mxbai-embed-de-large-v1 model to the supported models list. This is a German/English bilingual embedding model with 1024 dimensions, optimized for semantic search tasks in the German language.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@coderabbitai

coderabbitai bot commented Oct 18, 2025

📝 Walkthrough

Walkthrough

This PR adds two DenseModelDescription entries to supported_onnx_models for mixedbread-ai/mxbai-embed-large-v1 and mixedbread-ai/deepset-mxbai-embed-de-large-v1 (both 1024-dim, ONNX model files, token truncation, license and size metadata, and optional query/passage prefixes). It introduces OnnxTextEmbedding.query_embed and .passage_embed (and matching methods on OnnxTextEmbeddingWorker) which apply model-specific query_prefix or passage_prefix from model_description.tasks to inputs before delegating to the existing embedding flow. Tests were updated to include a canonical embedding vector for the deepset-mxbai model.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Changes are additive: metadata entries, two new public methods plus worker variants, and a test vector.
  • Review should focus on correctness of model metadata, prefix application logic, input normalization, and test alignment.

Suggested reviewers

  • hh-space-invader

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed — The title "feat: Add support for deepset-mxbai-embed-de-large-v1 German embedding model" is fully related to the main objective of the changeset, which is adding support for a new German/English bilingual embedding model. The title is concise, specific, and clearly communicates the primary change. While it doesn't capture all modifications (such as the new query_embed/passage_embed methods or the mxbai-embed-large-v1 addition), the criteria acknowledge that titles don't need comprehensive coverage of every detail.
  • Description Check ✅ Passed — The pull request description is clearly related to the changeset, providing specific details about what is being added (model support, model descriptions, test vectors), where they are being added (specific file paths), and relevant model details (dimensions, languages, license). The description is substantive and not vague, with concrete references to files, a test plan with verification steps, and a related PR for context. All elements of the description correspond directly to modifications shown in the raw_summary.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)

111-122: Model entry looks correct; consider adding explicit prefixes if required.

Fields (dim=1024, license, size, model_file) look consistent. If the model expects query/doc prefixes, add them to tasks for clarity and future-proofing.

If applicable, add:

 DenseModelDescription(
     model="mixedbread-ai/deepset-mxbai-embed-de-large-v1",
     dim=1024,
+    tasks={
+        "embedding": {
+            # Update if the model’s recommended prompts differ
+            "query_prefix": "search_query: ",
+            "document_prefix": "search_document: ",
+        }
+    },
     description=(
         "Text embeddings, Unimodal (text), German/English, 512 input tokens truncation, "
         "Prefixes for queries/documents: necessary, 2024 year."
     ),
     license="apache-2.0",
     size_in_GB=1.94,
     sources=ModelSource(hf="mixedbread-ai/deepset-mxbai-embed-de-large-v1"),
     model_file="onnx/model.onnx",
 ),

Also sanity-check that the ONNX output is 2D [batch, 1024]; if 3D token-wise, our post-process takes [:, 0], which should align with the model’s recommended pooling.
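The shape sanity check suggested above can be sketched with numpy alone (no ONNX runtime needed); `pool` is an illustrative helper, not the repo's actual post-processing function:

```python
import numpy as np


def pool(model_output: np.ndarray) -> np.ndarray:
    # 2D [batch, dim]: already pooled, pass through unchanged
    if model_output.ndim == 2:
        return model_output
    # 3D [batch, tokens, dim]: take the CLS token at position 0
    if model_output.ndim == 3:
        return model_output[:, 0]
    raise ValueError(f"unexpected output rank: {model_output.ndim}")


assert pool(np.zeros((4, 1024))).shape == (4, 1024)
assert pool(np.zeros((4, 512, 1024))).shape == (4, 1024)
```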

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba1f605 and 5512d96.

📒 Files selected for processing (2)
  • fastembed/text/onnx_embedding.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (1)
fastembed/common/model_description.py (2)
  • DenseModelDescription (35-40)
  • ModelSource (7-20)
🔇 Additional comments (1)
tests/test_text_onnx_embeddings.py (1)

60-62: Canonical vector added — LGTM.

Entry matches existing pattern (5 dims, used with np.allclose at atol=1e-3). Please confirm the vector was generated via this repo’s ONNX code path and normalization to avoid drift vs. PyTorch reference.
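A sketch of that canonical-vector comparison pattern (the numeric values below are placeholders for illustration, not the real test vectors):

```python
import numpy as np

# Placeholder 5-dim canonical slice; the real values come from the
# reference implementation's ONNX code path
canonical = np.array([0.01, -0.02, 0.03, -0.04, 0.05])
computed = np.array([0.0101, -0.0199, 0.0301, -0.0401, 0.0499])

# compare only the leading dims of the 1024-dim embedding, as the suite does
assert np.allclose(computed, canonical, atol=1e-3)
```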

@aaronspring
Author

Thanks for the review!

Regarding the suggestion to add explicit query/doc prefixes in the tasks field:

After reviewing the codebase, I found that the tasks field is currently only used for multi-task models (like jinaai/jina-embeddings-v3) where different task IDs are mapped to specific embedding objectives. For regular embedding models, prefix information is documented in the description field rather than implemented as a feature.

The current implementation follows the same pattern as the existing mixedbread-ai/mxbai-embed-large-v1 model (lines 99-110), which also has "Prefixes for queries/documents: necessary" in the description but doesn't use the tasks field.

According to the HuggingFace model card, this model expects:

  • Query prefix: "query: {query}"
  • Document prefix: "passage: {doc}"

However, implementing these prefixes automatically would require extending the base functionality beyond the scope of this PR. For now, users need to add the prefixes manually to their input text, which is consistent with how other models like mxbai-embed-large-v1 are currently handled.

The ONNX output is verified to be 2D [batch, 1024] with CLS token pooling and normalization, matching the expected behavior.
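The CLS-token pooling plus normalization described above can be sketched as (an illustrative numpy version, not the repo's actual post-processing code):

```python
import numpy as np


def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    # token_embeddings: [batch, tokens, dim]; take the CLS token,
    # then L2-normalize each row to unit length
    cls = token_embeddings[:, 0]
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / norms


out = cls_pool_and_normalize(np.random.rand(2, 16, 1024))
assert out.shape == (2, 1024)
assert np.allclose(np.linalg.norm(out, axis=1), 1.0)
```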

Implement automatic prefix functionality for mixedbread-ai models:
- Add query_prefix and passage_prefix configuration in model descriptions
- Override query_embed() and passage_embed() in OnnxTextEmbedding to automatically prepend prefixes
- mxbai-embed-large-v1: Uses "Represent this sentence for searching relevant passages: " for queries
- deepset-mxbai-embed-de-large-v1: Uses "query: " for queries and "passage: " for documents

This eliminates the need for users to manually add prefixes when using query_embed() and passage_embed() methods.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@aaronspring
Author

Update: Implemented Automatic Prefix Support

I've implemented the automatic query/passage prefix functionality as suggested in the review! 🎉

Changes Made

  1. Added prefix configuration to model descriptions using the tasks field:

    • mxbai-embed-large-v1:
      • Query prefix: "Represent this sentence for searching relevant passages: "
      • Passage prefix: "" (empty)
    • deepset-mxbai-embed-de-large-v1:
      • Query prefix: "query: "
      • Passage prefix: "passage: "
  2. Implemented automatic prefix logic by overriding query_embed() and passage_embed() methods in OnnxTextEmbedding

    • The methods automatically prepend the configured prefixes to input texts
    • Falls back gracefully for models without prefix configuration

Usage

Users can now use these models without manually adding prefixes:

from fastembed import TextEmbedding

# German model
model = TextEmbedding("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# Prefixes are added automatically!
query_embeddings = list(model.query_embed(["Was ist FastEmbed?"]))
doc_embeddings = list(model.passage_embed(["FastEmbed ist eine Python-Bibliothek"]))

# English model
model = TextEmbedding("mixedbread-ai/mxbai-embed-large-v1")
query_embeddings = list(model.query_embed(["What is FastEmbed?"]))

Testing

  • ✅ Verified automatic prefixes match manual prefixes exactly
  • ✅ Canonical vectors still pass (embed() method unchanged)
  • ✅ Works for both models

This approach is extensible and can be applied to other models in the future!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)

317-339: LGTM! Query prefix handling is correctly implemented.

The method properly applies model-specific query prefixes when present in model_description.tasks and safely delegates to the base embed() method. The type handling for both string and iterable inputs is correct, and the None-check for tasks provides good defensive programming.

Note: When a prefix is present, iterables are materialized into lists (line 335). This is acceptable for typical batch processing workflows, though it differs slightly from the base class behavior which preserves iterable types.
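That behavioral difference can be illustrated with a tiny sketch (a hypothetical `with_prefix` helper standing in for the override's input handling):

```python
from typing import Iterable, Optional, Union


def with_prefix(
    texts: Iterable[str], prefix: Optional[str]
) -> Union[Iterable[str], list[str]]:
    # prefix configured: the iterable is materialized into a list
    if prefix:
        return [prefix + t for t in texts]
    # no prefix: the original iterable is handed through unchanged (stays lazy)
    return texts


gen = (t for t in ["a", "b"])
assert with_prefix(gen, None) is gen                 # lazy iterable preserved
assert with_prefix(iter(["a"]), "q: ") == ["q: a"]   # materialized list
```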

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5512d96 and a05e0de.

📒 Files selected for processing (1)
  • fastembed/text/onnx_embedding.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (2)
fastembed/common/model_description.py (1)
  • DenseModelDescription (35-40)
fastembed/text/text_embedding_base.py (3)
  • query_embed (46-61)
  • embed (22-29)
  • passage_embed (31-44)
🔇 Additional comments (3)
fastembed/text/onnx_embedding.py (3)

110-114: LGTM! Tasks field enables automatic prefix handling.

The addition of the tasks field with explicit query/passage prefixes enables the new query_embed() and passage_embed() methods to automatically apply the correct prefixes for this model. The query prefix follows the model's recommended usage pattern.


115-129: LGTM! New German/English bilingual embedding model looks correct.

The model metadata aligns with the PR objectives and HuggingFace model card. The query/passage prefixes match the expected format ("query: " and "passage: "), and all required fields are properly populated.


341-359: LGTM! Passage prefix handling follows the same pattern as query_embed.

The implementation correctly applies model-specific passage prefixes and safely delegates to embed(). The None-check for tasks ensures robustness, and the logic is straightforward.

Add comment links to HuggingFace model cards documenting the source of the query and passage prefix strings for both mxbai models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
fastembed/text/onnx_embedding.py (2)

319-341: Enhance docstring to document automatic prefix behavior.

The method correctly implements automatic prefix application, but the docstring doesn't inform users that prefixes are added automatically based on model_description.tasks. Users may not realize their queries are being modified.

Apply this diff to enhance the docstring:

     def query_embed(self, query: Union[str, Iterable[str]], **kwargs: Any) -> Iterable[NumpyArray]:
         """
-        Embeds queries with optional query prefix.
+        Embeds queries with optional automatic query prefix.
+        
+        If the model has a query_prefix defined in its tasks configuration,
+        it will be automatically prepended to each query.

         Args:
             query (Union[str, Iterable[str]]): The query to embed, or an iterable e.g. list of queries.
+            **kwargs: Additional keyword arguments to pass to the embed method.

         Returns:
             Iterable[NumpyArray]: The embeddings.
         """

343-361: Enhance docstring to document automatic prefix behavior.

Similar to query_embed, the docstring should inform users that prefixes are automatically applied when configured in the model's tasks.

Apply this diff to enhance the docstring:

     def passage_embed(self, texts: Iterable[str], **kwargs: Any) -> Iterable[NumpyArray]:
         """
-        Embeds passages with optional passage prefix.
+        Embeds passages with optional automatic passage prefix.
+        
+        If the model has a passage_prefix defined in its tasks configuration,
+        it will be automatically prepended to each passage.

         Args:
             texts (Iterable[str]): The list of texts to embed.
             **kwargs: Additional keyword arguments to pass to the embed method.

         Yields:
             Iterable[NumpyArray]: The embeddings.
         """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a05e0de and 851b902.

📒 Files selected for processing (1)
  • fastembed/text/onnx_embedding.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (3)
fastembed/common/model_description.py (1)
  • DenseModelDescription (35-40)
fastembed/text/text_embedding_base.py (3)
  • query_embed (46-61)
  • embed (22-29)
  • passage_embed (31-44)
fastembed/text/text_embedding.py (3)
  • query_embed (189-200)
  • embed (165-187)
  • passage_embed (202-214)
🔇 Additional comments (1)
fastembed/text/onnx_embedding.py (1)

100-132: Breaking change and model inconsistency concerns verified; code implementation is correct.

The method implementations for query_embed() and passage_embed() are correct and properly handle automatic prefix application with appropriate checks for empty strings. Docstrings document the prefix behavior.

However, the original concerns remain valid:

  1. Breaking change: The modification to mxbai-embed-large-v1 adds automatic prefix support where it didn't exist before. Users currently calling these methods with manually added prefixes will now receive double-prefixed inputs, degrading embedding quality.

  2. API inconsistency: Six other models explicitly marked "Prefixes for queries/documents: necessary" (BAAI/bge-base-en, BAAI/bge-small-en, snowflake-arctic-embed-m, snowflake-arctic-embed-m-long, GritLM/GritLM-7B, all-MiniLM models) lack the tasks field for automatic prefixing. Only the two models added/modified in this PR have automatic prefix support, creating inconsistent behavior across similar models.

Required actions:

  1. Document this as a breaking change in release notes with migration guidance
  2. Decide whether to extend automatic prefix support to other models marked "necessary" for API consistency, or document why only these two models have this feature
  3. Consider version compatibility: should users on existing versions get an automatic migration path?
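The double-prefix hazard in point 1 is easy to see with a tiny sketch (the prefix string is from the model card; `query_embed_input` is an illustrative stand-in for the new automatic behavior):

```python
PREFIX = "query: "


def query_embed_input(text: str) -> str:
    # new behavior: the prefix is prepended automatically
    return PREFIX + text


# A user who previously added the prefix manually now gets it twice:
legacy_input = PREFIX + "Was ist FastEmbed?"
print(query_embed_input(legacy_input))
# 'query: query: Was ist FastEmbed?'  <- degrades embedding quality
```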

aaronspring added a commit to aaronspring/fastembed that referenced this pull request Oct 20, 2025
Add support for jinaai/jina-embeddings-v3, a multilingual embedding model
with 1024 dimensions supporting 89+ languages and task-specific LoRA adapters.

Features:
- Task-specific embeddings via LoRA adapters (retrieval.query, retrieval.passage,
  classification, text-matching, separation)
- Automatic task_id handling for ONNX inference
- Default to text-matching task for general purpose use
- query_embed() and passage_embed() methods for retrieval tasks
- Matryoshka dimensions support (32-1024)
- 8,192 token context window

Model specs:
- 570M parameters
- 2.29 GB ONNX model
- Apache 2.0 license

Implementation:
- Added model configuration with additional_files for model.onnx_data
- Load lora_adaptations from config.json
- Preprocess ONNX input to add task_id parameter
- Override query_embed/passage_embed for automatic task selection
- Added comprehensive multi-task test with canonical vectors

Following the pattern from PR qdrant#561 but using task_id instead of text prefixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>