Add Snowflake Arctic Embed L v2.0 model #562

aaronspring · 2025-10-20T18:43:20Z

Summary

Add support for Snowflake/snowflake-arctic-embed-l-v2.0, a multilingual embedding model that improves upon the original Arctic Embed L model with expanded language support and longer context.

Model Features

Dimensions: 1024
Languages: 74 languages (multilingual via XLM-RoBERTa)
Context Length: 8192 tokens (vs 512 in v1)
Architecture: Based on BAAI/bge-m3-retromae (XLM-RoBERTa)
Special Capabilities:
- Supports Matryoshka learning for dimension truncation to 256 dims
- Compatible with 4-bit quantization
- Query prefix recommended: query:
License: Apache 2.0
Model Size: 2.27 GB (ONNX format with separate data file)
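
For reviewers unfamiliar with Matryoshka-style truncation, here is a minimal sketch of what "dimension truncation to 256 dims" usually means in practice: keep the leading components and rescale to unit length. This is not code from the PR. The helper name `truncate_embedding` is hypothetical, and the input vector is synthetic stand-in data, not model output.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int = 256) -> List[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if dim <= 0 or dim > len(vector):
        raise ValueError(f"dim must be in 1..{len(vector)}, got {dim}")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector has nothing to rescale
    return [x / norm for x in head]


# Synthetic stand-in for a 1024-dimensional embedding (assumption, not real data).
full = [1.0 / (i + 1) for i in range(1024)]
short = truncate_embedding(full, dim=256)
```

Renormalizing after truncation keeps cosine similarity and dot product interchangeable on the shortened vectors.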

Changes

Added model configuration to supported_onnx_models list in fastembed/text/onnx_embedding.py
Added canonical test vectors to tests/test_text_onnx_embeddings.py
Model includes both onnx/model.onnx and onnx/model.onnx_data files

Test Plan

Model loads successfully from HuggingFace
Embeddings generated with expected shape (batch_size, 1024)
Canonical vectors match within tolerance (atol=1e-3)
Model appears in list of supported models
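
As a rough illustration of the canonical-vector item above, the check amounts to comparing the leading components of an embedding against stored reference values with an absolute tolerance. The sketch below uses only the standard library. The function name `matches_canonical` and all numbers are placeholders, not the vectors added in this PR, and the real test may also apply a relative tolerance.

```python
import math
from typing import Sequence


def matches_canonical(
    embedding: Sequence[float],
    canonical: Sequence[float],
    atol: float = 1e-3,
) -> bool:
    """Compare the first len(canonical) components using absolute tolerance only."""
    prefix = embedding[: len(canonical)]
    if len(prefix) != len(canonical):
        return False  # embedding is shorter than the reference slice
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=atol)
        for a, b in zip(prefix, canonical)
    )


# Placeholder values for illustration only.
reference = [0.1, -0.2, 0.3, 0.0, 0.5]
close_enough = matches_canonical([0.1002, -0.2, 0.3, 0.0, 0.5, 0.9], reference)
too_far = matches_canonical([0.102, -0.2, 0.3, 0.0, 0.5, 0.9], reference)
```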

References

HuggingFace: https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0
Similar PR: feat: Add support for deepset-mxbai-embed-de-large-v1 German embedding model #561

🤖 Generated with Claude Code

Add support for Snowflake/snowflake-arctic-embed-l-v2.0, a multilingual embedding model with the following features:
- 1024 dimensions
- 74 languages support
- 8192 token context length
- Based on XLM-RoBERTa architecture
- Supports Matryoshka learning for dimension truncation
- Apache 2.0 license

Changes:
- Added model configuration to supported_onnx_models list
- Added canonical test vectors for validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

coderabbitai · 2025-10-20T18:45:58Z


📥 Commits

Reviewing files that changed from the base of the PR and between 730634a and 7f41fba.

📒 Files selected for processing (2)

fastembed/text/onnx_embedding.py (2 hunks)
tests/test_text_onnx_embeddings.py (2 hunks)

📝 Walkthrough

A new DenseModelDescription for Snowflake/snowflake-arctic-embed-l-v2.0 (dim=1024, embedding task with query_prefix "query: " and passage_prefix "", ONNX file paths, license, size, and source) was added to supported_onnx_models in fastembed/text/onnx_embedding.py. Two public methods were introduced to OnnxTextEmbedding: query_embed(self, query: Union[str, Iterable[str]], **kwargs) and passage_embed(self, texts: Iterable[str], **kwargs), which apply model-specific query/passage prefixes when present and delegate to the existing embed logic. A canonical 5-element embedding vector for the new model was added to tests/test_text_onnx_embeddings.py.
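
The prefix behaviour described in the walkthrough can be sketched roughly as follows. This is a simplified stand-in, not the code from the PR: `ModelDescription` and `PrefixingEmbedder` are hypothetical names, the `tasks` layout (`{"embedding": {"query_prefix": ..., "passage_prefix": ...}}`) is an assumption based on the summary above, and `embed` returns the prepared texts instead of running a model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Union


@dataclass
class ModelDescription:
    """Hypothetical, reduced stand-in for DenseModelDescription."""
    model: str
    dim: int
    tasks: Dict[str, Any] = field(default_factory=dict)


class PrefixingEmbedder:
    def __init__(self, description: ModelDescription) -> None:
        self.description = description

    def embed(self, texts: Iterable[str]) -> List[str]:
        # Placeholder: a real implementation would return vectors.
        return list(texts)

    def _prefix(self, key: str) -> str:
        # Assumed layout: prefixes live under the "embedding" task.
        return self.description.tasks.get("embedding", {}).get(key, "")

    def query_embed(self, query: Union[str, Iterable[str]]) -> List[str]:
        queries = [query] if isinstance(query, str) else list(query)
        prefix = self._prefix("query_prefix")
        return self.embed(prefix + q for q in queries)

    def passage_embed(self, texts: Iterable[str]) -> List[str]:
        prefix = self._prefix("passage_prefix")
        return self.embed(prefix + t for t in texts)


arctic = PrefixingEmbedder(
    ModelDescription(
        model="snowflake/snowflake-arctic-embed-l-v2.0",
        dim=1024,
        tasks={"embedding": {"query_prefix": "query: ", "passage_prefix": ""}},
    )
)
plain = PrefixingEmbedder(ModelDescription(model="example/no-prefix", dim=8))
```

With this layout, models that define no tasks fall through to an empty prefix, which is how backward compatibility is preserved in the sketch.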

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "Add Snowflake Arctic Embed L v2.0 model" directly and clearly describes the primary change in the changeset. According to the raw summary, the main modifications are adding a new model configuration for Snowflake/snowflake-arctic-embed-l-v2.0 to the supported models list, along with supporting methods and test vectors. The title accurately captures this core change with concise and specific language, making it clear to reviewers what the PR accomplishes without needing to list secondary changes like the new methods or test updates.
Description Check	✅ Passed	The PR description is well-related to the changeset and provides meaningful information. It explains the model being added, details its key features (dimensions, language support, context length, architecture), documents the specific code changes made, includes a verification test plan with checkmarks, and provides references. The description is comprehensive and directly corresponds to the changes summarized in the raw summary, including the model configuration addition and test vectors, making it far more than just related but actually thorough and helpful for understanding the PR's purpose and scope.


Implement automatic prefix handling for models with task-specific prefixes:
- Added tasks field to model configuration with query_prefix and passage_prefix
- Implemented query_embed() method to automatically prepend "query: " prefix
- Implemented passage_embed() method (no prefix for this model)
- Both methods check for tasks configuration and apply prefixes dynamically

This enables optimal retrieval performance as recommended in the model documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

fastembed/text/onnx_embedding.py (1)
331-333: Simplify the condition by removing redundant hasattr check.

Since tasks is defined in DenseModelDescription with a default factory (see relevant code snippets), hasattr(self.model_description, "tasks") will always return True. The check self.model_description.tasks alone is sufficient to determine if tasks are configured.

Apply this diff:
-        if hasattr(self.model_description, "tasks") and self.model_description.tasks:
+        if self.model_description.tasks:
The same simplification applies to line 360 in passage_embed.
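
To see why the `hasattr` check is redundant, consider a small dataclass whose field has a default factory. The class below is a hypothetical, simplified stand-in for `DenseModelDescription`, written only to illustrate the point: the attribute always exists, so only its truthiness carries information.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Description:
    """Hypothetical stand-in with a default-factory `tasks` field."""
    model: str
    tasks: Dict[str, Any] = field(default_factory=dict)


plain = Description(model="example/no-tasks")
with_tasks = Description(
    model="example/with-tasks",
    tasks={"embedding": {"query_prefix": "query: "}},
)

# The attribute is present even when no tasks were passed in.
has_attr_plain = hasattr(plain, "tasks")
# Truthiness is what distinguishes configured from unconfigured models.
truthy_plain = bool(plain.tasks)
truthy_with_tasks = bool(with_tasks.tasks)
```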

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6f8bc1 and 730634a.

📒 Files selected for processing (1)

fastembed/text/onnx_embedding.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

fastembed/text/onnx_embedding.py (2)

fastembed/common/model_description.py (1)

DenseModelDescription (35-40)

fastembed/text/text_embedding_base.py (3)

query_embed (46-61)

embed (22-29)

passage_embed (31-44)

🔇 Additional comments (3)

fastembed/text/onnx_embedding.py (3)

317-346: Implementation correctly handles prefix application and input types.

The method properly:

Applies the query prefix when configured in the model's tasks

Handles both single string and iterable inputs correctly

Maintains backward compatibility for models without task-specific prefixes

Delegates to the existing embed method appropriately

348-369: Implementation correctly handles passage prefix application.

The method properly:

Applies the passage prefix when configured in the model's tasks

Converts the input iterable to a list when applying prefixes (acceptable for this use case)

Maintains backward compatibility for models without task-specific prefixes

Delegates to the existing embed method appropriately

Note: The same hasattr simplification mentioned in the previous comment applies to line 360.

171-190: Standardize model identifier casing for consistency.

The new model uses Snowflake/ (capital S) while all existing Snowflake Arctic Embed models in this file use snowflake/ (lowercase). HuggingFace Hub treats repository identifiers case-insensitively and resolves the canonical repo regardless of casing, so both forms work in practice. However, for code consistency, either:

Change this model to snowflake/snowflake-arctic-embed-l-v2.0 (lowercase) to match lines 112, 124, 136, 148, 160, or

Document why capital S is used if it's intentional (capital S is the canonical HuggingFace identifier)

fastembed/text/onnx_embedding.py

Fixes based on CodeRabbit review:
1. Remove redundant hasattr checks - tasks field has default factory
2. Fix model identifier casing from Snowflake/ to snowflake/ for consistency
3. Add comprehensive tests for prefix functionality:
   - test_query_passage_prefix: Verifies query prefix is applied correctly
   - test_prefix_backward_compatibility: Ensures models without prefix config work

All tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

coderabbitai bot reviewed Oct 20, 2025
