@aaronspring
Summary

Add support for Snowflake/snowflake-arctic-embed-l-v2.0, a multilingual embedding model that improves upon the original Arctic Embed L model with expanded language support and longer context.

Model Features

  • Dimensions: 1024
  • Languages: 74 languages (multilingual via XLM-RoBERTa)
  • Context Length: 8192 tokens (vs 512 in v1)
  • Architecture: Based on BAAI/bge-m3-retromae (XLM-RoBERTa)
  • Special Capabilities:
    • Supports Matryoshka Representation Learning (MRL), so embeddings can be truncated to 256 dimensions
    • Compatible with 4-bit quantization
    • Recommended query prefix: "query: " (passages take no prefix)
  • License: Apache 2.0
  • Model Size: 2.27 GB (ONNX format with separate data file)
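
To illustrate the dimension truncation mentioned under Special Capabilities, here is a minimal sketch. It assumes the usual Matryoshka-style recipe of keeping the leading components and L2-normalizing again. The helper name `truncate_embedding` is hypothetical and not part of fastembed, and plain Python lists stand in for NumPy arrays so the snippet runs without dependencies.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int = 256) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector has nothing to normalize
    return [x / norm for x in head]


# Deterministic stand-in for a 1024-dim embedding; not real model output.
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_embedding(full, dim=256)
```

Truncated vectors should only be compared with other vectors truncated to the same size.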

Changes

  • Added model configuration to supported_onnx_models list in fastembed/text/onnx_embedding.py
  • Added canonical test vectors to tests/test_text_onnx_embeddings.py
  • Model includes both onnx/model.onnx and onnx/model.onnx_data files

Test Plan

  • Model loads successfully from HuggingFace
  • Embeddings generated with expected shape (batch_size, 1024)
  • Canonical vectors match within tolerance (atol=1e-3)
  • Model appears in list of supported models
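
The canonical-vector check in the test plan can be sketched roughly as follows. This assumes the stored vector covers only the leading components of the embedding and uses a purely absolute tolerance. `matches_canonical` is a hypothetical helper, and the numbers are made up for illustration; they are not the vectors added in this PR.

```python
from typing import Sequence


def matches_canonical(
    embedding: Sequence[float], canonical: Sequence[float], atol: float = 1e-3
) -> bool:
    """Compare the leading components of an embedding with a stored prefix."""
    head = embedding[: len(canonical)]
    if len(head) != len(canonical):
        return False  # embedding is shorter than the stored prefix
    return all(abs(a - b) <= atol for a, b in zip(head, canonical))


# Illustrative values only; not taken from tests/test_text_onnx_embeddings.py.
canonical = [0.0123, -0.0456, 0.0789, 0.0012, -0.0345]
embedding = [0.0125, -0.0452, 0.0791, 0.0010, -0.0349] + [0.0] * 1019
```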

References

🤖 Generated with Claude Code

Add support for Snowflake/snowflake-arctic-embed-l-v2.0, a multilingual embedding model with the following features:
- 1024 dimensions
- 74 languages support
- 8192 token context length
- Based on XLM-RoBERTa architecture
- Supports Matryoshka learning for dimension truncation
- Apache 2.0 license

Changes:
- Added model configuration to supported_onnx_models list
- Added canonical test vectors for validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@coderabbitai

coderabbitai bot commented Oct 20, 2025

Warning

Rate limit exceeded

@aaronspring has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 32 seconds before requesting another review.

📥 Commits

Reviewing files that changed from the base of the PR and between 730634a and 7f41fba.

📒 Files selected for processing (2)
  • fastembed/text/onnx_embedding.py (2 hunks)
  • tests/test_text_onnx_embeddings.py (2 hunks)
📝 Walkthrough

A new DenseModelDescription for Snowflake/snowflake-arctic-embed-l-v2.0 (dim=1024, embedding task with query_prefix "query: " and passage_prefix "", ONNX file paths, license, size, and source) was added to supported_onnx_models in fastembed/text/onnx_embedding.py. Two public methods were introduced to OnnxTextEmbedding: query_embed(self, query: Union[str, Iterable[str]], **kwargs) and passage_embed(self, texts: Iterable[str], **kwargs), which apply model-specific query/passage prefixes when present and delegate to the existing embed logic. A canonical 5-element embedding vector for the new model was added to tests/test_text_onnx_embeddings.py.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The PR title "Add Snowflake Arctic Embed L v2.0 model" directly and clearly describes the primary change in the changeset. According to the raw summary, the main modifications are adding a new model configuration for Snowflake/snowflake-arctic-embed-l-v2.0 to the supported models list, along with supporting methods and test vectors. The title accurately captures this core change with concise and specific language, making it clear to reviewers what the PR accomplishes without needing to list secondary changes like the new methods or test updates. |
| Description Check | ✅ Passed | The PR description is well-related to the changeset and provides meaningful information. It explains the model being added, details its key features (dimensions, language support, context length, architecture), documents the specific code changes made, includes a verification test plan with checkmarks, and provides references. The description is comprehensive and directly corresponds to the changes summarized in the raw summary, including the model configuration addition and test vectors, making it far more than just related but actually thorough and helpful for understanding the PR's purpose and scope. |

Implement automatic prefix handling for models with task-specific prefixes:
- Added tasks field to model configuration with query_prefix and passage_prefix
- Implemented query_embed() method to automatically prepend "query: " prefix
- Implemented passage_embed() method (no prefix for this model)
- Both methods check for tasks configuration and apply prefixes dynamically

This enables optimal retrieval performance as recommended in the model documentation.
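
The prefix handling described in this commit can be sketched as below. This is a simplified illustration, not the code from the PR: `ModelDescription` and `PrefixingEmbedder` are hypothetical stand-ins for `DenseModelDescription` and `OnnxTextEmbedding`, the nested layout of `tasks` is an assumption, and `embed` simply echoes its input so the effect of the prefix is visible.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Union


@dataclass
class ModelDescription:  # hypothetical stand-in for DenseModelDescription
    model: str
    dim: int
    tasks: Dict[str, Any] = field(default_factory=dict)


class PrefixingEmbedder:  # hypothetical stand-in for OnnxTextEmbedding
    def __init__(self, description: ModelDescription) -> None:
        self.model_description = description

    def embed(self, documents: Iterable[str], **kwargs: Any) -> List[str]:
        # A real model returns vectors; this stub echoes the texts it receives.
        return list(documents)

    def _prefix(self, key: str) -> str:
        # Assumed layout: tasks = {"embedding": {"query_prefix": ..., ...}}
        return self.model_description.tasks.get("embedding", {}).get(key, "")

    def query_embed(self, query: Union[str, Iterable[str]], **kwargs: Any) -> List[str]:
        queries = [query] if isinstance(query, str) else list(query)
        prefix = self._prefix("query_prefix")
        return self.embed([prefix + q for q in queries], **kwargs)

    def passage_embed(self, texts: Iterable[str], **kwargs: Any) -> List[str]:
        prefix = self._prefix("passage_prefix")
        return self.embed([prefix + t for t in texts], **kwargs)


arctic = PrefixingEmbedder(
    ModelDescription(
        model="snowflake/snowflake-arctic-embed-l-v2.0",
        dim=1024,
        tasks={"embedding": {"query_prefix": "query: ", "passage_prefix": ""}},
    )
)
plain = PrefixingEmbedder(ModelDescription(model="example/no-prefix", dim=384))

prefixed_query = arctic.query_embed("what is fastembed?")
unchanged_passage = arctic.passage_embed(["fastembed is a library"])
unchanged_query = plain.query_embed("what is fastembed?")
```

Models without a `tasks` entry fall back to an empty prefix, so their inputs pass through unchanged.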

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)

331-333: Simplify the condition by removing redundant hasattr check.

Since tasks is defined in DenseModelDescription with a default factory (see relevant code snippets), hasattr(self.model_description, "tasks") will always return True. The check self.model_description.tasks alone is sufficient to determine if tasks are configured.

Apply this diff:

-        if hasattr(self.model_description, "tasks") and self.model_description.tasks:
+        if self.model_description.tasks:

The same simplification applies to line 360 in passage_embed.
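
A small sketch of why the `hasattr` check is redundant, assuming `tasks` is a dataclass field with a default factory as the comment above states. The `Description` class here is hypothetical and mirrors only that one field.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Description:  # hypothetical; mirrors only the `tasks` default
    model: str
    tasks: Dict[str, Any] = field(default_factory=dict)


without_tasks = Description(model="example/no-prefix")
with_tasks = Description(
    model="example/with-prefix",
    tasks={"embedding": {"query_prefix": "query: "}},
)

# The attribute always exists, so hasattr() cannot tell the two cases apart.
has_attr = hasattr(without_tasks, "tasks")
# Truthiness of the dict is what actually signals "tasks are configured".
is_configured = bool(without_tasks.tasks)
```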

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6f8bc1 and 730634a.

📒 Files selected for processing (1)
  • fastembed/text/onnx_embedding.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (2)
fastembed/common/model_description.py (1)
  • DenseModelDescription (35-40)
fastembed/text/text_embedding_base.py (3)
  • query_embed (46-61)
  • embed (22-29)
  • passage_embed (31-44)
🔇 Additional comments (3)
fastembed/text/onnx_embedding.py (3)

317-346: Implementation correctly handles prefix application and input types.

The method properly:

  • Applies the query prefix when configured in the model's tasks
  • Handles both single string and iterable inputs correctly
  • Maintains backward compatibility for models without task-specific prefixes
  • Delegates to the existing embed method appropriately

348-369: Implementation correctly handles passage prefix application.

The method properly:

  • Applies the passage prefix when configured in the model's tasks
  • Converts the input iterable to a list when applying prefixes (acceptable for this use case)
  • Maintains backward compatibility for models without task-specific prefixes
  • Delegates to the existing embed method appropriately

Note: The same hasattr simplification mentioned in the previous comment applies to line 360.


171-190: Standardize model identifier casing for consistency.

The new model uses Snowflake/ (capital S) while all existing Snowflake Arctic Embed models in this file use snowflake/ (lowercase). HuggingFace Hub treats repository identifiers case-insensitively and resolves the canonical repo regardless of casing, so both forms work in practice. However, for code consistency, either:

  1. Change this model to snowflake/snowflake-arctic-embed-l-v2.0 (lowercase) to match lines 112, 124, 136, 148, 160, or
  2. Document why capital S is used if it's intentional (capital S is the canonical HuggingFace identifier)

Fixes based on CodeRabbit review:
1. Remove redundant hasattr checks - tasks field has default factory
2. Fix model identifier casing from Snowflake/ to snowflake/ for consistency
3. Add comprehensive tests for prefix functionality:
   - test_query_passage_prefix: Verifies query prefix is applied correctly
   - test_prefix_backward_compatibility: Ensures models without prefix config work

All tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>