-
Notifications
You must be signed in to change notification settings - Fork 165
Add Snowflake Arctic Embed L v2.0 model #562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add Snowflake Arctic Embed L v2.0 model #562
Conversation
Add support for Snowflake/snowflake-arctic-embed-l-v2.0, a multilingual embedding model with the following features: - 1024 dimensions - 74 languages support - 8192 token context length - Based on XLM-RoBERTa architecture - Supports Matryoshka learning for dimension truncation - Apache 2.0 license Changes: - Added model configuration to supported_onnx_models list - Added canonical test vectors for validation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
|
Warning Rate limit exceeded@aaronspring has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 32 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughA new DenseModelDescription for Snowflake/snowflake-arctic-embed-l-v2.0 (dim=1024, embedding task with query_prefix "query: " and passage_prefix "", ONNX file paths, license, size, and source) was added to Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Implement automatic prefix handling for models with task-specific prefixes: - Added tasks field to model configuration with query_prefix and passage_prefix - Implemented query_embed() method to automatically prepend "query: " prefix - Implemented passage_embed() method (no prefix for this model) - Both methods check for tasks configuration and apply prefixes dynamically This enables optimal retrieval performance as recommended in the model documentation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)
331-333: Simplify the condition by removing redundanthasattrcheck.Since
tasksis defined inDenseModelDescriptionwith a default factory (see relevant code snippets),hasattr(self.model_description, "tasks")will always returnTrue. The checkself.model_description.tasksalone is sufficient to determine if tasks are configured.Apply this diff:
- if hasattr(self.model_description, "tasks") and self.model_description.tasks: + if self.model_description.tasks:The same simplification applies to line 360 in
passage_embed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/text/onnx_embedding.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (2)
fastembed/common/model_description.py (1)
DenseModelDescription(35-40)fastembed/text/text_embedding_base.py (3)
query_embed(46-61)embed(22-29)passage_embed(31-44)
🔇 Additional comments (3)
fastembed/text/onnx_embedding.py (3)
317-346: Implementation correctly handles prefix application and input types.The method properly:
- Applies the query prefix when configured in the model's tasks
- Handles both single string and iterable inputs correctly
- Maintains backward compatibility for models without task-specific prefixes
- Delegates to the existing
embedmethod appropriately
348-369: Implementation correctly handles passage prefix application.The method properly:
- Applies the passage prefix when configured in the model's tasks
- Converts the input iterable to a list when applying prefixes (acceptable for this use case)
- Maintains backward compatibility for models without task-specific prefixes
- Delegates to the existing
embedmethod appropriatelyNote: The same
hasattrsimplification mentioned in the previous comment applies to line 360.
171-190: Standardize model identifier casing for consistency.The new model uses
Snowflake/(capital S) while all existing Snowflake Arctic Embed models in this file usesnowflake/(lowercase). HuggingFace Hub treats repository identifiers case-insensitively and resolves the canonical repo regardless of casing, so both forms work in practice. However, for code consistency, either:
- Change this model to
snowflake/snowflake-arctic-embed-l-v2.0(lowercase) to match lines 112, 124, 136, 148, 160, or- Document why capital S is used if it's intentional (capital S is the canonical HuggingFace identifier)
Fixes based on CodeRabbit review: 1. Remove redundant hasattr checks - tasks field has default factory 2. Fix model identifier casing from Snowflake/ to snowflake/ for consistency 3. Add comprehensive tests for prefix functionality: - test_query_passage_prefix: Verifies query prefix is applied correctly - test_prefix_backward_compatibility: Ensures models without prefix config work All tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Summary
Add support for
Snowflake/snowflake-arctic-embed-l-v2.0, a multilingual embedding model that improves upon the original Arctic Embed L model with expanded language support and longer context.Model Features
query:Changes
supported_onnx_modelslist infastembed/text/onnx_embedding.pytests/test_text_onnx_embeddings.pyonnx/model.onnxandonnx/model.onnx_datafilesTest Plan
References
🤖 Generated with Claude Code