Fix: Add pad_token handling and batch fallback for HuggingFaceCrossEncoder #331
Conversation
Enhance HuggingFaceCrossEncoder with padding token support

Fixes issue #32686: Error while using Qwen/Qwen3-Reranker-0.6B with Cross Encoder Reranker

- Added `_ensure_padding_token` method to check and assign a pad_token if missing
- Added fallback logic in the score() method to handle models without a pad_token by processing text pairs individually (sketched after the diff below)
- Added a tokenizer reference to access the padding token configuration
- Improved error handling for the ValueError raised by batch sizes greater than 1

Tested with:

- Qwen/Qwen3-Reranker-0.6B (a model without a pad_token)
- Standard CrossEncoder models, to ensure no regressions
- Various batch sizes (1, 5, 10+) to validate functionality

Signed-off-by: Ravikant Diwakar <[email protected]>

Added padding token management and error handling for batch sizes in scoring.
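For context, the failure this PR fixes shows up as soon as a model that ships without a pad_token is asked to score a batch. A minimal reproduction sketch, assuming the `langchain_community` import path for `HuggingFaceCrossEncoder` (the query and passage strings are illustrative):

```python
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Qwen/Qwen3-Reranker-0.6B defines no pad_token, so before this fix,
# scoring more than one pair at a time raised a ValueError from the
# tokenizer complaining that no padding token is defined.
encoder = HuggingFaceCrossEncoder(model_name="Qwen/Qwen3-Reranker-0.6B")
scores = encoder.score(
    [
        ("what is a panda?", "The giant panda is a bear species native to China."),
        ("what is a panda?", "Paris is the capital of France."),
    ]
)
print(scores)
```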
```python
def _ensure_padding_token(self):
    """Ensure that a padding token is available for the tokenizer."""
    if self.tokenizer.pad_token is None:
        if self.tokenizer.eos_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.client.config.pad_token_id = self.tokenizer.eos_token_id
        elif hasattr(self.tokenizer, 'unk_token') and self.tokenizer.unk_token:
            self.tokenizer.pad_token = self.tokenizer.unk_token
            self.client.config.pad_token_id = self.tokenizer.unk_token_id
        else:
            self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
            self.client.resize_token_embeddings(len(self.tokenizer))
            self.client.config.pad_token_id = self.tokenizer.pad_token_id
```
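The hunk above covers only `_ensure_padding_token`; the per-pair fallback mentioned in the summary is not part of this excerpt. A minimal sketch of how that fallback could look, assuming `self.client` is a sentence-transformers `CrossEncoder` whose `predict` raises a ValueError when batching fails (the exact implementation in the PR may differ):

```python
def score(self, text_pairs):
    """Compute relevance scores, retrying pair-by-pair if batching fails."""
    self._ensure_padding_token()
    try:
        return self.client.predict(text_pairs)
    except ValueError:
        # Hypothetical fallback: models without a usable pad_token cannot
        # pad a batch, so score each (query, passage) pair on its own.
        return [float(self.client.predict([pair])[0]) for pair in text_pairs]
```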
The new method `_ensure_padding_token` violates the 'Use Google-Style Docstrings (with Args section)' rule. While it has a basic docstring, it lacks the proper Google-style format with an Args section. Even though this is a private method (indicated by the underscore prefix), it should still follow the docstring guidelines for consistency: a proper docstring with an Args section describing any parameters (even though it currently takes none, the format should be established for future maintainability).
Spotted by Diamond (based on custom rule: Code quality)
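For illustration, a Google-style docstring for this method might look as follows; the wording is a sketch, not part of the PR:

```python
def _ensure_padding_token(self) -> None:
    """Ensure that a padding token is available for the tokenizer.

    Falls back to the EOS token, then the UNK token, and finally adds a
    new ``[PAD]`` special token when neither is defined.

    Args:
        None. The method reads and mutates ``self.tokenizer`` and
        ``self.client`` in place.

    Returns:
        None.
    """
```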
```python
if self.tokenizer.pad_token is None:
    if self.tokenizer.eos_token:
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.client.config.pad_token_id = self.tokenizer.eos_token_id
```
Potential AttributeError: the code assumes `self.client.config` exists, but CrossEncoder instances may not have a `config` attribute. This will cause a runtime error when trying to set `pad_token_id`. The code should check that the `config` attribute exists before accessing it, or handle the AttributeError.
Suggested change:

```diff
-self.client.config.pad_token_id = self.tokenizer.eos_token_id
+if hasattr(self.client, "config"):
+    self.client.config.pad_token_id = self.tokenizer.eos_token_id
```
Spotted by Diamond
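One way to act on this, as a sketch: the sentence-transformers `CrossEncoder` wrapper normally stores the underlying transformer as `.model`, so a small helper could resolve the config defensively (the helper name `_model_config` is hypothetical, not from the PR):

```python
def _model_config(self):
    """Hypothetical helper: find the model config wherever it lives.

    Checks the client itself first, then the wrapped transformer model,
    and returns None when neither exposes a config.
    """
    if hasattr(self.client, "config"):
        return self.client.config
    model = getattr(self.client, "model", None)
    if model is not None and hasattr(model, "config"):
        return model.config
    return None
```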
```python
        self.client.config.pad_token_id = self.tokenizer.unk_token_id
    else:
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        self.client.resize_token_embeddings(len(self.tokenizer))
```
Potential AttributeError: the code calls `self.client.resize_token_embeddings()`, but CrossEncoder instances may not have this method; it is typically defined on transformer models, not on CrossEncoder wrappers. This will cause a runtime error when executed.
Suggested change:

```diff
-self.client.resize_token_embeddings(len(self.tokenizer))
+if hasattr(self.client, "resize_token_embeddings"):
+    self.client.resize_token_embeddings(len(self.tokenizer))
```
Spotted by Diamond
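Note that guarding with `hasattr` avoids the crash but silently skips the resize, leaving the freshly added `[PAD]` token without an embedding row. A sketch of delegating to the wrapped model instead, assuming the sentence-transformers `CrossEncoder` exposes it as `.model`:

```python
# Prefer the wrapped transformer model, which does implement
# resize_token_embeddings; fall back to the client itself.
target = getattr(self.client, "model", self.client)
if hasattr(target, "resize_token_embeddings"):
    target.resize_token_embeddings(len(self.tokenizer))
```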