
Conversation

ravikant-diwakar

Enhance HuggingFaceCrossEncoder with padding token support

Fixes issue #32686: Error while using Qwen/Qwen3-Reranker-0.6B with Cross Encoder Reranker

  • Added a _ensure_padding_token method that checks for a pad_token and assigns one if it is missing
  • Added fallback logic in score() that handles models without a pad_token by processing text pairs individually (a sketch of this fallback follows the description)
  • Added a tokenizer reference so the padding-token configuration can be accessed
  • Improved error handling for the ValueError raised when such models are used with batch sizes greater than 1

Tested with:

  • Qwen/Qwen3-Reranker-0.6B (a model without a pad_token)
  • Standard CrossEncoder models to ensure no regressions
  • Various batch sizes (1, 5, 10+) to validate functionality

Added padding token management and error handling for batch sizes in scoring.

Signed-off-by: Ravikant Diwakar <[email protected]>
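
For context, the fallback described in the second bullet above has roughly the following shape. This is a sketch only: it assumes score() wraps sentence_transformers' CrossEncoder.predict (as HuggingFaceCrossEncoder already does) and that _ensure_padding_token is called before scoring; the exact code is in the PR diff.

from typing import List, Tuple

def score(self, text_pairs: List[Tuple[str, str]]) -> List[float]:
    """Score text pairs, falling back to per-pair scoring when batching fails."""
    self._ensure_padding_token()
    try:
        scores = self.client.predict(text_pairs)
    except ValueError:
        # Models without a usable pad_token (e.g. Qwen/Qwen3-Reranker-0.6B)
        # cannot pad a batch, so score each pair with an effective batch size of 1.
        scores = [self.client.predict([pair])[0] for pair in text_pairs]
    return list(scores)
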
Comment on lines +49 to +61
def _ensure_padding_token(self):
    """Ensure that a padding token is available for the tokenizer."""
    if self.tokenizer.pad_token is None:
        if self.tokenizer.eos_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.client.config.pad_token_id = self.tokenizer.eos_token_id
        elif hasattr(self.tokenizer, 'unk_token') and self.tokenizer.unk_token:
            self.tokenizer.pad_token = self.tokenizer.unk_token
            self.client.config.pad_token_id = self.tokenizer.unk_token_id
        else:
            self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
            self.client.resize_token_embeddings(len(self.tokenizer))
            self.client.config.pad_token_id = self.tokenizer.pad_token_id

The new method _ensure_padding_token violates the 'Use Google-Style Docstrings (with Args section)' rule. While it has a basic docstring, it lacks the proper Google-style format with an Args section. Even though this is a private method (indicated by the underscore prefix), it should still follow the docstring guidelines for consistency. The method should include a proper Google-style docstring with an Args section describing any parameters it uses; it currently takes none, but establishing the format now aids future maintainability.

Suggested change
def _ensure_padding_token(self):
    """Ensure that a padding token is available for the tokenizer."""
    if self.tokenizer.pad_token is None:
        if self.tokenizer.eos_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.client.config.pad_token_id = self.tokenizer.eos_token_id
        elif hasattr(self.tokenizer, 'unk_token') and self.tokenizer.unk_token:
            self.tokenizer.pad_token = self.tokenizer.unk_token
            self.client.config.pad_token_id = self.tokenizer.unk_token_id
        else:
            self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
            self.client.resize_token_embeddings(len(self.tokenizer))
            self.client.config.pad_token_id = self.tokenizer.pad_token_id

Spotted by Diamond (based on custom rule: Code quality)

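As an illustration of the format the rule asks for, the docstring could be expanded along these lines; the wording below is illustrative, not part of the diff, and the Args section is included only to establish the convention since the method currently takes no parameters.

def _ensure_padding_token(self):
    """Ensure that a padding token is available for the tokenizer.

    Falls back to the EOS token, then the UNK token, and finally registers a
    new ``[PAD]`` special token when neither is defined. Updates
    ``self.tokenizer`` and ``self.client`` in place.

    Args:
        None.

    Returns:
        None.
    """
    # (method body unchanged)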

if self.tokenizer.pad_token is None:
    if self.tokenizer.eos_token:
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.client.config.pad_token_id = self.tokenizer.eos_token_id

Potential AttributeError: The code assumes self.client.config exists, but CrossEncoder instances may not have a config attribute. This will cause a runtime error when trying to set pad_token_id. The code should check whether the config attribute exists before accessing it, or handle the AttributeError exception.

Suggested change
-        self.client.config.pad_token_id = self.tokenizer.eos_token_id
+        if hasattr(self.client, "config"):
+            self.client.config.pad_token_id = self.tokenizer.eos_token_id

Spotted by Diamond


        self.client.config.pad_token_id = self.tokenizer.unk_token_id
    else:
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        self.client.resize_token_embeddings(len(self.tokenizer))

Potential AttributeError: The code calls self.client.resize_token_embeddings() but CrossEncoder instances may not have this method. This is typically a method on transformer models, not CrossEncoder wrappers. This will cause a runtime error when executed.

Suggested change
-        self.client.resize_token_embeddings(len(self.tokenizer))
+        if hasattr(self.client, "resize_token_embeddings"):
+            self.client.resize_token_embeddings(len(self.tokenizer))

Spotted by Diamond

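Taken together, the two guards suggested above would make the helper look roughly like this. This is a sketch over the reviewed code, not the final diff:

def _ensure_padding_token(self):
    """Ensure that a padding token is available for the tokenizer."""
    if self.tokenizer.pad_token is not None:
        return
    if self.tokenizer.eos_token:
        # Reuse the EOS token as the padding token.
        self.tokenizer.pad_token = self.tokenizer.eos_token
        pad_token_id = self.tokenizer.eos_token_id
    elif getattr(self.tokenizer, "unk_token", None):
        # Fall back to the UNK token when no EOS token exists.
        self.tokenizer.pad_token = self.tokenizer.unk_token
        pad_token_id = self.tokenizer.unk_token_id
    else:
        # Register a brand-new [PAD] token as a last resort.
        self.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        # CrossEncoder wrappers may not expose resize_token_embeddings directly.
        if hasattr(self.client, "resize_token_embeddings"):
            self.client.resize_token_embeddings(len(self.tokenizer))
        pad_token_id = self.tokenizer.pad_token_id
    # CrossEncoder wrappers may not expose a config attribute either.
    if hasattr(self.client, "config"):
        self.client.config.pad_token_id = pad_token_id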
