Fix: Add pad_token handling and batch fallback for HuggingFaceCrossEncoder #331
Conversation
Enhance HuggingFaceCrossEncoder with padding token support

Fixes issue #32686: Error while using Qwen/Qwen3-Reranker-0.6B with Cross Encoder Reranker

- Added `_ensure_padding_token` method to check and assign a pad_token if missing
- Added fallback logic in the score() method to handle models without a pad_token by processing text pairs individually (sketched after the diff below)
- Added a tokenizer reference to access the padding token configuration
- Improved error handling for the ValueError raised by batch sizes greater than 1

Tested with:

- Qwen/Qwen3-Reranker-0.6B (a model without a pad_token)
- Standard CrossEncoder models, to ensure no regressions
- Various batch sizes (1, 5, 10+) to validate functionality

Signed-off-by: Ravikant Diwakar <[email protected]>

Added padding token management and error handling for batch sizes in scoring.
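For context, the failure this PR fixes shows up as soon as a model that ships without a pad_token is asked to score a batch. A minimal reproduction sketch, assuming the `langchain_community` import path for `HuggingFaceCrossEncoder` (the query and passage strings are illustrative):

```python
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Qwen/Qwen3-Reranker-0.6B defines no pad_token, so before this fix,
# scoring more than one pair at a time raised a ValueError from the
# tokenizer complaining that no padding token is defined.
encoder = HuggingFaceCrossEncoder(model_name="Qwen/Qwen3-Reranker-0.6B")
scores = encoder.score(
    [
        ("what is a panda?", "The giant panda is a bear species native to China."),
        ("what is a panda?", "Paris is the capital of France."),
    ]
)
print(scores)
```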
```python
def _ensure_padding_token(self):
    """Ensure that a padding token is available for the tokenizer."""
    if self.tokenizer.pad_token is None:
        if self.tokenizer.eos_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.client.config.pad_token_id = self.tokenizer.eos_token_id
        elif hasattr(self.tokenizer, 'unk_token') and self.tokenizer.unk_token:
            self.tokenizer.pad_token = self.tokenizer.unk_token
            self.client.config.pad_token_id = self.tokenizer.unk_token_id
        else:
            self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
            self.client.resize_token_embeddings(len(self.tokenizer))
            self.client.config.pad_token_id = self.tokenizer.pad_token_id
```
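The hunk above covers only `_ensure_padding_token`; the per-pair fallback mentioned in the summary is not part of this excerpt. A minimal sketch of how that fallback could look, assuming `self.client` is a sentence-transformers `CrossEncoder` whose `predict` raises a ValueError when batching fails (the exact implementation in the PR may differ):

```python
def score(self, text_pairs):
    """Compute relevance scores, retrying pair-by-pair if batching fails."""
    self._ensure_padding_token()
    try:
        return self.client.predict(text_pairs)
    except ValueError:
        # Hypothetical fallback: models without a usable pad_token cannot
        # pad a batch, so score each (query, passage) pair on its own.
        return [float(self.client.predict([pair])[0]) for pair in text_pairs]
```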
The new method `_ensure_padding_token` violates the 'Use Google-Style Docstrings (with Args section)' rule. While it has a basic docstring, it lacks the proper Google-style format with an Args section. Even though this is a private method (indicated by the underscore prefix), it should still follow the docstring guidelines for consistency: a proper docstring with an Args section describing any parameters (even though it currently takes none, the format should be established for future maintainability).
Spotted by Diamond (based on custom rule: Code quality)
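For illustration, a Google-style docstring for this method might look as follows; the wording is a sketch, not part of the PR:

```python
def _ensure_padding_token(self) -> None:
    """Ensure that a padding token is available for the tokenizer.

    Falls back to the EOS token, then the UNK token, and finally adds a
    new ``[PAD]`` special token when neither is defined.

    Args:
        None. The method reads and mutates ``self.tokenizer`` and
        ``self.client`` in place.

    Returns:
        None.
    """
```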
```python
if self.tokenizer.pad_token is None:
    if self.tokenizer.eos_token:
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.client.config.pad_token_id = self.tokenizer.eos_token_id
```
Potential AttributeError: the code assumes `self.client.config` exists, but CrossEncoder instances may not have a `config` attribute. This will cause a runtime error when trying to set `pad_token_id`. The code should check that the `config` attribute exists before accessing it, or handle the AttributeError.
Suggested change:

```diff
-self.client.config.pad_token_id = self.tokenizer.eos_token_id
+if hasattr(self.client, "config"):
+    self.client.config.pad_token_id = self.tokenizer.eos_token_id
```
Spotted by Diamond
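One way to act on this, as a sketch: the sentence-transformers `CrossEncoder` wrapper normally stores the underlying transformer as `.model`, so a small helper could resolve the config defensively (the helper name `_model_config` is hypothetical, not from the PR):

```python
def _model_config(self):
    """Hypothetical helper: find the model config wherever it lives.

    Checks the client itself first, then the wrapped transformer model,
    and returns None when neither exposes a config.
    """
    if hasattr(self.client, "config"):
        return self.client.config
    model = getattr(self.client, "model", None)
    if model is not None and hasattr(model, "config"):
        return model.config
    return None
```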
```python
        self.client.config.pad_token_id = self.tokenizer.unk_token_id
    else:
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        self.client.resize_token_embeddings(len(self.tokenizer))
```
Potential AttributeError: the code calls `self.client.resize_token_embeddings()`, but CrossEncoder instances may not have this method; it is typically defined on transformer models, not on CrossEncoder wrappers. This will cause a runtime error when executed.
Suggested change:

```diff
-self.client.resize_token_embeddings(len(self.tokenizer))
+if hasattr(self.client, "resize_token_embeddings"):
+    self.client.resize_token_embeddings(len(self.tokenizer))
```
Spotted by Diamond
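Note that guarding with `hasattr` avoids the crash but silently skips the resize, leaving the freshly added `[PAD]` token without an embedding row. A sketch of delegating to the wrapped model instead, assuming the sentence-transformers `CrossEncoder` exposes it as `.model`:

```python
# Prefer the wrapped transformer model, which does implement
# resize_token_embeddings; fall back to the client itself.
target = getattr(self.client, "model", self.client)
if hasattr(target, "resize_token_embeddings"):
    target.resize_token_embeddings(len(self.tokenizer))
```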