PARSeq Model #2089

Draft · wants to merge 46 commits into master

Conversation

@sineeli (Collaborator) commented Feb 10, 2025:

No description provided.

@abheesht17 (Collaborator) commented:

@sineeli - which parts of the PR are ready for review? Asking because it's still marked as draft

@sineeli (Collaborator, Author) commented Feb 20, 2025:

Sure, @abheesht17.

The preprocessing and tokenizer parts are good for reviewing first, as they are the primary steps; a rough usage sketch follows the file list:

  1. keras_hub/src/models/parseq/parseq_tokenizer.py
  2. keras_hub/src/models/text_recognition_preprocessor.py
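
A rough sketch of how the tokenizer might be exercised, assuming a default constructor and a direct call on a string like other KerasHub tokenizers; none of this is confirmed as the PR's final API:

from keras_hub.src.models.parseq.parseq_tokenizer import PARSeqTokenizer

# Hypothetical usage; the no-argument constructor is an assumption.
tokenizer = PARSeqTokenizer()
token_ids = tokenizer("HELLO")  # string label -> int32 token ids
chars = tokenizer.id_to_char.lookup(token_ids)  # reverse lookup, ids -> chars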

@abheesht17 (Collaborator) left a comment:

Thanks for the PR! Left some comments on the tokeniser. Will take a look at the text recognition preprocessor soon.

Sorry for the delay in reviewing.

@keras_hub_export(
    [
        "keras_hub.models.PARSeqTokenizer",
    ]
)
class PARSeqTokenizer(tokenizer.Tokenizer):
@abheesht17 (Collaborator):

Please add a doc-string here, with examples. Makes it easier to review when we have examples :P

@abheesht17 (Collaborator):

Let's add unit tests as well

@sineeli (Collaborator, Author):

Yes, will add them
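
For context, a minimal sketch of the kind of unit test this could start from; the test name, the callable usage, and the dtype expectation are assumptions, not the PR's actual tests:

import tensorflow as tf

from keras_hub.src.models.parseq.parseq_tokenizer import PARSeqTokenizer


class PARSeqTokenizerTest(tf.test.TestCase):
    def test_tokenize_returns_int32_ids(self):
        tokenizer = PARSeqTokenizer()
        # Assumed callable on a scalar string, like other KerasHub tokenizers.
        ids = tokenizer("AB")
        self.assertEqual(ids.dtype, tf.int32)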

Comment on lines 64 to 81
# Forward lookup table: character -> token id. Unknown characters fall
# back to id 0.
self.char_to_id = tf.lookup.StaticHashTable(
    initializer=tf.lookup.KeyValueTensorInitializer(
        keys=list(self._stoi.keys()),
        values=list(self._stoi.values()),
        key_dtype=tf.string,
        value_dtype=tf.int32,
    ),
    default_value=0,
)
# Reverse lookup table: token id -> character. Unknown ids fall back to
# the pad token.
self.id_to_char = tf.lookup.StaticHashTable(
    initializer=tf.lookup.KeyValueTensorInitializer(
        keys=list(self._stoi.values()),
        values=list(self._stoi.keys()),
        key_dtype=tf.int32,
        value_dtype=tf.string,
    ),
    default_value=self.pad_token,
)
@abheesht17 (Collaborator):

The defaults don't match: EOS is the 0th token, and pad is the (len(vocabulary) - 1)th token.

@sineeli (Collaborator, Author):

I noticed the same thing in the original code, but it seems they use EOS -> 0 and BOS -> len(vocabulary); when padding, they place BOS first and EOS at the end. A small layout sketch follows.
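
A toy illustration of that layout, following the upstream PARSeq convention of [E] (EOS) first, then the charset, then [B] (BOS) and [P] (PAD); the three-character charset is a stand-in for the real vocabulary:

charset = ["A", "B", "C"]
itos = ["[E]"] + charset + ["[B]", "[P]"]  # id -> token
stoi = {tok: i for i, tok in enumerate(itos)}  # token -> id

eos_id = stoi["[E]"]  # 0
bos_id = stoi["[B]"]  # len(charset) + 1
pad_id = stoi["[P]"]  # len(charset) + 2, the last id

# A padded label "AB" then reads: [B] A B [E] [P] ...
padded = [bos_id, stoi["A"], stoi["B"], eos_id, pad_id]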

    ),
    default_value=0,
)
self.id_to_char = tf.lookup.StaticHashTable(
@abheesht17 (Collaborator):

Do we need this? We aren't using it anywhere

@sineeli (Collaborator, Author):

But it will be helpful in case a user wants to bulk-convert token ids back to characters; a sketch of that use follows.
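
For illustration, a self-contained sketch of that bulk decode path with a toy vocabulary; only the StaticHashTable pattern mirrors the code under review:

import tensorflow as tf

# Toy reverse table: token id -> character.
id_to_char = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant([1, 2, 3], dtype=tf.int32),
        values=tf.constant(["A", "B", "C"]),
    ),
    default_value="[P]",  # unknown ids decode to the pad token
)

ids = tf.constant([[1, 2], [3, 1]], dtype=tf.int32)
chars = id_to_char.lookup(ids)  # one call decodes the whole batch
print(tf.strings.reduce_join(chars, axis=-1))  # [b'AB' b'CA']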

# Uppercase the label, strip characters outside the supported charset,
# then clip it to at most `max_label_length` characters.
label = tf.strings.upper(label)
label = tf.strings.regex_replace(label, self.unsupported_regex, "")
label = tf.strings.substr(label, 0, self.max_label_length)
@abheesht17 (Collaborator):

Why are we truncating the input to 25 characters?

@sineeli (Collaborator, Author):

While preparing the dataset, the original preprocessing simply drops any datapoint whose label is longer than 25 characters. Instead, I truncate the label, so we can still add the start and end tokens and keep the datapoint; a sketch contrasting the two behaviours follows the reference below.

Ref: https://github.com/baudm/parseq/blob/1902db043c029a7e03a3818c616c06600af574be/strhub/data/dataset.py#L112
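
A toy sketch contrasting the two behaviours, assuming max_label_length = 25 as in the linked code; neither snippet is the PR's actual implementation:

import tensorflow as tf

max_label_length = 25
labels = tf.constant(["SHORT", "X" * 30])

# Upstream behaviour: drop datapoints whose label is too long.
kept = tf.boolean_mask(labels, tf.strings.length(labels) <= max_label_length)

# This PR's behaviour: keep the datapoint but clip the label.
truncated = tf.strings.substr(labels, 0, max_label_length)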

sineeli added 3 commits on March 3, 2025 and 22 more on March 25, 2025.
@sachinprasadhs added the WIP label (work in progress, not ready yet for review) on Apr 11, 2025.