Is it possible to have the model ignore certain tokens? #9339

Onyoursix · 2021-09-30T17:37:01Z


I have documents with a very specific format. The formatting reduces the accuracy of my model unless I remove it, but I want to keep the original format in the final output. Here is an example:

1 This is a sentence with the line number listed. The person named John
2 Doe ran up the hill to get some water.

During tokenization, I would like the line numbers, line breaks, and extra spaces to be ignored by the model, so that it recognizes "John Doe" as a person rather than "John \n 2 Doe", while the original token indices are preserved. The final output will be something similar to displaCy, but in its original format. Is there an easy way to do this, or do I need to build something completely custom that deconstructs the formatting and then reconstructs it after the model has processed the text?
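For reference, the fully custom route described above could look roughly like the sketch below. It is plain Python with no spaCy involved, and it assumes every line starts with a line number followed by spaces or tabs. The names `strip_formatting` and `to_original_span` are made up for this illustration.

```python
import re

# Assumed format: a line number plus spaces/tabs at the start of each line.
# The second alternative matches a line break and any spaces right before it.
FORMATTING = re.compile(r"^\d+[ \t]+|[ \t]*\n", re.MULTILINE)


def strip_formatting(text):
    """Return (clean_text, offsets); offsets[i] is the index in `text` of clean_text[i]."""
    clean, offsets = [], []
    pos = 0
    for match in FORMATTING.finditer(text):
        # Copy the characters between formatting matches unchanged.
        for i in range(pos, match.start()):
            clean.append(text[i])
            offsets.append(i)
        # A line break becomes a single space so words on both sides stay separate.
        if match.group().endswith("\n") and clean and clean[-1] != " ":
            clean.append(" ")
            offsets.append(match.start())
        pos = match.end()
    for i in range(pos, len(text)):
        clean.append(text[i])
        offsets.append(i)
    return "".join(clean), offsets


def to_original_span(start, end, offsets):
    """Map a [start, end) span in the cleaned text back to the original text."""
    return offsets[start], offsets[end - 1] + 1


text = (
    "1 This is a sentence with the line number listed. The person named John\n"
    "2 Doe ran up the hill to get some water."
)
clean, offsets = strip_formatting(text)
start = clean.index("John Doe")
orig_start, orig_end = to_original_span(start, start + len("John Doe"), offsets)
print(repr(text[orig_start:orig_end]))
```

The model would run on `clean`, and any entity span it returns can be mapped back through `offsets` to highlight the matching region in the original, formatted text.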

Onyoursix · 2021-10-01T03:35:49Z


I found a workaround: save the formatting in a custom token extension (https://spacy.io/usage/processing-pipelines#description), then remove the unwanted formatting tokens in the same custom pipeline component.
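The original code was not posted, so the following is only a minimal sketch of how such a component might look in spaCy v3, not the exact solution used. The extension names `orig_idx` and `fmt_before`, the component name `strip_line_formatting`, and the rule for what counts as formatting (whitespace tokens, plus all-digit tokens at the start of a line) are assumptions made for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

# Hypothetical extension names chosen for this sketch.
Token.set_extension("orig_idx", default=None, force=True)  # offset in the raw text
Token.set_extension("fmt_before", default="", force=True)  # formatting dropped before the token


def is_formatting(token, prev):
    """Assumed rule: whitespace tokens and digits at the start of a line are formatting."""
    if token.is_space:
        return True
    at_line_start = prev is None or "\n" in prev.text
    return token.is_digit and at_line_start


@Language.component("strip_line_formatting")
def strip_line_formatting(doc):
    words, spaces, offsets, prefixes = [], [], [], []
    pending = ""  # formatting text seen since the last kept token
    prev = None
    for token in doc:
        if is_formatting(token, prev):
            pending += token.text_with_ws
            if spaces:
                spaces[-1] = True  # keep "John" and "Doe" separated by a space
        else:
            words.append(token.text)
            spaces.append(bool(token.whitespace_))
            offsets.append(token.idx)
            prefixes.append(pending)
            pending = ""
        prev = token
    # Build a new Doc without the formatting tokens and attach the saved data.
    new_doc = Doc(doc.vocab, words=words, spaces=spaces)
    for token, offset, prefix in zip(new_doc, offsets, prefixes):
        token._.orig_idx = offset
        token._.fmt_before = prefix
    return new_doc


# A blank pipeline keeps the sketch self-contained; it has no NER component.
nlp = spacy.blank("en")
nlp.add_pipe("strip_line_formatting", first=True)

text = (
    "1 This is a sentence with the line number listed. The person named John\n"
    "2 Doe ran up the hill to get some water."
)
doc = nlp(text)
doe = [t for t in doc if t.text == "Doe"][0]
print(doc.text)
print(repr(doe._.fmt_before), doe._.orig_idx)
```

With a trained pipeline in place of `spacy.blank("en")`, adding the component with `first=True` means the later components only see the cleaned tokens, while `orig_idx` and `fmt_before` hold what is needed to rebuild the original layout for display.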
