Is it possible to have the model ignore certain tokens? #9339
I have documents with a very specific format, and that formatting reduces the accuracy of my model unless I remove it. However, I do want to keep the original format. Let me give you an example:

1 This is a sentence with the line number listed. The person named John

During tokenization I would like the line numbers, line breaks, and extra spaces to be ignored by the model, so that it recognizes "John Doe" as a person and not "John \n 2 Doe", while maintaining the original token indices. Ultimately the final output will be something similar to displaCy, but in its original format.

I'm wondering if there is an easy way to go about this, or if I need to develop something completely custom that deconstructs the formatting and then reconstructs it after the text has been processed by the model.
Replies: 1 comment
I found a workaround by saving the formatting as a custom token extension (https://spacy.io/usage/processing-pipelines#description) and then removing the unwanted formatting tokens in the same custom pipeline.
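
For anyone who lands here later, the "strip the formatting, then map results back" idea can be sketched in plain Python without any spaCy-specific API. This is only an illustration under assumptions, not the code used above: the `FORMATTING` pattern and the helper names `strip_formatting` and `to_original_span` are hypothetical, and the pattern assumes that every line break is followed by a line number. The cleaned text is what you would hand to the model; the offset list lets you translate character spans the model returns back into the original, formatted text.

```python
import re

# Hypothetical pattern for this sketch: a line break followed by a line number
# (with any surrounding spaces), or a line number at the very start of the text.
FORMATTING = re.compile(r"\s*\n\s*\d+\s+|^\d+\s+")


def strip_formatting(text):
    """Return (clean_text, offsets), where offsets[i] is the index in `text`
    of the character that ended up at clean_text[i]."""
    clean_chars = []
    offsets = []
    pos = 0
    for match in FORMATTING.finditer(text):
        # Copy the ordinary text that precedes this formatting run.
        for i in range(pos, match.start()):
            clean_chars.append(text[i])
            offsets.append(i)
        # Replace the run with one space so words on adjacent lines stay
        # separated; a run at the very start of the text is simply dropped.
        if match.start() > 0:
            clean_chars.append(" ")
            offsets.append(match.start())
        pos = match.end()
    # Copy whatever follows the last formatting run.
    for i in range(pos, len(text)):
        clean_chars.append(text[i])
        offsets.append(i)
    return "".join(clean_chars), offsets


def to_original_span(offsets, start, end):
    """Map a [start, end) span in the clean text to a [start, end) span in the original."""
    return offsets[start], offsets[end - 1] + 1


text = "1 The person named John\n2 Doe signed the form."
clean, offsets = strip_formatting(text)

# Pretend the model found "John Doe" in the cleaned text.
start = clean.index("John Doe")
end = start + len("John Doe")
orig_start, orig_end = to_original_span(offsets, start, end)

print(clean)
print(repr(text[orig_start:orig_end]))
```

The span mapped back to the original still contains the line break and line number, so a displaCy-style highlight can be drawn over the untouched document. If a real line of text can start with a digit, the pattern above would need to be tightened.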