How to create a custom tokenizer for Korean? #9207
-
The existing spaCy tokenizer for Korean uses morpheme segmentation, which is incompatible with the existing Korean UD corpora, which use word segmentation. I therefore want to create a custom word-based tokenizer for Korean, but I'm a little confused about the process.
I tried to rewrite the code for the tokenizer and attached it below. I would be very grateful if someone could let me know whether this is the correct approach.
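For context, the kind of thing I was attempting looks roughly like the sketch below. The registered name `korean_word_tokenizer` and the details are illustrative placeholders, not my exact code:

```python
import spacy
from spacy.tokens import Doc


class KoreanWordTokenizer:
    """Naive whitespace tokenizer so tokens match the word-segmented UD corpora."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Assumes tokens are separated by single spaces; empty strings
        # (e.g. from double spaces) would need extra handling.
        words = text.split(" ")
        spaces = [True] * len(words)
        spaces[-1] = False  # no trailing space after the last token
        return Doc(self.vocab, words=words, spaces=spaces)


# Registered under an illustrative name so it could be referenced from the
# config as [nlp.tokenizer] @tokenizers = "korean_word_tokenizer".
@spacy.registry.tokenizers("korean_word_tokenizer")
def create_korean_word_tokenizer():
    def create_tokenizer(nlp):
        return KoreanWordTokenizer(nlp.vocab)
    return create_tokenizer
```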
-
In the future, please use markdown code blocks (with three backticks on a separate line before and after the code) instead of screenshots; it's much easier to read / search / debug code as text in the forum.

You don't want a whitespace tokenizer OR the default Korean tokenizer, and you don't need to implement a custom tokenizer as shown above. In a config for a Korean UD corpus (you may or may not want both the `tagger` and the `morphologizer`, but this example includes both), it looks like this:

```
[nlp]
lang = "ko"
pipeline = ["tok2vec","tagger","morphologizer","parser"]
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
```

Generate the initial config like this:

```
spacy init config -l ko -p tagger,morphologizer,parser config.cfg
```

and then just edit the tokenizer setting to `spacy.Tokenizer.v1` as shown above.
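If you want to sanity-check that this gives you word-level tokens without running a full training, a quick sketch like the following should work (the sample sentence is just an illustration):

```python
import spacy

# Override the default morpheme-based Korean tokenizer with the rule-based
# spacy.Tokenizer.v1, mirroring the [nlp.tokenizer] setting shown above.
nlp = spacy.blank(
    "ko",
    config={"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}},
)

doc = nlp("안녕하세요, 반갑습니다.")
# Tokens follow whitespace plus punctuation rules, which lines up with the
# word segmentation used in the Korean UD corpora.
print([t.text for t in doc])
```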