How to create a custom tokenizer for Korean? #9207
-
The existing spaCy tokenizer for Korean uses morpheme segmentation, which is incompatible with the existing Korean UD corpora, which use word segmentation. I therefore want to create a custom word-based tokenizer for Korean, but I'm a little confused about the process.
I tried to rewrite the code for the tokenizer and attached it below. I would be very grateful if someone could let me know whether this is the correct approach.
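For context, the kind of thing I was attempting looks roughly like the sketch below. The registered name `korean_word_tokenizer` and the details are illustrative placeholders, not my exact code:

```python
import spacy
from spacy.tokens import Doc


class KoreanWordTokenizer:
    """Naive whitespace tokenizer so tokens match the word-segmented UD corpora."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Assumes tokens are separated by single spaces; empty strings
        # (e.g. from double spaces) would need extra handling.
        words = text.split(" ")
        spaces = [True] * len(words)
        spaces[-1] = False  # no trailing space after the last token
        return Doc(self.vocab, words=words, spaces=spaces)


# Registered under an illustrative name so it could be referenced from the
# config as [nlp.tokenizer] @tokenizers = "korean_word_tokenizer".
@spacy.registry.tokenizers("korean_word_tokenizer")
def create_korean_word_tokenizer():
    def create_tokenizer(nlp):
        return KoreanWordTokenizer(nlp.vocab)
    return create_tokenizer
```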
-
In the future, please use markdown code blocks (with three backticks on a separate line before and after the code) instead of screenshots; it's much easier to read / search / debug code as text in the forum.

You don't want a whitespace tokenizer OR the default Korean tokenizer, and you don't need to implement a custom tokenizer as shown above. In a config for a Korean UD corpus (you may or may not want both the `tagger` and the `morphologizer`, but this example includes both), it looks like this:

```
[nlp]
lang = "ko"
pipeline = ["tok2vec","tagger","morphologizer","parser"]
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
```

Generate the initial config like this:

```
spacy init config -l ko -p tagger,morphologizer,parser config.cfg
```

and then just edit the tokenizer setting to `spacy.Tokenizer.v1` as shown above.
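If you want to sanity-check that this gives you word-level tokens without running a full training, a quick sketch like the following should work (the sample sentence is just an illustration):

```python
import spacy

# Override the default morpheme-based Korean tokenizer with the rule-based
# spacy.Tokenizer.v1, mirroring the [nlp.tokenizer] setting shown above.
nlp = spacy.blank(
    "ko",
    config={"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}},
)

doc = nlp("안녕하세요, 반갑습니다.")
# Tokens follow whitespace plus punctuation rules, which lines up with the
# word segmentation used in the Korean UD corpora.
print([t.text for t in doc])
```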