How to create a custom tokenizer for Korean? #9207

In the future, please use markdown code blocks (with three backticks on a separate line before and after the code) instead of screenshots; code as text is much easier to read, search, and debug in the forum.

You don't want a whitespace tokenizer OR the default Korean tokenizer, and you don't need to implement a custom tokenizer as shown above.

In a config for a Korean UD corpus, the `[nlp]` block looks like this (you may or may not want both the tagger and the morphologizer, but this example includes both):

```ini
[nlp]
lang = "ko"
pipeline = ["tok2vec","tagger","morphologizer","parser"]
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
```

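For reference, the same tokenizer override can also be applied at runtime without a full training config. Here is a minimal sketch, assuming spaCy v3 is installed; the sample sentence is just illustrative:

```python
import spacy

# Build a blank Korean pipeline, but swap the default Korean tokenizer
# for the rule-based spacy.Tokenizer.v1 via a config override.
config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
nlp = spacy.blank("ko", config=config)

# The rule-based tokenizer splits on whitespace plus punctuation rules,
# rather than running morpheme-level segmentation.
doc = nlp("안녕하세요. 만나서 반갑습니다.")
print([token.text for token in doc])
```

When training, the `[nlp]` block above goes into the full `config.cfg` that you pass to `python -m spacy train`.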