
tokenizer.tokenize() issue #83

Closed
472027909 opened this issue Jun 10, 2022 · 1 comment

Comments

@472027909

run_ner_crf.py
In the train branch, why does the from_pretrained() call end with a trailing comma? With it, tokenizer.tokenize() does not split the text character by character, so tokens and label_ids end up with different lengths.
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case, )
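To illustrate the length mismatch being described (the values below are illustrative, not output from this repo's tokenizer): wordpiece-style tokenization can merge several characters into one token, while the labels are assigned per character, so the two sequences diverge in length.

```python
# Illustrative wordpiece output for the text "2022年":
# digits get merged into subword pieces.
tokens = ["20", "##22", "年"]

# Char-level labels assume one label per original character.
chars = ["2", "0", "2", "2", "年"]
label_ids = [0] * len(chars)

# The mismatch that breaks CRF training: 3 tokens vs 5 labels.
print(len(tokens), len(label_ids))        # 3 5
print(len(tokens) == len(label_ids))      # False
```

This is why character-by-character tokenization is needed here: it guarantees one token per character, keeping tokens and label_ids aligned.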

run_ner_span.py
This file calls from_pretrained() without the trailing comma, and tokenizer.tokenize() splits character by character, so tokens and label_ids have equal lengths.
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, do_lower_case=args.do_lower_case)

What is the difference between these two ways of writing it? It feels like a bug.
After fixing it, the biggest remaining problem is that when training on my own dataset with run_ner_crf.py, recall keeps dropping after two epochs. Has the author run into this? Please reply when you have time.
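For reference, a trailing comma inside a Python call's argument list is syntactically insignificant, so the two from_pretrained() calls above pass identical arguments; any tokenization difference must come from elsewhere. A minimal sketch with a hypothetical stand-in function:

```python
# Hypothetical stand-in for tokenizer_class.from_pretrained,
# just echoing the arguments it receives.
def from_pretrained_stub(path, do_lower_case=False):
    return (path, do_lower_case)

a = from_pretrained_stub("bert-base-chinese", do_lower_case=True)
b = from_pretrained_stub("bert-base-chinese", do_lower_case=True,)  # trailing comma

print(a == b)  # True: the trailing comma does not change the call
```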

@lsx0930

lsx0930 commented Jun 26, 2022

#86
