ik fails to output the right offsets if char filters are applied to the input stream #136

GoogleCodeExporter · 2016-02-14T06:26:23Z

Hi Team,
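I found that the ik tokenizer outputs wrong offsets for strings that have been processed by html_filter. The base class of html_filter, BaseCharFilter, holds two arrays, offsets and diffs: the offsets of the tokens after stripping, and the deltas needed to correct them relative to the source string. The ik code (I am using ik2012 FF hotfix1 from Google Code) does not handle these offsets and diffs, so the offsets it outputs are offsets into the processed string with the HTML tags removed, not into the original input. I made a fix on my GitHub and tested it roughly; it seems to work. The main changes are in this GitHub pull request: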

https://github.com/xpandan/ik-analyzer/commit/7cc797ca78399cdae4f31181970e85db28be4e5d
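
To illustrate the offsets/diffs bookkeeping described above, here is a minimal sketch in plain Java with no Lucene dependency. The class `OffsetCorrectionSketch` and its methods are hypothetical names made up for this example; they are not taken from ik or from Lucene, and the tag stripping is deliberately naive. In Lucene itself a tokenizer would normally pass its offsets through `Tokenizer.correctOffset` rather than reimplement this logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free sketch of the offsets/diffs idea; not ik's or Lucene's code.
class OffsetCorrectionSketch {
    // offsets.get(i): position in the filtered text from which diffs.get(i) applies.
    // diffs.get(i): cumulative number of source characters removed up to that position.
    private final List<Integer> offsets = new ArrayList<>();
    private final List<Integer> diffs = new ArrayList<>();
    private final StringBuilder filtered = new StringBuilder();

    // Naively strips everything between '<' and '>' and records one correction per removed tag.
    OffsetCorrectionSketch(String source) {
        int removed = 0;
        int i = 0;
        while (i < source.length()) {
            if (source.charAt(i) == '<') {
                int close = source.indexOf('>', i);
                if (close < 0) {
                    // Unterminated tag: keep the rest as plain text.
                    filtered.append(source, i, source.length());
                    break;
                }
                removed += close - i + 1;
                offsets.add(filtered.length());
                diffs.add(removed);
                i = close + 1;
            } else {
                filtered.append(source.charAt(i));
                i++;
            }
        }
    }

    String filtered() {
        return filtered.toString();
    }

    // Maps an offset in the filtered text back to an offset in the source text.
    // An offset that sits exactly on a removal point is mapped past the removed text.
    int correct(int filteredOffset) {
        int diff = 0;
        for (int k = 0; k < offsets.size(); k++) {
            if (offsets.get(k) <= filteredOffset) {
                diff = diffs.get(k);
            } else {
                break;
            }
        }
        return filteredOffset + diff;
    }

    public static void main(String[] args) {
        String source = "<b>hello</b> world";
        OffsetCorrectionSketch sketch = new OffsetCorrectionSketch(source);
        String text = sketch.filtered();
        int start = text.indexOf("world");
        int end = start + "world".length();
        // The raw offsets index the filtered text; the corrected ones index the source.
        System.out.println(source.substring(sketch.correct(start), sketch.correct(end)));
    }
}
```

A tokenizer that reports `start` and `end` without this correction produces offsets that are only valid for the stripped text, which is the behavior described in this issue.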

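html_strip itself has quite a few bugs, so you can also test with the mapping filter; the principle is the same. Please review my code when you have time. I only started looking into Lucene recently for a project, so any advice is welcome.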


Best,
Dan

Original issue reported on code.google.com by [email protected] on 12 Sep 2014 at 10:34


GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels Feb 14, 2016
