PinyinHelper.toHanYuPinyinString has a separator bug #42

Open
jiangyexin opened this issue Dec 1, 2021 · 1 comment

Comments

@jiangyexin

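Words that are in the polyphone dictionary come out with the separator, but some words that are not in the polyphone dictionary come out without it.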

@zhaiwei3000

I rewrote this method myself: net.sourceforge.pinyin4j.PinyinHelper#toHanYuPinyinString

You can give this a try. It holds up in my own testing.
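/**
 * Replacement for the original toHanYuPinyinString.
 * The original has a bug:
 * for example, converting "一二三" with # as the separator gives yi#ersan, so the # between 二 and 三 is lost.
 * Some separators are dropped and some are not, depending on whether the phrase table has a phrase containing 一.
 * added by zhaiwei5
 * @param str the text to convert
 * @param outputFormat the pinyin output format
 * @param separate the separator placed between output tokens; null is treated as an empty string
 * @param retain whether to keep characters that have no pinyin
 * @return the converted string
 * @throws BadHanyuPinyinOutputFormatCombination
 */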
static public String toHanYuPinyinString(String str, HanyuPinyinOutputFormat outputFormat,
    String separate, boolean retain) throws BadHanyuPinyinOutputFormatCombination {
  ChineseToPinyinResource resource = ChineseToPinyinResource.getInstance();
  // Holds one entry per output token (needs java.util.List and java.util.ArrayList imported)
  List<String> list = new ArrayList<>();
  char[] chars = str.toCharArray();
  for (int i = 0; i < chars.length; i++) {
    char currentChar = chars[i];
    Trie root = resource.getUnicodeToHanyuPinyinTable();
    int index = i;
    // Hex code of the current character
    String hexStr = Integer.toHexString(currentChar).toUpperCase();
    // Trie node for the current character
    Trie nowTrie = root.get(hexStr);
    // Check whether the resource files define a pinyin for it
    if (nowTrie == null || nowTrie.getPinyin() == null) {
      if (retain) {
        list.add(Character.toString(chars[i]));
      }
      // No pinyin defined: skip to the next character
      continue;
    }
    // Longest match found so far
    String result = nowTrie.getPinyin();
    if (i + 1 == chars.length) {
      // Last character: parse its pinyin
      String[] pinyinStrArray = resource.parsePinyinString(result);
      // For a polyphonic character, take the first reading by default
      list.add(PinyinFormatter.formatHanyuPinyin(pinyinStrArray[0], outputFormat));
    } else {
      // Whether a multi-character phrase was matched
      boolean isMulti = false;
      // Child nodes of the current character
      Trie nextMap = nowTrie.getNextTire();
      while (true) {
        if (index + 1 == chars.length) {
          // Matched as far as possible, up to the last character
          break;
        }
        // Take the next character
        char nextChar = chars[++index];
        // The previous character has phrases that continue from it
        if (nextMap != null) {
          // Trie node for the next character
          Trie nextTrie = nextMap.get(Integer.toHexString(nextChar).toUpperCase());
          if (nextTrie == null) {
            break;
          }
          if (nextTrie.getPinyin() != null) {
            // A phrase ends here; keep going to match the longest phrase possible
            result = nextTrie.getPinyin();
            isMulti = true;
            // index was already incremented above
            i = index;
          }
          // Descend to the next level
          nextMap = nextTrie.getNextTire();
        } else {
          break;
        }
      }
      String[] pinyinStrArray = resource.parsePinyinString(result);
      if (!isMulti) {
        // No phrase matched from the current character, so use its own pinyin;
        // for a polyphonic character, take the first reading
        list.add(PinyinFormatter.formatHanyuPinyin(pinyinStrArray[0], outputFormat));
      } else {
        // Phrase: add one entry per syllable
        for (String singlePinyin : pinyinStrArray) {
          list.add(PinyinFormatter.formatHanyuPinyin(singlePinyin, outputFormat));
        }
      }
    }
  }
  // Join once at the end so a separator sits between every pair of entries
  return String.join(separate == null ? "" : separate, list);
}
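
For anyone who wants to try the idea without pulling in pinyin4j, here is a minimal standalone sketch of the same approach: greedy longest-match lookup, one list entry per syllable, and a single join at the end. Everything in it is an assumption made for illustration: the class name SeparatorJoinSketch, the toy Node trie, and the four dictionary entries are hypothetical and are not pinyin4j's actual classes or data. It also skips the "take the first reading of a polyphonic character" step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SeparatorJoinSketch {

    /** Toy trie node (hypothetical, not pinyin4j's Trie). */
    static final class Node {
        String[] syllables;                               // null when no entry ends at this node
        final Map<Character, Node> next = new HashMap<>();
    }

    private final Node root = new Node();

    /** Registers a word or phrase with one syllable per character. */
    public void put(String word, String... syllables) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.syllables = syllables;
    }

    /** Converts text by greedy longest match, then joins all tokens once. */
    public String convert(String text, String separator, boolean retain) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            String[] best = null;   // syllables of the longest entry found from position i
            int bestEnd = i;        // exclusive end index of that entry
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.syllables != null) {
                    best = node.syllables;
                    bestEnd = j + 1;
                }
            }
            if (best == null) {
                // No entry starts here: keep or drop the raw character
                if (retain) {
                    parts.add(String.valueOf(text.charAt(i)));
                }
                i++;
            } else {
                for (String syllable : best) {
                    parts.add(syllable);
                }
                i = bestEnd;
            }
        }
        // Joining at the end puts a separator between every pair of tokens
        return String.join(separator == null ? "" : separator, parts);
    }

    public static void main(String[] args) {
        SeparatorJoinSketch sketch = new SeparatorJoinSketch();
        sketch.put("\u4E00", "yi");               // U+4E00, "one"
        sketch.put("\u4E8C", "er");               // U+4E8C, "two"
        sketch.put("\u4E09", "san");              // U+4E09, "three"
        sketch.put("\u4E00\u4E8C", "yi", "er");   // hypothetical two-character phrase entry
        System.out.println(sketch.convert("\u4E00\u4E8C\u4E09", "#", true)); // prints yi#er#san
    }
}
```

Because the separator is only applied in the final join, it no longer matters whether a syllable came from a phrase entry or a single-character entry.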
