Commit 964eb48
Lol4to
Fix unicode processing +
* As STOP_WORDS are stored in unicode format we should keep our words candidates in unicode also to be able to compare candidates against dictionary correctly
* With some languages, short stopwords are linked to the next word in the sentance with no-breakable-space. To designate those stop words we should support nbsp when tokenizing.
Russian is an example. So this fixes grangier#223 support1 parent 09023ec commit 964eb48
1 file changed
+6
-3
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
28 | 28 | | |
29 | 29 | | |
30 | 30 | | |
| 31 | + | |
31 | 32 | | |
32 | 33 | | |
33 | 34 | | |
| |||
106 | 107 | | |
107 | 108 | | |
108 | 109 | | |
| 110 | + | |
109 | 111 | | |
110 | | - | |
111 | | - | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
112 | 115 | | |
113 | 116 | | |
114 | | - | |
| 117 | + | |
115 | 118 | | |
116 | 119 | | |
117 | 120 | | |
| |||
0 commit comments