
Commit 964eb48

Lol4to committed
Fix unicode processing + nbsp support
* As STOP_WORDS are stored in unicode, we should keep our candidate words in unicode as well, so that candidates can be compared against the dictionary correctly.
* In some languages, short stopwords are linked to the next word in the sentence with a non-breaking space; Russian is an example. To recognize those stopwords, we should support nbsp when tokenizing.

This fixes grangier#223.
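To make the first point concrete, here is a minimal sketch (illustration only, not part of the commit, written as a Python 3 analogue of the Python 2 issue; the stopword values are made up): a candidate returned as an encoded byte string can never compare equal to a stopword stored as unicode text, so the stopword count silently drops to zero.

STOP_WORDS = {u'в', u'и', u'на'}                  # hypothetical entries; the real lists ship with goose's resources

candidate_bytes = u'в'.encode('utf-8')            # what the old remove_punctuation effectively returned
candidate_text = candidate_bytes.decode('utf-8')  # what the fixed version returns

assert candidate_bytes not in STOP_WORDS          # bytes never match the unicode set
assert candidate_text in STOP_WORDS               # decoded text matches as expected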
1 parent 09023ec commit 964eb48


goose/text.py

Lines changed: 6 additions & 3 deletions
@@ -28,6 +28,7 @@
 from goose.utils.encoding import smart_str
 from goose.utils.encoding import DjangoUnicodeDecodeError

+SPACE_SYMBOLS = re.compile(ur'[\s\xa0\t]')
 TABSSPACE = re.compile(r'[\s\t]+')


@@ -106,12 +107,14 @@ def __init__(self, language='en'):
     def remove_punctuation(self, content):
         # code taken form
         # http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
+        translate = lambda data: data.translate(self.TRANS_TABLE, string.punctuation)
         if isinstance(content, unicode):
-            content = content.encode('utf-8')
-        return content.translate(self.TRANS_TABLE, string.punctuation)
+            return translate(content.encode('utf-8')).decode('utf-8')  # Don't forget to decode back if encoded
+        else:
+            return translate(content)

     def candiate_words(self, stripped_input):
-        return stripped_input.split(' ')
+        return re.split(SPACE_SYMBOLS, stripped_input)

     def get_stopword_count(self, content):
         if not content:

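The tokenizing change can be checked in isolation. Below is a minimal sketch (illustration only, Python 3, with a made-up Russian phrase; the commit itself targets Python 2, which is why the diff uses ur'' and unicode). Note that in Python 2 \s does not cover the no-break space unless re.UNICODE is set, which is why \xa0 is listed explicitly in the pattern.

import re

SPACE_SYMBOLS = re.compile(r'[\s\xa0\t]')   # \xa0 is the no-break space (nbsp)

text = u'в\xa0лесу родилась ёлочка'         # the stopword 'в' is glued to the next word with a nbsp

naive = text.split(' ')                     # old behaviour: split on plain spaces only
# ['в\xa0лесу', 'родилась', 'ёлочка'] -- 'в' is never seen as a candidate on its own

aware = [w for w in SPACE_SYMBOLS.split(text) if w]
# ['в', 'лесу', 'родилась', 'ёлочка'] -- 'в' becomes a separate candidate word

assert 'в' not in naive
assert 'в' in aware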