EasyTXT is a set of high and low level modules to help you with text normalization and manipulation.
PLEASE NOTE: EasyTXT is still in alpha stage and certain functionalities could change without deprecation warning, although in the current stage this is less likely and class parameters should remain the same. For now it's discouraged to use it in production (if so, then on your own risk) and it's current usage is for testing purposes only.
Contents
Some of the most important features that EasyTXT does:
- normalizes text
- break text into normalized sentences
- break text into normalized features
- converts HTML to normalized text
- text manipulation (allow or deny sentences, etc.)
- fixes text encoding
- normalizes spaces
- converts html table data into sentences or features
- html table parser which returns dict of column row info
- autocomplete works with any method or function :)
- ...
There are many more features regarding which, please refer to the documentation bellow.
pip install easytxt
easytxt requires Python 3.8+.
In this example lets parse badly structured text and output it into a multiple formats.
Please note that calling multiple formats at the same time won't affect performance since sentences are cached and when calling other formats, cached sentences will be instead used in a process.
>>> from easytxt import parse_text
>>> test_text = ' first sentence... Bad ünicode. HTML entities <3!'
>>> pt = parse_text(test_text)
>>> pt.sentences
['First sentence...', 'Bad ünicode.', 'HTML entities <3!']
Lets just get normalized text.
>>> pt.text
First sentence... Bad ünicode. HTML entities <3!
Here is example how to extract features from text.
>>> test_text = '- color: Black - material: Aluminium. Last Sentence'
>>> pt = parse_text(test_text)
The text parser will try to automatically detect which are regular sentences or
features and show only extracted features when called features
attribute. By
default features would get capitalized in a same way as sentences.
>>> pt.features
[('Color', 'Black'), ('Material', 'Aluminium')]
Return features dictionary instead a list of tuples.
>>> pt.features_dict
{'Color': 'Black', 'Material': 'Aluminium'}
Let's get a value from a specific feature.
>>> pt.feature('color')
Black
We don't need to call ``features`` property first to get value with ``feature`` since this is already done in a background. Features are also cached in a similar way as sentences to increase performance in a case we make multiple calls.
Although regular sentences are ignored when calling features
attr, they can
still be returned when calling sentences
or text
attr.
>>> pt.sentences
['Color: Black.', 'Material: Aluminium.', 'Last Sentence.']
>>> pt.text
Color: Black. Material: Aluminium. Last Sentence.
In this example we will try to parse html text. There is not special parameter for
parse_text
in order to process HTML. Usage is exactly the same as for
regular text
since html
is detected and processed automatically.
>>> test_text = '<p>Some sentence</p> <ul><li>* Easy <b>HD</b> camera </li></ul>'
>>> pt = parse_text(test_text)
>>> pt.sentences
['Some sentence.', 'Easy HD camera.']
One of the best features of using parse_text
on html
is that it can extract
table data into sentences. Lets get more info about this feature through example.
from easytxt import parse_text
test_text_html = '''
<p>Some paragraph demo text.</p>
<table>
<tbody>
<tr>
<td scope="row">Type</td>
<td>Easybook Pro</td>
</tr>
<tr>
<td scope="row">Operating system</td>
<td>etOS</td>
</tr>
</tbody>
</table>
<div>Text after <strong>table</strong>.</div>
'''
tp = parse_text(test_text_html)
print(tp.sentences)
In example above following sentences will be printed.
[
'Some paragraph demo text.',
'Type: Easybook Pro.',
'Operating system: etOS.',
'Text after table.'
]
Although in example we used table without header and with only two columns,
parse_text
can easily handle tables with a header and more than two columns.
Although it can parse table with infinite number of columns, it's not advised to
parse_text
since sentences with table data would become hard to read. To
extract data from a table with more complex structure parse_table
is recommended
to be used since it can return results as a list of dictionaries.
language
If we are parsing text in other language than english then we need to specify language parameter to which language our text belong to in order for sentences to be split properly around abbreviations.
>>> test_text = 'primera oracion? Segunda oración. tercera oración'
>>> pt = parse_text(test_text, language='es')
>>> pt.sentences
['Primera oracion?', 'Segunda oración.', 'Tercera oración.']
Please note that currently only en
and es
language parameter values
are supported. Support for more is coming soon...
css_query
In cases that we provide html string, we can with css_query
parameter
select from which html nodes text would get extracted.
>>> test_text = '<p>Some sentence</p> <ul><li>* Easy <b>HD</b> camera </li></ul>'
>>> pt = parse_text(test_text, css_query='p')
>>> pt.sentences
['Some sentence.']
exclude_css
In cases that we provide html string, we can through exclude_css
parameter
limit from which html nodes would be excluded from parsing.
>>> test_text = '<p>Some sentence</p> <ul><li>* Easy <b>HD</b> camera </li></ul>'
>>> pt = parse_text(test_text, exclude_css=['p', 'b'])
>>> pt.sentences
['Easy camera.']
allow
We can control which sentences we want to get extracted by providing list of
keywords into allow
parameter. Keys are not case sensitive.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> pt = parse_text(test_text, allow=['first', 'third'])
>>> pt.sentences
['First sentence?', 'Third sentence.']
Regex pattern is also supported as parameter value:
>>> pt = parse_text(test_text, allow=[r'\bfirst'])
callow
callow
is similar to allow
but with exception that provided keys
are case sensitive. Regex pattern as key is also supported.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> pt = parse_text(test_text, allow=['First', 'Third'])
>>> pt.sentences
['Third sentence.']
from_allow
We can skip sentences by providing keys in from_allow
parameter.
Keys are not case sensitive and regex pattern is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> pt = parse_text(test_text, from_allow=['second'])
>>> pt.sentences
['Second txt.', 'Third Txt.', 'FOUR txt.']
from_callow
from_callow
is similar to from_allow
but with exception that
provided keys are case sensitive. Regex pattern as key is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> pt = parse_text(test_text, from_callow=['Second'])
>>> pt.sentences
['Second txt.', 'Third Txt.', 'FOUR txt.']
Lets recreate same example as before but with lowercase key.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> pt = parse_text(test_text, from_callow=['second'])
>>> pt.sentences
[]
to_allow
to_allow
is similar to from_allow
but in reverse order. Here
are sentences skipped after provided key is found. Keys are not case
sensitive and regex pattern is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> pt = parse_text(test_text, to_allow=['four'])
>>> pt.sentences
['First txt.', 'Second txt.', 'Third Txt.']
to_callow
to_callow
is similar to to_allow
but with exception that
provided keys are case sensitive. Regex pattern is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> pt = parse_text(test_text, to_callow=['FOUR'])
>>> pt.sentences
['First txt.', 'Second txt.', 'Third Txt.']
Lets recreate same example as before but with lowercase key.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> pt = parse_text(test_text, to_callow=['four'])
>>> pt.sentences
['First txt.', 'Second txt.', 'Third Txt.', 'FOUR txt.']
deny
We can control which sentences we don't want to get extracted by providing
list of keywords into deny
parameter. Keys are not case sensitive and
regex pattern is also supported.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> pt = parse_text(test_text, deny=['first', 'third'])
>>> pt.sentences
['Second sentence.']
cdeny
cdeny
is similar to deny
but with exception that provided keys
are case sensitive. Regex pattern as a key is also supported.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> pt = parse_text(test_text, deny=['First', 'Third'])
>>> pt.sentences
['First sentence?', 'Second sentence.']
normalize
By default parameter normalize
is set to True
. This means that any
bad encoding will be automatically fixed, stops added and line breaks
split into sentences.
>>> from easytxt import parse_text
>>> test_text = ' first sentence... Bad ünicode. HTML entities <3!'
>>> pt = parse_text(test_text)
>>> pt.sentences
['First sentence...', 'Bad ünicode.', 'HTML entities <3!']
Lets try to set parameter normalize
to False
and see what happens.
>>> from easytxt import parse_text
>>> test_text = ' first sentence... Bad ünicode. HTML entities <3!'
>>> pt = parse_text(test_text, normalize=False)
>>> pt.sentences
['First sentence...', 'Bad ünicode.', 'HTML entities <3!']
capitalize
By default all sentences will get capitalized as we can see bellow.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> pt = parse_text(test_text)
>>> pt.sentences
['First sentence?', 'Second sentence.', 'third sentence.']
We can disable this behaviour by setting parameter capitalize
to False
.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> pt = parse_text(test_text, capitalize=False)
>>> pt.sentences
['first sentence?', 'Second sentence.', 'third sentence.']
title
We can set our text output to title by setting parameter title
to True
.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> pt = parse_text(test_text, title=True)
>>> pt.text
'First Sentence? Second Sentence. Third Sentence'
uppercase
We can set our text output to uppercase by setting parameter uppercase
to True
.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> pt = parse_text(test_text, uppercase=True)
>>> pt.sentences
['FIRST SENTENCE?', 'SECOND SENTENCE.', 'THIRD SENTENCE.']
lowercase
We can set our text output to lowercase by setting parameter lowercase
to True
.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> pt = parse_text(test_text, lowercase=True)
>>> pt.text
'first sentence? second sentence. third sentence'
min_chars
By default min_chars
has a value of 5. This means that any sentence that has
less than 5 chars, will be filtered out and not seen at the end result. This
is done to remove ambiguous sentences, especially when extracting text from
html. We can raise or decrease this limit by changing the value of min_chars
.
replace_keys
We can replace all chars in a sentences by providing tuple of search key and
replacement char in a replace_keys
parameter. Regex pattern as key is
also supported and search keys are not case sensitive.
>>> test_text = 'first sentence! - second sentence. Third'
>>> pt = parse_text(test_text, replace_keys=[('third', 'Last'), ('nce!', 'nce?')])
>>> pt.sentences
['First sentence?', 'Second sentence.', 'Last.']
remove_keys
We can remove all chars in sentences by providing list of search keys in a
replace_keys
parameter. Regex pattern as key is also supported and keys
are not case sensitive.
>>> test_text = 'first sentence! - second sentence. Third'
>>> pt = parse_text(test_text, remove_keys=['sentence', '!'])
>>> pt.sentences
['First.', 'Second.', 'Third.']
replace_keys_raw_text
We can replace char values before text is split into sentences. This is
especially useful if we want to fix text before it's parsed and so that
is split into sentences correctly. It accepts regex
as key values in a
tuple
. Please note that keys are not case sensitive and regex as key
is also accepted.
Lets first show default result with badly structured text without
setting keys into replace_keys_raw_text
.
>>> test_text = 'Easybook pro 15 Color: Gray Material: Aluminium'
>>> pt = parse_text(test_text)
>>> pt.sentences
['Easybook pro 15 Color: Gray Material: Aluminium.']
As we can see from the result test text is returned as one sentence
due to missing stop keys (.
) between sentences. Lets fix this by
adding stop keys into unprocessed text before sentence splitting
happens.
>>> test_text = 'Easybook pro 15 Color: Gray Material: Aluminium'
>>> replace_keys = [('Color:', '. Color:'), ('Material:', '. Material:')]
>>> pt = parse_text(test_text, replace_keys_raw_text=replace_keys)
>>> pt.sentences
['Easybook pro 15.', 'Color: Gray.', 'Material: Aluminium.']
remove_keys_raw_text
Works similar as replace_keys_raw_text
, but instead of providing list
of tuples in order to replace chars, here we provide list of chars to remove
keys. Lets try first on a sentence without setting keys to rremove_keys_raw_text
.
Please note that keys are not case sensitive and regex as key is also accepted.
>>> test_text = 'Easybook pro 15. Color: Gray'
>>> pt = parse_text(test_text)
>>> pt.sentences
['Easybook pro 15.', 'Color: Gray.']
Text above due to stop key .
was split into two sentences. Lets prevent this
by removing color and stop key at the same time and get one sentence instead.
>>> test_text = 'Easybook pro 15. Color: Gray'
>>> pt = parse_text(test_text, remove_keys_raw_text=['. color:'])
>>> pt.sentences
['Easybook pro 15 Gray.']
split_inline_breaks
By default text with chars like *
, `` - `` and bullet points would get split
into sentences.
Example:
>>> test_text = '- first param - second param'
>>> pt = parse_text(test_text)
>>> pt.sentences
['First param.', 'Second param.']
In cases when we want to disable this behaviour, we can set parameter
split_inline_breaks
to False
.
>>> test_text = '- first param - second param'
>>> pt = parse_text(test_text, split_inline_breaks=False)
>>> pt.sentences
['- first param - second param.']
Please note that chars like .
, :
, ?
, !
are not considered
as inline breaks.
inline_breaks
In above example we saw how default char breaks by default work. In cases when
we want to split sentences by different char than default one, we can do so by
providing list of chars into inline_breaks
parameter.
>>> test_text = '> first param > second param'
>>> pt = parse_text(test_text, inline_breaks=['>'])
>>> pt.sentences
['First param.', 'Second param.']
Regex pattern is also supported as parameter value:
>>> parse_text(test_text, inline_breaks=[r'\b>'])
stop_key
If a sentence is without a stop key at the end, then by default it
will automatically be appended .
. Let see this in bellow example:
>>> test_text = 'First feature <br> second feature?'
>>> pt = parse_text(test_text)
>>> pt.sentences
['First feature.', 'Second feature?']
We can change our default char .
to a custom one by setting our
desired char in a stop_key
parameter.
>>> test_text = 'First feature <br> second feature?'
>>> pt = parse_text(test_text, stop_key='!')
>>> pt.sentences
['First feature!', 'Second feature?']
sentence_separator
In cases when we want output in text format, we can change how sentences are merged together.
Lets see first default output in example bellow:
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> pt = parse_text(test_text)
>>> pt.text
First sentence? Second sentence. Third sentence.
Behind the scene simple join
on a list of sentences is performed.
Now lets change default value ' '
of sentence_separator
to our
custom one.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> pt = parse_text(test_text, sentence_separator=' > ')
>>> pt.text
First sentence? > Second sentence. > Third sentence.
text_num_to_numeric
We can convert all alpha chars that describe numeric values to actual
numbers by setting text_num_to_numeric
parameter to True
.
>>> test_text = 'First Sentence. Two thousand and three has it. Three Sentences.'
>>> pt = parse_text(test_text, text_num_to_numeric=True)
>>> pt.sentences
['1 Sentence.', '2003 has it.', '3 Sentences.']
If our text is in different language we need to change language value in
our language
parameter. Currently supported languages regarding
text_num_to_numeric
are only en, es, hi and ru
.
For examples bellow we will use following code as basis:
>>> test_text = 'First txt. Second txt.'
>>> pt = parse_text(test_text)
__str__
Normally we would get text by calling text
property:
>>> pt.text
'First txt. Second txt.'
But we can avoid calling text
property by str
casting.
>>> str(pt)
'First txt. Second txt.'
__iter__
Normally we would get sentences by calling sentence
property:
>>> pt.sentences
['First txt.', 'Second txt.']
But we can avoid calling sentence
property and use it directly
in iteration.
>>> [sentence for sentence in pt]
['First txt.', 'Second txt.']
Another alternative:
>>> list(pt)
['First txt.', 'Second txt.']
__add__
>>> pt + 'hello world'
>>> pt.sentences
['First txt.', 'Second txt.', 'Hello World.']
>>> pt + ['Hello', 'World!']
>>> pt.sentences
['First txt.', 'Second txt.', 'Hello', 'World!']
__radd__
>>> 'hello world' + pt
>>> pt.sentences
['Hello World.', 'First txt.', 'Second txt.']
>>> ['Hello', 'World!'] + pt
>>> pt.sentences
['Hello', 'World!', 'First txt.', 'Second txt.', 'Hello World.']
parse_string
is a helper method to normalize and manipulate simple
texts like titles or similar. It's also more performant than parse_text
since it doesn't perform sentence split, capitalization by default ...
Basically it accepts str
, float
, int
and returns normalized string.
In this example lets process text with bad encoding.
>>> from easytxt import parse_string
>>> test_text = 'Easybook Pro 13 <3 ünicode'
>>> parse_string(test_text)
Easybook Pro 13 <3 ünicode
Floats, integers will get transformed to string automatically.
>>> test_int = 123
>>> parse_string(test_text)
'123'
>>> test_float = 123.12
>>> parse_string(test_text)
'123.12'
normalize
As seen in example above, text normalization (bad encoding) is
enabled by default through normalize
parameter. Lets set normalize
parameter to False
to disable text normalization.
>>> test_text = 'Easybook Pro 13 <3 ünicode'
>>> parse_string(test_text)
Easybook Pro 13 <3 ünicode
capitalize
We can capitalize first character in our string if needed by setting
capitalize
parameter to True
. By default is set to False
.
>>> test_text = 'easybook PRO 15'
>>> parse_string(test_text, capitalize=True)
Easybook PRO 15
title
We can set all first chars in a word uppercase while other chars in a word
become lowercase with``title`` parameter set to True
.
>>> test_text = 'easybook PRO 15'
>>> parse_string(test_text, title=True)
Easybook Pro 15
uppercase
We can set all chars in our string to uppercase by uppercase
parameter set to True
.
>>> test_text = 'easybook PRO 15'
>>> parse_string(test_text, uppercase=True)
EASYBOOK PRO 15
lowercase
We can set all chars in our string to lowercase by lowercase
parameter set to True
.
>>> test_text = 'easybook PRO 15'
>>> parse_string(test_text, lowercase=True)
easybook pro 15
replace_keys
We can replace chars/words in a string through replace_chars
parameter.
replace_chars
can accept regex pattern as a lookup key and is not
case sensitive.
>>> test_text = 'Easybook Pro 15'
>>> parse_string(test_text, replace_keys=[('pro', 'Air'), ('15', '13')])
Easybook Air 13
remove_keys
We can remove chars/words in a string through remove_keys
parameter.
remove_keys
can accept regex pattern as a lookup key and is not
case sensitive.
>>> test_text = 'Easybook Pro 15'
>>> parse_string(test_text, remove_keys=['easy', 'pro'])
book 15
split_key
Text can be split by split_key
. By default split index is 0
.
>>> test_text = 'easybook-pro_13'
>>> parse_string(test_text, split_key='-')
easybook
Lets specify split index through tuple.
>>> test_text = 'easybook-pro_13'
>>> parse_string(test_text, split_key=('-', -1))
pro_13
split_keys
split_keys
work in a same way as split_key
but instead of single
split key it accepts list of keys.
>>> test_text = 'easybook-pro_13'
>>> parse_string(test_text, split_keys=[('-', -1), '_'])
pro
take
With take
parameter we can limit maximum number that are shown
at the end result. Lets see how it works in example bellow.
>>> test_text = 'Easybook Pro 13'
>>> parse_string(test_text, take=8)
Easybook
take
With skip
parameter we can skip ignore defined number of chars.
Lets see how it works in example bellow.
>>> test_text = 'Easybook Pro 13'
>>> parse_string(test_text, skip=8)
Pro 13
text_num_to_numeric
We can convert all alpha chars that describe numeric values to actual
numbers by setting text_num_to_numeric
parameter to True
.
>>> test_text = 'two thousand and three words for the first time'
>>> parse_string(test_text, text_num_to_numeric=True)
2003 words for the 1 time
If our text is in different language we need to change language value in
our language
parameter. Currently supported languages are only
en, es, hi and ru
.
fix_spaces
By default all multiple spaces will be removed and left with only single one between chars. Lets test it in our bellow example:
>>> test_text = 'Easybook Pro 15'
>>> parse_string(test_text)
Easybook Pro 15
Now lets change fix_spaces
parameter to False
and see what happens.
>>> test_text = 'Easybook Pro 15'
>>> parse_string(test_text, fix_spaces=False)
Easybook Pro 15
escape_new_lines
By default all new line characters are converted to empty space as we can see in example bellow:
>>> test_text = 'Easybook\nPro\n15'
>>> parse_string(test_text)
Easybook Pro 15
Now lets change escape_new_lines
parameter to False
and see what happens.
>>> test_text = 'Easybook\nPro\n15'
>>> parse_string(test_text, escape_new_lines=False)
Easybook\nPro\n15
new_line_replacement
If escape_new_lines
is set to True
, then by default all new line chars
will be replaced by ' '
as seen in upper example. We can change this
default setting by changing value of new_line_replacement
parameter.
>>> test_text = 'Easybook\nPro\n15'
>>> parse_string(test_text, new_line_replacement='<br>')
Easybook<br>Pro<br>15
add_stop
We can add stop char at the end of the string by setting add_stop
parameter to True
.
>>> test_text = 'Easybook Pro 15'
>>> parse_string(test_text, add_stop=True)
Easybook Pro 15.
By default .
is added but we can provide our custom char if needed. Instead
of setting add_stop
parameter to True
, we can instead of boolean value
provide char as we can see in example bellow.
>>> test_text = 'Easybook Pro 15'
>>> parse_string(test_text, add_stop='!')
Easybook Pro 15!
parse_table
parses/extracts data from HTML
table into various formats
like dict
, list
or just ordinary text
.
Please note that parse_text
already parses html tables but only in
list
or text
format and will extract also text from other nodes
if css
selector is not set directly on table
node.
In following examples we will use two tables. One with a header and one without it.
from easytxt import parse_table
test_text_html = '''
<p>Some paragraph demo text.</p>
<table>
<tbody>
<tr>
<td scope="row">Type</td>
<td>Easybook Pro</td>
</tr>
<tr>
<td scope="row">Operating system</td>
<td>etOS</td>
</tr>
</tbody>
</table>
<div>Text after <strong>table</strong>.</div>
'''
pt = parse_table(test_text_html)
for row in pt:
print(row)
In example above following row data will be printed.
{'Type': 'Easybook Pro'}
{'Operating system': 'etOS'}
Alternatively we can get data also as sentences.
print(pt.sentences)
[
'Type: Easybook Pro',
'Operating system: etOS'
]
Or a text.
print(pt.text)
* Type: Easybook Pro * Operating system: etOS
As we can see, only table html will be extracted and by design other html nodes
are ignored, so that any ambiguous text isn't processed. If header isn't explicitly
specified with a th
or a thead
nodes, then parse_table
will automatically
assume that provided table is without header data and it will take values from first
column as header info.
Lets make a test on a more complex table with a header and multiple columns.
from easytxt import parse_table
test_text_html = '''
<table>
<tr>
<th>Type</th>
<th>OS</th>
<th>Color</th>
</tr>
<tr>
<td>Easybook 15</td>
<td>etOS</td>
<td>Gray</td>
</tr>
<tr>
<td>Easyphone x1</td>
<td>Mobile etOS</td>
<td>Black</td>
</tr>
<tr>
<td>Easywatch abc</td>
<td>Mobile etOS</td>
<td>Blue</td>
</tr>
</table>
'''
pt = parse_table(test_text_html)
for row in pt:
print(row)
In example above following row data will be printed.
{'Type': 'Easybook 15', 'OS': 'etOS', 'Color': 'Gray'}
{'Type': 'Easyphone x1', 'OS': 'Mobile etOS', 'Color': 'Black'}
{'Type': 'Easywatch abc', 'OS': 'Mobile etOS', 'Color': 'Blue'}
Lets get table data printed as sentences.
print(pt.sentences)
[
'Type/OS/Color: Easybook 15/etOS/Gray',
'Type/OS/Color: Easyphone x1/Mobile etOS/Black',
'Type/OS/Color: Easywatch abc/Mobile etOS/Blue'
]
Or a text.
print(pt.text)
* Type/OS/Color: Easybook 15/etOS/Gray * Type/OS/Color: Easyphone x1/Mobile etOS/Black * Type/OS/Color: Easywatch abc/Mobile etOS/Blue
Lets get header keys only. It only works in a table with header nodes.
print(pt.headers)
['Type', 'OS', 'Color']
examples coming soon ... For now please refer to the source code
EasyTXT relies on following libraries in some ways:
- ftfy to fix encoding.
- pyquery to help with html to text conversion.
- number-parser to help with numeric text to number conversion
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Report bugs at https://github.com/sitegroove/easytxt/issues.
If you are reporting a bug, please include:
- Your operating system name and
EasyTXT
package version. - Whole text sample that is being parsed and custom parameters if being set.
- Parsed text result in various formats
text
,senteces
,features
.
Look through the GitHub issues for bugs. Anything tagged with “bug” is open to whoever wants to implement it.
Look through the GitHub issues for features. Anything tagged with “feature” is open to whoever wants to implement it. We encourage you to add new test cases to existing stack.
EasyTXT
could always use more documentation, whether as part of the
official EasyTXT
docs or even on the web in blog posts, articles,
tutorials, and such.
The best way to send feedback is to file an issue at https://github.com/sitegroove/easytxt/issues.
If you are proposing a feature:
- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that contributions are welcome :)
Before you submit a pull request, check that it meets these guidelines:
- The pull request should include tests unless PR contains only changes to docs.
- If the pull request adds functionality, the docs should be updated. Docs currently live in a README.rst file.
- Follow the core developers’ advice which aim to ensure code’s consistency regardless of variety of approaches used by many contributors.
- In case you are unable to continue working on a PR, please leave a short comment to notify us. We will be pleased to make any changes required to get it done.
Note: Contributing section was heavily inspired by dateparser package contributing guidelines.