WIP: Reimplementing search_dates
#945
Open
gavishpoddar wants to merge 44 commits into scrapinghub:master from gavishpoddar:search_dates
Changes from 11 commits
Commits (44):
02220da Implementing new search_dates (gavishpoddar)
f933d3a Fixing DATE_ORDER, implementing deep_search, tests (gavishpoddar)
77727b5 Improving _joint_parse with data_carry accurate_return_text, deep_se… (gavishpoddar)
e7f38e8 implementing _final_text_clean() (gavishpoddar)
962066c Simplifying text_clean and modifying tests (gavishpoddar)
624ac8e Implementing relative date (gavishpoddar)
42ca6f6 Fixing tests (gavishpoddar)
51749a2 secondary_split_implimentation (gavishpoddar)
f5e4635 positional args to keyword argument (gavishpoddar)
121b15f Micro fixes (gavishpoddar)
2cd93f0 Removing code now part of #953 (gavishpoddar)
006d2a5 adding check_settings (gavishpoddar)
10404c9 implementing double_punctuation_split (gavishpoddar)
22596e0 Updating docs and removing test (TMP) (gavishpoddar)
b799dfb cleaning code, adding tests, improving coverage (gavishpoddar)
42c984a Merge branch 'scrapinghub:master' into search_dates (gavishpoddar)
8fc5e0d Improving codecov (gavishpoddar)
74b6ec4 temporary commit to get diff (gavishpoddar)
56e0505 Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
5a1b1c5 temporary file change for review (gavishpoddar)
aa2aa8f reverting the previous commit (gavishpoddar)
41eff6a improvements (gavishpoddar)
f65531b formatting code (gavishpoddar)
982fc08 formatting code (gavishpoddar)
3621b2d improvements in text filter (gavishpoddar)
8a9496b Merge branch 'scrapinghub:master' into search_dates (gavishpoddar)
45996b4 removing previous search_dates (gavishpoddar)
2ac88c6 Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
5dabc62 adding test (gavishpoddar)
ab1778d fixing doc string (gavishpoddar)
14adf89 fixing doc string (gavishpoddar)
d57223a Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
88afa30 updating xfail (gavishpoddar)
9209f3d updating tests (gavishpoddar)
85254e0 Apply suggestions from code review (gavishpoddar)
e4604e6 Merge branch 'master' into search_dates (gavishpoddar)
4f119dd Updates (gavishpoddar)
e6da4be Fixing upstream merges (gavishpoddar)
f6116bf DateSearch -> DateSearchWithDetection (gavishpoddar)
0525cdc Merge branch 'scrapinghub:master' into search_dates (gavishpoddar)
96b91c0 updating test with xfail (gavishpoddar)
b9d12f3 Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
99e66c6 minor fixes (gavishpoddar)
2935aae Merge branch 'master' into search_dates (serhii73)
@@ -0,0 +1,28 @@

    from dateparser.search_dates.search import DateSearch
    from dateparser.conf import apply_settings


    _search_dates = DateSearch()


    @apply_settings
    def search_dates(text, languages=None, settings=None):
        result = _search_dates.search_dates(
            text=text, languages=languages, settings=settings
        )

        dates = result.get('Dates')
        if not dates:
            return None
        return dates


    @apply_settings
    def search_first_date(text, languages=None, settings=None):
        result = _search_dates.search_dates(
            text=text, languages=languages, limit_date_search_results=1, settings=settings
        )
        dates = result.get('Dates')
        if not dates:
            return None
        return dates
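
The two wrappers above share one return contract: whatever `DateSearch.search_dates` puts under `'Dates'` is handed back, and a missing or empty value becomes `None`. A minimal sketch of that contract follows, assuming a hypothetical `StubDateSearch` with canned results in place of the real class (the stub, its data, and `dates_or_none` are illustrative and not part of the PR):

```python
import datetime


class StubDateSearch:
    """Hypothetical stand-in for DateSearch that returns canned results."""

    def __init__(self, canned_dates):
        self._canned_dates = canned_dates

    def search_dates(self, text, languages=None, limit_date_search_results=None, settings=None):
        dates = list(self._canned_dates.get(text, []))
        if limit_date_search_results is not None:
            # Truncate to the requested number of results.
            dates = dates[:limit_date_search_results]
        return {"Language": "en", "Dates": dates}


def dates_or_none(result):
    # Same normalisation as the wrappers: a missing or empty 'Dates' becomes None.
    return result.get("Dates") or None


_TEXT = "4 October 1957 and 12 April 1961"
_stub = StubDateSearch({
    _TEXT: [
        ("4 October 1957", datetime.datetime(1957, 10, 4)),
        ("12 April 1961", datetime.datetime(1961, 4, 12)),
    ],
})

all_dates = dates_or_none(_stub.search_dates(_TEXT))
first_date = dates_or_none(_stub.search_dates(_TEXT, limit_date_search_results=1))
no_dates = dates_or_none(_stub.search_dates("no dates here"))
```

Under this reading, `search_first_date` still returns a list (with at most one tuple) rather than a bare tuple, which callers may want to keep in mind.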
@@ -0,0 +1,39 @@

    from collections.abc import Set

    from dateparser.search.text_detection import FullTextLanguageDetector
    from dateparser.languages.loader import LocaleDataLoader


    class SearchLanguages:
        def __init__(self) -> None:
            self.loader = LocaleDataLoader()
            self.available_language_map = self.loader.get_locale_map()
            self.language = None

        def get_current_language(self, language_shortname):
            if self.language is None or self.language.shortname != language_shortname:
                self.language = self.loader.get_locale(language_shortname)

        def translate_objects(self, language_shortname, text, settings):
            self.get_current_language(language_shortname)
            result = self.language.translate_search(text, settings=settings)
            return result

        def detect_language(self, text, languages):
            if isinstance(languages, (list, tuple, Set)):

                if all([language in self.available_language_map for language in languages]):
                    languages = [self.available_language_map[language] for language in languages]
                else:
                    unsupported_languages = set(languages) - set(self.available_language_map.keys())
                    raise ValueError(
                        "Unknown language(s): %s" % ', '.join(map(repr, unsupported_languages)))
            elif languages is not None:
                raise TypeError("languages argument must be a list (%r given)" % type(languages))

            if languages:
                self.language_detector = FullTextLanguageDetector(languages=languages)
            else:
                self.language_detector = FullTextLanguageDetector(list(self.available_language_map.values()))

            return self.language_detector._best_language(text)
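
The argument handling in `detect_language` can be read in isolation from the detector itself. The sketch below restates it with a hypothetical three-entry locale map and a hypothetical helper name, `resolve_languages`. In the PR the map comes from `LocaleDataLoader().get_locale_map()` and the values are locale objects rather than strings.

```python
from collections.abc import Set

# Hypothetical locale map used only for illustration.
AVAILABLE_LANGUAGES = {"en": "<en locale>", "it": "<it locale>", "es": "<es locale>"}


def resolve_languages(languages, available=AVAILABLE_LANGUAGES):
    """Return the locale objects a detector should consider."""
    if languages is None:
        # No restriction given: consider every available locale.
        return list(available.values())
    if not isinstance(languages, (list, tuple, Set)):
        # A bare string such as "en" is rejected, as in the PR code.
        raise TypeError("languages argument must be a list (%r given)" % type(languages))
    unsupported = sorted(set(languages) - set(available))
    if unsupported:
        raise ValueError("Unknown language(s): %s" % ", ".join(map(repr, unsupported)))
    if not languages:
        # An empty collection behaves like None.
        return list(available.values())
    return [available[code] for code in languages]
```

One consequence worth noting is that an empty list is treated the same as `None`, so the detector falls back to all available locales instead of raising.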
@@ -0,0 +1,211 @@

    import re
    from typing import List, Dict

    from dateparser.conf import apply_settings, Settings
    from dateparser.date import DateDataParser
    from dateparser.search_dates.languages import SearchLanguages

    _drop_words = {'on', 'of'}  # cause annoying false positives
    _bad_date_re = re.compile(
        # whole dates we black-list (can still be parts of valid dates)
        "^("
        + "|".join(
            [
                r"\d{1,3}",  # less than 4 digits
                r"#\d+",  # this is a sequence number
                # some common false positives below
                r"[-/.]+",  # bare separators parsed as current date
                r"\w\.?",  # one letter (with optional dot)
                "an",
            ]
        )
        + ")$"
    )

    _secondary_splitters = [',', '،', '——', '—', '–', '.']  # are used if no date object is found


    def _get_relative_base(already_parsed):
        if already_parsed:
            return already_parsed[-1][1]
        return None


    def _create_splits(text):
        splited_objects = text.split()
        splited_objects = [p for p in splited_objects if p and p not in _drop_words]
        return splited_objects


    def _create_joined_parse(text, max_join=7, sort_ascending=False):
        split_objects = _create_splits(text=text)
        joint_objects = []
        for i in range(len(split_objects)):
            for j in reversed(range(min(max_join, len(split_objects) - i))):
                x = " ".join(split_objects[i:i + j + 1])
                if _bad_date_re.match(x):
                    continue
                if not len(x) > 2:
                    continue
                joint_objects.append(x)

        if sort_ascending:
            joint_objects = sorted(joint_objects, key=len)

        return joint_objects


    def _get_accurate_return_text(text, parser, datetime_object):
        text_candidates = _create_joined_parse(text=text, sort_ascending=True)
        for text_candidate in text_candidates:
            if parser.get_date_data(text_candidate).date_obj == datetime_object:
                return text_candidate


    def _joint_parse(text, parser, translated=None, deep_search=True, accurate_return_text=False, data_carry=None):
        if not text:
            return data_carry

        elif not len(text) > 2:
            return data_carry

        elif translated and len(translated) <= 2:
            return data_carry

        reduced_text_candidate = None
        secondary_split_made = False
        returnable_objects = data_carry or []
        joint_based_search_dates = _create_joined_parse(text=text)
        for date_object_candidate in joint_based_search_dates:
            parsed_date_object = parser.get_date_data(date_object_candidate)
            if parsed_date_object.date_obj:
                if accurate_return_text:
                    date_object_candidate = _get_accurate_return_text(
                        text=date_object_candidate, parser=parser, datetime_object=parsed_date_object.date_obj
                    )

                returnable_objects.append(
                    (date_object_candidate.strip(" .,:()[]-'"), parsed_date_object.date_obj)
                )

                if deep_search:
                    start_index = text.find(date_object_candidate)
                    end_index = start_index + len(date_object_candidate)
                    if start_index < 0:
                        break
                    reduced_text_candidate = text[:start_index] + text[end_index:]
                break
            else:
                for splitter in _secondary_splitters:
                    secondary_split = re.split('(?<! )[' + splitter + ']+(?! )', date_object_candidate)
                    if secondary_split and len(secondary_split) > 1:
                        reduced_text_candidate = " ".join(secondary_split)
                        secondary_split_made = True

        if (deep_search or secondary_split_made) and not text == reduced_text_candidate:
            if reduced_text_candidate and len(reduced_text_candidate) > 2:
                returnable_objects = _joint_parse(
                    text=reduced_text_candidate,
                    parser=parser,
                    data_carry=returnable_objects
                )

        return returnable_objects


    class DateSearch:
        def __init__(self, make_joints_parse=True, default_language="en"):
            self.make_joints_parse = make_joints_parse
            self.default_language = default_language

            self.search_languages = SearchLanguages()

        @apply_settings
        def search_parse(
            self, text, language_shortname, settings, limit_date_search_results=None
        ) -> List[tuple]:

            returnable_objects = []
            parser = DateDataParser(languages=[language_shortname], settings=settings)
            translated, original = self.search_languages.translate_objects(
                language_shortname, text, settings
            )

            for index, original_object in enumerate(original):
                if limit_date_search_results and returnable_objects:
                    if len(returnable_objects) == limit_date_search_results:
                        break

                if not len(original_object) > 2:
                    continue

                if not settings.RELATIVE_BASE:
                    relative_base = _get_relative_base(already_parsed=returnable_objects)
                    if relative_base:
                        parser._settings.RELATIVE_BASE = relative_base

                if self.make_joints_parse:
                    joint_based_search_dates = _joint_parse(
                        text=original_object, parser=parser, translated=translated[index]
                    )
                    if joint_based_search_dates:
                        returnable_objects.extend(joint_based_search_dates)
                else:
                    parsed_date_object = parser.get_date_data(original_object)
                    if parsed_date_object.date_obj:
                        returnable_objects.append(
                            (original_object.strip(" .,:()[]-'"), parsed_date_object.date_obj)
                        )

            parser._settings = Settings()
            return returnable_objects

        @apply_settings
        def search_dates(
            self, text, languages=None, limit_date_search_results=None, settings=None
        ) -> Dict:
            """
            Find all substrings of the given string which represent date and/or time and parse them.

            :param text:
                A string in a natural language which may contain date and/or time expressions.
            :type text: str

            :param languages:
                A list of two letters language codes.e.g. ['en', 'es']. If languages are given, it will not attempt
                to detect the language.
            :type languages: list

            :param limit_date_search_results:
                A int which sets maximum results to be returned.
            :type limit_date_search_results: int

            :param settings:
                Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
            :type settings: dict

            :return: a dict mapping keys to two letter language code and a list of tuples of pairs:
                substring representing date expressions and corresponding :mod:`datetime.datetime` object.
                For example:
                {'Language': 'en', 'Dates': [('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))]}
                If language of the string isn't recognised returns:
                {'Language': None, 'Dates': None}
            :raises: ValueError - Unknown Language
            """

            language_shortname = (
                self.search_languages.detect_language(text=text, languages=languages)
                or self.default_language
            )

            if not language_shortname:
                return {"Language": None, "Dates": None}
            return {
                "Language": language_shortname,
                "Dates": self.search_parse(
                    text=text,
                    language_shortname=language_shortname,
                    limit_date_search_results=limit_date_search_results,
                    settings=settings,
                ),
            }
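
To see which strings `_joint_parse` actually hands to the parser, the candidate generation and the secondary split can be run without dateparser at all. The sketch below is a standalone restatement under stated assumptions: the function names `create_joined_candidates` and `secondary_split` are illustrative, the blacklist regex is the PR's alternation written on one line, and the descending-length order within each start position follows the `reversed(range(...))` loop shown in the diff.

```python
import re

DROP_WORDS = {"on", "of"}
# Same alternatives as _bad_date_re in the diff, collapsed into one pattern.
BAD_DATE_RE = re.compile(r"^(\d{1,3}|#\d+|[-/.]+|\w\.?|an)$")


def create_joined_candidates(text, max_join=7):
    """List word n-grams (longest first per start index) that survive the blacklist."""
    words = [w for w in text.split() if w not in DROP_WORDS]
    candidates = []
    for i in range(len(words)):
        for j in reversed(range(min(max_join, len(words) - i))):
            candidate = " ".join(words[i:i + j + 1])
            if BAD_DATE_RE.match(candidate) or len(candidate) <= 2:
                continue
            candidates.append(candidate)
    return candidates


def secondary_split(candidate, splitter):
    """Split on a run of `splitter` characters that has no space on either side."""
    return re.split("(?<! )[" + splitter + "]+(?! )", candidate)


candidates = create_joined_candidates("on 4 October 1957")
# candidates == ['4 October 1957', '4 October', 'October 1957', 'October', '1957']

tight = secondary_split("10.02.2020,15:00", ",")   # ['10.02.2020', '15:00']
spaced = secondary_split("2020, 15:00", ",")       # ['2020, 15:00'] (comma followed by a space)
```

In this sketch the word `on` is dropped before any candidate is built, so none of the generated candidates starts with it. The bare `4` is rejected by the blacklist, while `1957` passes because it has four digits.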
@@ -0,0 +1,12 @@

    from dateparser.search_dates import search_dates

    # THIS IS TEMPORARY FILE FOR TESTS

    text = """10 Febbraio 2020 15:00 ciao moka"""

    out1 = search_dates(text)
    print(out1)



    # tox -e py -- tests/test_search_dates.py