handle_urls decorator using a new PageObjectRegistry #16
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,122 @@ | ||||||
| .. _`intro-overrides`: | ||||||
|
|
||||||
| Overrides | ||||||
| ========= | ||||||
|
|
||||||
| Overrides contain mapping rules that declare which Page Object is used for | ||||||
| which URLs. The URL matching rules are handled by another library | ||||||
| called `url-matcher <https://url-matcher.readthedocs.io>`_. | ||||||
|
|
||||||
| These matching rules establish the core concept of Overrides: for matching | ||||||
| URLs, a specific Page Object is used in lieu of the original one. | ||||||
|
|
||||||
|
|
||||||
| This enables ``web-poet`` to be used effectively by other frameworks like | ||||||
| `scrapy-poet <https://scrapy-poet.readthedocs.io>`_. | ||||||
|
Comment on lines +6 to +14
Member
todo: think about how we can simplify the description here. There is nothing wrong with the current description, but it was a bit hard for me to understand what it means. It becomes much clearer in the next section.
||||||
|
|
||||||
| Example Use Case | ||||||
BurnzZ marked this conversation as resolved.
|
||||||
| ---------------- | ||||||
|
|
||||||
| Let's explore an example use case for the Overrides concept. | ||||||
|
|
||||||
| Suppose we're using Page Objects for our broadcrawl project which explores | ||||||
| eCommerce websites to discover product pages. We can't create parsers for | ||||||
| every website up front, since we don't know beforehand which sites we're | ||||||
| going to crawl. | ||||||
|
|
||||||
| However, we could at least create a generic Page Object to support parsing of | ||||||
| some fields in well-known locations of product information like ``<title>``. | ||||||
| This enables our broadcrawler to at least parse some useful information. Let's | ||||||
| call this Page Object ``GenericProductPage``. | ||||||
|
|
||||||
| Assuming that one of our project requirements is to fully support parsing of the | ||||||
| `top 3 eCommerce websites`, then we'd need to create a Page Object for each one | ||||||
| to parse more specific fields. | ||||||
|
|
||||||
| Here's where the Overrides concept comes in: | ||||||
|
|
||||||
| 1. The ``GenericProductPage`` is used to parse all eCommerce product pages | ||||||
| `by default`. | ||||||
| 2. Whenever one of our declared URL rules matches with a given page URL, | ||||||
| then the Page Object associated with that rule `overrides (or replaces)` | ||||||
| the default ``GenericProductPage``. | ||||||
|
|
||||||
| This enables us to fine-tune our parsing logic `(which is abstracted away in | ||||||
| each Page Object)` depending on the page we're parsing. | ||||||
|
|
||||||
| Let's see this in action by creating Page Objects below. | ||||||
|
|
||||||
|
|
||||||
| Creating Overrides | ||||||
| ------------------ | ||||||
|
|
||||||
| Let's take a look at how the following code is structured: | ||||||
|
|
||||||
| .. code-block:: python | ||||||
|
|
||||||
| from web_poet import handle_urls | ||||||
| from web_poet.pages import ItemWebPage | ||||||
|
|
||||||
| class GenericProductPage(ItemWebPage): | ||||||
| def to_item(self): | ||||||
| return {"product title": self.css("title::text").get()} | ||||||
|
|
||||||
| @handle_urls("example.com", overrides=GenericProductPage) | ||||||
| class ExampleProductPage(ItemWebPage): | ||||||
| def to_item(self): | ||||||
| ... # more specific parsing | ||||||
|
|
||||||
| @handle_urls("anotherexample.com", overrides=GenericProductPage, exclude="/digital-goods/") | ||||||
| class AnotherExampleProductPage(ItemWebPage): | ||||||
| def to_item(self): | ||||||
| ... # more specific parsing | ||||||
|
|
||||||
| @handle_urls(["dualexample.com", "dualexample.net"], overrides=GenericProductPage) | ||||||
| class DualExampleProductPage(ItemWebPage): | ||||||
| def to_item(self): | ||||||
| ... # more specific parsing | ||||||
|
|
||||||
| The code above declares that: | ||||||
|
|
||||||
| - For sites that match the ``example.com`` pattern, ``ExampleProductPage`` | ||||||
| would be used instead of ``GenericProductPage``. | ||||||
| - The same is true for ``DualExampleProductPage`` where it is used | ||||||
| instead of ``GenericProductPage`` for two URLs: ``dualexample.com`` and | ||||||
|
||||||
| instead of ``GenericProductPage`` for two URLs: ``dualexample.com`` and | |
| instead of ``GenericProductPage`` for two websites: ``dualexample.com`` and |
@kmike I think this could actually say "URL patterns" to improve clarity. I've also updated the examples so that they clearly show these are not simply URLs or website links but patterns that follow the url-matcher syntax. We can take this opportunity to showcase some of the power of url-matcher.
Updated this on: 0a2d779
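To make the default-plus-override rule from the "Example Use Case" section concrete, here is a minimal sketch in plain Python. It is not web-poet's implementation: `Rule`, `domain_matches` and `select_page_object` are hypothetical names, the Page Object classes are empty stand-ins, and the matching is a simplified domain check rather than url-matcher's pattern syntax.

```python
from dataclasses import dataclass
from typing import List, Type
from urllib.parse import urlparse


class GenericProductPage:
    """Stand-in for the default Page Object used when no rule matches."""


class ExampleProductPage:
    """Stand-in for a site-specific Page Object."""


@dataclass
class Rule:
    # Hypothetical rule record; not the rule class defined in this PR.
    domain: str
    use: Type
    instead_of: Type


def domain_matches(url: str, domain: str) -> bool:
    # Assumption: a rule matches the domain itself and any of its subdomains.
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def select_page_object(url: str, default: Type, rules: List[Rule]) -> Type:
    # The first rule that targets `default` and matches the URL wins;
    # otherwise the default Page Object is kept.
    for rule in rules:
        if rule.instead_of is default and domain_matches(url, rule.domain):
            return rule.use
    return default


RULES = [Rule("example.com", ExampleProductPage, GenericProductPage)]

chosen = select_page_object("https://example.com/p/1", GenericProductPage, RULES)
fallback = select_page_object("https://other.org/p/1", GenericProductPage, RULES)
```

In this sketch `chosen` is `ExampleProductPage` and `fallback` stays `GenericProductPage`, which mirrors points 1 and 2 of the documentation's list.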
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,6 +14,7 @@ | |
| author='Scrapinghub', | ||
| author_email='[email protected]', | ||
| url='https://github.com/scrapinghub/web-poet', | ||
| entry_points={'console_scripts': ['web_poet = web_poet.__main__:main']}, | ||
| packages=find_packages( | ||
| exclude=( | ||
| 'tests', | ||
|
|
@@ -22,6 +23,8 @@ | |
| install_requires=( | ||
| 'attrs', | ||
| 'parsel', | ||
| 'url-matcher', | ||
| 'tabulate', | ||
BurnzZ marked this conversation as resolved.
|
||
| ), | ||
| classifiers=( | ||
| 'Development Status :: 2 - Pre-Alpha', | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| """ | ||
| This package is just for overrides testing purposes. | ||
| """ | ||
| from typing import Dict, Any, Callable | ||
|
|
||
| from url_matcher import Patterns | ||
|
|
||
| from web_poet import handle_urls, PageObjectRegistry | ||
|
|
||
|
|
||
| class POBase: | ||
| expected_overrides: Callable | ||
| expected_patterns: Patterns | ||
| expected_meta: Dict[str, Any] | ||
|
|
||
|
|
||
| class POTopLevelOverriden1: | ||
| ... | ||
|
|
||
|
|
||
| class POTopLevelOverriden2: | ||
| ... | ||
|
|
||
|
|
||
| secondary_registry = PageObjectRegistry(name="secondary") | ||
|
|
||
|
|
||
| # This first annotation is ignored. A single annotation per registry is allowed | ||
| @handle_urls("example.com", POTopLevelOverriden1) | ||
| @handle_urls("example.com", POTopLevelOverriden1, exclude="/*.jpg|", priority=300) | ||
| class POTopLevel1(POBase): | ||
| expected_overrides = POTopLevelOverriden1 | ||
| expected_patterns = Patterns(["example.com"], ["/*.jpg|"], priority=300) | ||
| expected_meta = {} # type: ignore | ||
|
|
||
|
|
||
| # The second annotation is for a different registry | ||
| @handle_urls("example.com", POTopLevelOverriden2) | ||
| @secondary_registry.handle_urls("example.org", POTopLevelOverriden2) | ||
| class POTopLevel2(POBase): | ||
| expected_overrides = POTopLevelOverriden2 | ||
| expected_patterns = Patterns(["example.com"]) | ||
| expected_meta = {} # type: ignore |
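The "first annotation is ignored" comment on `POTopLevel1` is easier to follow once you recall that stacked decorators are applied bottom-up, so the decorator closest to the class runs first. The sketch below shows one way such behaviour could arise, assuming a registry that keeps only the first rule recorded per class. `TinyRegistry` and its storage layout are hypothetical and are not the `PageObjectRegistry` added in this PR.

```python
from typing import Any, Callable, Dict, Type


class TinyRegistry:
    """Hypothetical registry that keeps one rule per decorated class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rules: Dict[Type, Dict[str, Any]] = {}

    def handle_urls(
        self, include: str, overrides: Type, **meta: Any
    ) -> Callable[[Type], Type]:
        def wrapper(cls: Type) -> Type:
            # setdefault keeps the rule recorded first and drops later ones.
            self.rules.setdefault(
                cls, {"include": include, "overrides": overrides, "meta": meta}
            )
            return cls

        return wrapper


default_registry = TinyRegistry("default")
secondary = TinyRegistry("secondary")


class Overridden:
    pass


# Decorators run bottom-up: the `secondary` rule is recorded first, then the
# priority=300 rule, and the topmost rule is dropped by `default_registry`.
@default_registry.handle_urls("example.com", Overridden)
@default_registry.handle_urls("example.com", Overridden, priority=300)
@secondary.handle_urls("example.org", Overridden)
class Page:
    pass
```

Each registry keeps its own rule for `Page`, so annotations for different registries do not interfere with each other.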
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| from url_matcher import Patterns | ||
|
|
||
| from tests.po_lib import POBase | ||
| from web_poet import handle_urls | ||
|
|
||
|
|
||
| class POModuleOverriden: | ||
| ... | ||
|
|
||
|
|
||
| @handle_urls("example.com", overrides=POModuleOverriden, extra_arg="foo") | ||
| class POModule(POBase): | ||
| expected_overrides = POModuleOverriden | ||
| expected_patterns = Patterns(["example.com"]) | ||
| expected_meta = {"extra_arg": "foo"} # type: ignore | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| from url_matcher import Patterns | ||
|
|
||
| from tests.po_lib import POBase | ||
| from web_poet import handle_urls | ||
|
|
||
|
|
||
| class PONestedPkgOverriden: | ||
| ... | ||
|
|
||
|
|
||
| @handle_urls(include=["example.com", "example.org"], exclude=["/*.jpg|"], overrides=PONestedPkgOverriden) | ||
| class PONestedPkg(POBase): | ||
| expected_overrides = PONestedPkgOverriden | ||
| expected_patterns = Patterns(["example.com", "example.org"], ["/*.jpg|"]) | ||
| expected_meta = {} # type: ignore |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| from url_matcher import Patterns | ||
|
|
||
| from tests.po_lib import POBase, secondary_registry | ||
| from web_poet import handle_urls | ||
|
|
||
|
|
||
| class PONestedModuleOverriden: | ||
| ... | ||
|
|
||
|
|
||
| class PONestedModuleOverridenSecondary: | ||
| ... | ||
|
|
||
|
|
||
| @handle_urls(include=["example.com", "example.org"], exclude=["/*.jpg|"], overrides=PONestedModuleOverriden) | ||
| @secondary_registry.handle_urls("example.com", PONestedModuleOverridenSecondary) | ||
| class PONestedModule(POBase): | ||
| expected_overrides = PONestedModuleOverriden | ||
| expected_patterns = Patterns(include=["example.com", "example.org"], exclude=["/*.jpg|"]) | ||
| expected_meta = {} # type: ignore | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| import pytest | ||
| from url_matcher import Patterns | ||
|
|
||
| from tests.po_lib import POTopLevel1, POTopLevel2, POTopLevelOverriden2 | ||
| from tests.po_lib.a_module import POModule | ||
| from tests.po_lib.nested_package import PONestedPkg | ||
| from tests.po_lib.nested_package.a_nested_module import ( | ||
| PONestedModule, | ||
| PONestedModuleOverridenSecondary, | ||
| ) | ||
| from web_poet.overrides import find_page_object_overrides, PageObjectRegistry | ||
|
|
||
|
|
||
| POS = {POTopLevel1, POTopLevel2, POModule, PONestedPkg, PONestedModule} | ||
|
|
||
|
|
||
| def test_list_page_objects_from_pkg(): | ||
| """Tests that metadata is extracted properly from the po_lib package""" | ||
| rules = find_page_object_overrides("tests.po_lib") | ||
| assert {po.use for po in rules} == POS | ||
|
|
||
| for rule in rules: | ||
| assert rule.instead_of == rule.use.expected_overrides, rule.use | ||
| assert rule.for_patterns == rule.use.expected_patterns, rule.use | ||
| assert rule.meta == rule.use.expected_meta, rule.use | ||
|
|
||
|
|
||
| def test_list_page_objects_from_module(): | ||
| rules = find_page_object_overrides("tests.po_lib.a_module") | ||
| assert len(rules) == 1 | ||
| rule = rules[0] | ||
| assert rule.use == POModule | ||
| assert rule.for_patterns == POModule.expected_patterns | ||
| assert rule.instead_of == POModule.expected_overrides | ||
|
|
||
|
|
||
| def test_list_page_objects_from_empty_module(): | ||
| rules = find_page_object_overrides("tests.po_lib.an_empty_module") | ||
| assert len(rules) == 0 | ||
|
|
||
|
|
||
| def test_list_page_objects_from_empty_pkg(): | ||
| rules = find_page_object_overrides("tests.po_lib.an_empty_package") | ||
| assert len(rules) == 0 | ||
|
|
||
|
|
||
| def test_list_page_objects_from_unknown_module(): | ||
| with pytest.raises(ImportError): | ||
| find_page_object_overrides("tests.po_lib.unknown_module") | ||
|
|
||
|
|
||
| def test_list_page_objects_from_imported_registry(): | ||
| rules = find_page_object_overrides("tests.po_lib", registry_name="secondary") | ||
| assert len(rules) == 2 | ||
| rule_for = {po.use: po for po in rules} | ||
|
|
||
| potop2 = rule_for[POTopLevel2] | ||
| assert potop2.for_patterns == Patterns(["example.org"]) | ||
| assert potop2.instead_of == POTopLevelOverriden2 | ||
|
|
||
| pones = rule_for[PONestedModule] | ||
| assert pones.for_patterns == Patterns(["example.com"]) | ||
| assert pones.instead_of == PONestedModuleOverridenSecondary | ||
|
|
||
|
|
||
| def test_list_page_objects_from_non_existing_registry(): | ||
| assert find_page_object_overrides("tests.po_lib", registry_name="not-exist") == [] | ||
|
|
||
|
|
||
| def test_cmd(): | ||
| from web_poet.__main__ import main | ||
|
|
||
| assert main(["tests.po_lib"]) is None | ||
|
|
||
|
|
||
| def test_registry_repr(): | ||
| registry = PageObjectRegistry(name="test") | ||
| assert "name='test'" in str(registry) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,3 @@ | ||
| from .pages import WebPage, ItemPage, ItemWebPage, Injectable | ||
| from .page_inputs import ResponseData | ||
| from .page_inputs import ResponseData | ||
| from .overrides import handle_urls, find_page_object_overrides, PageObjectRegistry |