Commit 4e2d850

Merge pull request #88 from scrapinghub/new-web-poet:
Supporting `to_return` in web-poet rules

2 parents: 399371f + 140239a

33 files changed: +2128 / -407 lines

.pre-commit-config.yaml (2 additions, 2 deletions)

```diff
@@ -3,12 +3,12 @@ repos:
   - id: black
     language_version: python3
   repo: https://github.com/ambv/black
-  rev: 22.3.0
+  rev: 22.12.0
 - hooks:
   - id: isort
     language_version: python3
   repo: https://github.com/PyCQA/isort
-  rev: 5.10.1
+  rev: 5.11.5
 - hooks:
   - id: flake8
     language_version: python3
```

CHANGELOG.rst (122 additions, 0 deletions)

A new ``TBR`` entry is added at the top of the changelog, between the
``Changelog`` heading and the ``0.8.0 (2023-01-24)`` section:

```rst
TBR
---

* Added support for item classes which are used as dependencies in page objects
  and spider callbacks. The following is now possible:

  .. code-block:: python

      import attrs
      import scrapy
      from web_poet import WebPage, handle_urls, field
      from scrapy_poet import DummyResponse

      @attrs.define
      class Image:
          url: str

      @handle_urls("example.com")
      class ProductImagePage(WebPage[Image]):
          @field
          def url(self) -> str:
              return self.css("#product img ::attr(href)").get("")

      @attrs.define
      class Product:
          name: str
          image: Image

      @handle_urls("example.com")
      @attrs.define
      class ProductPage(WebPage[Product]):
          # ✨ NEW: Notice that the page object can ask for items as dependencies.
          # An instance of ``Image`` is injected behind the scenes by calling the
          # ``.to_item()`` method of ``ProductImagePage``.
          image_item: Image

          @field
          def name(self) -> str:
              return self.css("h1.name ::text").get("")

          @field
          def image(self) -> Image:
              return self.image_item

      class MySpider(scrapy.Spider):
          name = "myspider"

          def start_requests(self):
              yield scrapy.Request(
                  "https://example.com/products/some-product", self.parse
              )

          # ✨ NEW: Notice that we're directly using the item here and not the
          # page object.
          def parse(self, response: DummyResponse, item: Product):
              return item

  In line with this, the following new features were added:

  * Added a new :class:`scrapy_poet.page_input_providers.ItemProvider` which
    makes the usage above possible.

  * An item class is now supported by :func:`scrapy_poet.callback_for`
    alongside the usual page objects. This means that it no longer raises a
    :class:`TypeError` when it is not passed a subclass of
    :class:`web_poet.pages.ItemPage`.

  * New exception: :class:`scrapy_poet.injection_errors.ProviderDependencyDeadlockError`.
    This is raised when it's not possible to create the dependencies due to
    a deadlock in their sub-dependencies, e.g. due to a circular dependency
    between page objects.

* Moved some of the utility functions from the test module into
  ``scrapy_poet.utils.testing``.

* Documentation improvements.

* Deprecations:

  * The ``SCRAPY_POET_OVERRIDES`` setting has been replaced by
    ``SCRAPY_POET_RULES``.

* Backward incompatible changes:

  * Overriding the default registry via ``SCRAPY_POET_OVERRIDES_REGISTRY``
    is no longer possible.

  * The following type aliases have been removed:

    * ``scrapy_poet.overrides.RuleAsTuple``
    * ``scrapy_poet.overrides.RuleFromUser``

  * The :class:`scrapy_poet.page_input_providers.PageObjectInputProvider` base
    class has these changes:

    * It now accepts an instance of :class:`scrapy_poet.injection.Injector`
      in its constructor instead of :class:`scrapy.crawler.Crawler`. The
      :class:`scrapy.crawler.Crawler` remains accessible via the
      ``Injector.crawler`` attribute.

    * :meth:`scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided`
      is now an instance method instead of a class method.

  * The :class:`scrapy_poet.injection.Injector` attribute and constructor
    parameter called ``overrides_registry`` is now simply called ``registry``.

  * The ``scrapy_poet.overrides`` module, which contained ``OverridesRegistryBase``
    and ``OverridesRegistry``, has been removed. scrapy-poet now directly
    uses :class:`web_poet.rules.RulesRegistry` instead.

    Everything should work pretty much the same, except that
    :meth:`web_poet.rules.RulesRegistry.overrides_for` now accepts :class:`str`,
    :class:`web_poet.page_inputs.http.RequestUrl`, or
    :class:`web_poet.page_inputs.http.ResponseUrl` instead of
    :class:`scrapy.http.Request`.

    * This also means that the registry no longer accepts tuples as rules.
      Only :class:`web_poet.rules.ApplyRule` instances are allowed. The same
      goes for ``SCRAPY_POET_RULES`` (and the deprecated ``SCRAPY_POET_OVERRIDES``).
```
docs/api_reference.rst (1 addition, 7 deletions)

```diff
@@ -14,7 +14,7 @@ API
 Injection Middleware
 ====================
 
-.. automodule:: scrapy_poet.middleware
+.. automodule:: scrapy_poet.downloadermiddlewares
    :members:
 
 Page Input Providers
@@ -43,9 +43,3 @@ Injection errors
 
 .. automodule:: scrapy_poet.injection_errors
    :members:
-
-Overrides
-=========
-
-.. automodule:: scrapy_poet.overrides
-   :members:
```

docs/index.rst (1 addition, 1 deletion)

```diff
@@ -43,7 +43,7 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
    :caption: Advanced
    :maxdepth: 1
 
-   overrides
+   rules-from-web-poet
    providers
    testing
```

docs/intro/basic-tutorial.rst (16 additions, 34 deletions)

```diff
@@ -414,17 +414,17 @@ The spider won't work anymore after the change. The reason is that it
 is using the new base Page Objects and they are empty.
 Let's fix it by instructing ``scrapy-poet`` to use the Books To Scrape (BTS)
 Page Objects for URLs belonging to the domain ``toscrape.com``. This must
-be done by configuring ``SCRAPY_POET_OVERRIDES`` into ``settings.py``:
+be done by configuring ``SCRAPY_POET_RULES`` in ``settings.py``:
 
 .. code-block:: python
 
-    "SCRAPY_POET_OVERRIDES": [
+    "SCRAPY_POET_RULES": [
         ("toscrape.com", BTSBookListPage, BookListPage),
         ("toscrape.com", BTSBookPage, BookPage)
     ]
 
 The spider is back to life!
-``SCRAPY_POET_OVERRIDES`` contain rules that overrides the Page Objects
+``SCRAPY_POET_RULES`` contains rules that override the Page Objects
 used for a particular domain. In this particular case, Page Objects
 ``BTSBookListPage`` and ``BTSBookPage`` will be used instead of
 ``BookListPage`` and ``BookPage`` for any request whose domain is
@@ -465,16 +465,18 @@ to implement new ones:
 
 The last step is configuring the overrides so that these new Page Objects
 are used for the domain
-``bookpage.com``. This is how ``SCRAPY_POET_OVERRIDES`` should look like into
+``bookpage.com``. This is how ``SCRAPY_POET_RULES`` should look in
 ``settings.py``:
 
 .. code-block:: python
 
-    "SCRAPY_POET_OVERRIDES": [
-        ("toscrape.com", BTSBookListPage, BookListPage),
-        ("toscrape.com", BTSBookPage, BookPage),
-        ("bookpage.com", BPBookListPage, BookListPage),
-        ("bookpage.com", BPBookPage, BookPage)
+    from web_poet import ApplyRule
+
+    "SCRAPY_POET_RULES": [
+        ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
+        ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage),
+        ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
+        ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage)
     ]
 
 The spider is now ready to extract books from both sites 😀.
@@ -490,27 +492,6 @@ for a particular domain, but more complex URL patterns are also possible.
 For example, the pattern ``books.toscrape.com/cataloge/category/``
 is accepted and it would restrict the override only to category pages.
 
-It is even possible to configure more complex patterns by using the
-:py:class:`web_poet.rules.ApplyRule` class instead of a triplet in
-the configuration. Another way of declaring the earlier config
-for ``SCRAPY_POET_OVERRIDES`` would be the following:
-
-.. code-block:: python
-
-    from url_matcher import Patterns
-    from web_poet import ApplyRule
-
-
-    SCRAPY_POET_OVERRIDES = [
-        ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
-        ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
-        ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
-        ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
-    ]
-
-As you can see, this could get verbose. The earlier tuple config simply offers
-a shortcut to be more concise.
-
 .. note::
 
     Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
@@ -530,11 +511,11 @@ and store the :py:class:`web_poet.rules.ApplyRule` for you. All of the
     # rules from other packages. Otherwise, it can be omitted.
     # More info about this caveat on web-poet docs.
     consume_modules("external_package_A", "another_ext_package.lib")
-    SCRAPY_POET_OVERRIDES = default_registry.get_rules()
+    SCRAPY_POET_RULES = default_registry.get_rules()
 
 For more info on this, you can refer to these docs:
 
-* ``scrapy-poet``'s :ref:`overrides` Tutorial section.
+* ``scrapy-poet``'s :ref:`rules-from-web-poet` Tutorial section.
 * External `web-poet`_ docs.
 
   * Specifically, the :external:ref:`rules-intro` Tutorial section.
@@ -545,7 +526,8 @@ Next steps
 Now that you know how ``scrapy-poet`` is supposed to work, what about trying to
 apply it to an existing or new Scrapy project?
 
-Also, please check the :ref:`overrides` and :ref:`providers` sections as well as
-refer to spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
+Also, please check the :ref:`rules-from-web-poet` and :ref:`providers` sections
+as well as refer to spiders in the "example" folder:
+https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
 
 .. _Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
```
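The rule lookup that the tutorial configures through ``SCRAPY_POET_RULES`` can be approximated in plain Python. This sketch is illustrative only: the real matching is delegated to the url-matcher library and supports richer patterns than the bare domain check used here, and all class names are hypothetical.

```python
from urllib.parse import urlparse


class BookPage: ...        # base page object requested by the spider
class BTSBookPage: ...     # Books To Scrape implementation
class BPBookPage: ...      # bookpage.com implementation


# Simplified stand-in for a list of web_poet.ApplyRule objects:
# (pattern, use, instead_of) triplets, as in the tutorial.
RULES = [
    ("toscrape.com", BTSBookPage, BookPage),
    ("bookpage.com", BPBookPage, BookPage),
]


def page_object_for(url: str, requested: type) -> type:
    """Return the page-object class to use when `requested` is asked
    for on this URL, falling back to `requested` when no rule matches."""
    domain = urlparse(url).netloc
    for pattern, use, instead_of in RULES:
        # Match the registered domain or any of its subdomains.
        if instead_of is requested and (
            domain == pattern or domain.endswith("." + pattern)
        ):
            return use
    return requested


print(page_object_for("http://books.toscrape.com/catalogue/x", BookPage).__name__)
```

The point of the indirection is that spider code keeps depending on the generic ``BookPage`` while the rules decide, per site, which concrete implementation actually runs.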
