Merged

80 commits
a638aec
initial integration of to_return from web_poet
BurnzZ Oct 12, 2022
ee30808
fix tests regarding expectations for param in rule
BurnzZ Oct 13, 2022
0452173
warn the user when the same URL pattern is present in the rule
BurnzZ Oct 13, 2022
e51a63d
add test case for when 'instead_of' and 'to_return' are both present
BurnzZ Oct 19, 2022
6c55de0
simplify tests and assert injected dependencies in the callback
BurnzZ Oct 31, 2022
3117530
add test case focusing on URL presence in the rules
BurnzZ Nov 1, 2022
3a69c83
properly test UndeclaredProvidedTypeError
BurnzZ Nov 1, 2022
a38cb06
refactor solution to resolve item dependencies using providers
BurnzZ Nov 3, 2022
4134457
fix typing for callback_for()
BurnzZ Nov 3, 2022
213549a
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Nov 22, 2022
9a00b63
move test utilies into scrapy_poet/utils/
BurnzZ Nov 23, 2022
49136cb
create recursive dependency resolution
BurnzZ Nov 24, 2022
a2260d7
add more test cases
BurnzZ Nov 29, 2022
9816f42
update ItemProvider to dynamically handle its dependency signature
BurnzZ Nov 30, 2022
86b7a97
code cleanup and fix some tests
BurnzZ Nov 30, 2022
7b8c7f2
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Nov 30, 2022
20f51a6
detect and raise errors on deadlocks
BurnzZ Nov 30, 2022
4b60fa9
fix failing injector test
BurnzZ Nov 30, 2022
caa1be6
ensure that provider dependencies are cached
BurnzZ Nov 30, 2022
ae05e90
modify deadlock detection to a simple try-except
BurnzZ Dec 1, 2022
d6a33a4
fix failing test_injection.py tests
BurnzZ Dec 1, 2022
a4cff73
ensure that .to_item() methods are only called once
BurnzZ Dec 1, 2022
6bc839f
add a test with a deeper dependency tree
BurnzZ Dec 1, 2022
4aedf16
test duplicate dependencies
BurnzZ Dec 1, 2022
56028d7
fix missing tests and imports
BurnzZ Dec 1, 2022
41ff13e
deprecate passing tuples in SCRAPY_POET_OVERRIDES and the Registry wi…
BurnzZ Dec 2, 2022
2ec6414
refactor Injector to simplify recursive dependency resolution of items
BurnzZ Dec 5, 2022
f3fb32d
polish code and tests
BurnzZ Dec 6, 2022
544236f
fix failing mypy and polish code
BurnzZ Dec 6, 2022
29f40ab
update CHANGELOG with new item class support
BurnzZ Dec 6, 2022
66f0c90
fix typo in CHANGELOG
BurnzZ Dec 6, 2022
2697ab0
improve test_web_poet_rules.py
BurnzZ Dec 6, 2022
35b0c8d
polishing comments and typing
BurnzZ Dec 9, 2022
d2beaf8
mention backward incompatible changes in CHANGELOG
BurnzZ Dec 12, 2022
d046903
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Dec 16, 2022
8f1450a
deprecate some settings, modules, and parameters to be overrides-agno…
BurnzZ Dec 16, 2022
6f0d36e
update documentation in line with the new Item Return functionality
BurnzZ Dec 16, 2022
77cf77c
update tutorial with more explanation on how Item Return works
BurnzZ Dec 16, 2022
efbdb66
update CHANGELOG to mention other backward incompatible changes
BurnzZ Dec 21, 2022
9b4cd48
add and improve docstrings, typing, and warning msgs
BurnzZ Dec 21, 2022
5d2f0f9
move some functions to new scrapy_poet.utils.testing module
BurnzZ Dec 21, 2022
58577a8
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Dec 21, 2022
afc04e9
Apply improvements from code review
BurnzZ Dec 21, 2022
4141239
prioritize newer settings than deprecated ones
BurnzZ Dec 21, 2022
dae69d8
simplify to_return doc example
BurnzZ Dec 22, 2022
ccfa9ea
fix and improve docs
BurnzZ Dec 23, 2022
e9bb33d
use DummyResponse on some examples
BurnzZ Dec 23, 2022
3667cc3
remove obsolete test
BurnzZ Dec 23, 2022
22c959d
Polish CHANGELOG from review
BurnzZ Jan 3, 2023
545e8f1
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 3, 2023
83e0e84
fix missing imports in tests
BurnzZ Jan 3, 2023
47f213c
rename 'item type' → 'item class'
BurnzZ Jan 3, 2023
914a334
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 4, 2023
6af1061
Fix conflicts; Merge branch 'new-web-poet' of ssh://github.com/scrapi…
BurnzZ Jan 4, 2023
190e3a6
use web-poet's _create_deprecated_class
BurnzZ Jan 6, 2023
2611199
remove incorrect line in CHANGELOG
BurnzZ Jan 6, 2023
7bd6783
remove scrapy-poet registry in lieu of web-poet's registry
BurnzZ Jan 10, 2023
3c6fdae
avoid using RulesRegistry.search() since it's slow
BurnzZ Jan 10, 2023
ef01f11
add test to check higher priority of PO subclass
BurnzZ Jan 10, 2023
f41b5c2
Merge pull request #103 from scrapinghub/to-return-override-docs
BurnzZ Jan 10, 2023
c658317
use RulesRegistry.search() again after optimizing it
BurnzZ Jan 10, 2023
3e852d7
fix doc grammar
BurnzZ Jan 11, 2023
4d25d8c
mark tests as xfail if it raises UndeclaredProvidedTypeError
BurnzZ Jan 13, 2023
e184c6f
better tests for clashing rules due to independent page objects with …
BurnzZ Jan 13, 2023
bf9b7bf
fix misleading class names
BurnzZ Jan 13, 2023
33a0391
add more tests on deadlock detection
BurnzZ Jan 13, 2023
141c495
use new web-poet==0.7.0
BurnzZ Jan 18, 2023
3d464e6
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 18, 2023
8c410fe
fixed merge conflicts in CHANGELOG
BurnzZ Jan 18, 2023
00d5dd6
improve docs on settings
BurnzZ Jan 19, 2023
fd31c93
Merge branch 'master' into new-web-poet
BurnzZ Jan 19, 2023
199c46b
fix conflict in code
BurnzZ Jan 19, 2023
7c1f5f1
add test for checking deprecated SCRAPY_POET_OVERRIDES
BurnzZ Jan 19, 2023
44c6e60
add test when requesting an item but no page object
BurnzZ Jan 19, 2023
4791576
issue a warning when can't provide a page object or item for a given URL
BurnzZ Jan 19, 2023
e3b7a8e
remove support for custom registry via SCRAPY_POET_OVERRIDES_REGISTRY
BurnzZ Jan 19, 2023
0915b00
re-organize CHANGELOG
BurnzZ Jan 19, 2023
a46b1e2
fix some docs and comments for clarity
BurnzZ Jan 30, 2023
774619c
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 30, 2023
140239a
bump tool versions to fix CI failure
kmike Jan 30, 2023
96 changes: 95 additions & 1 deletion CHANGELOG.rst
@@ -5,7 +5,101 @@ Changelog
TBR
---

* Official support for Python 3.11
This release enables scrapy-poet to fully support item classes as dependencies
in page objects and spider callbacks. The following is now possible:

.. code-block:: python

import attrs
import scrapy
from web_poet import WebPage, handle_urls, field
from scrapy_poet import DummyResponse

@attrs.define
class Image:
url: str

@handle_urls("example.com")
class ProductImagePage(WebPage[Image]):
@field
def url(self) -> str:
            return self.css("#product img ::attr(src)").get("")

@attrs.define
class Product:
name: str
image: Image

@handle_urls("example.com")
@attrs.define
class ProductPage(WebPage[Product]):
# ✨ NEW: Notice that the page object can ask for items as dependencies.
# An instance of ``Image`` is injected behind the scenes by calling the
# ``.to_item()`` method of ``ProductImagePage``.
image_item: Image

@field
def name(self) -> str:
return self.css("h1.name ::text").get("")

@field
def image(self) -> Image:
return self.image_item

class MySpider(scrapy.Spider):
name = "myspider"

def start_requests(self):
yield scrapy.Request(
"https://example.com/products/some-product", self.parse
)

# ✨ NEW: Notice that we're directly using the item here and not the
# page object.
def parse(self, response: DummyResponse, item: Product):
return item

In line with this, the following changes were made:

* Added a new ``scrapy_poet.page_input_providers.ItemProvider`` which makes
the usage above possible.
* Multiple changes to the ``scrapy_poet.PageObjectInputProvider`` base class
which are backward incompatible:

  * It now accepts an instance of ``scrapy_poet.injection.Injector`` in its
    constructor instead of ``scrapy.crawler.Crawler``, although the crawler
    remains accessible via the ``Injector.crawler`` attribute (see the
    provider sketch after this list).
* ``is_provided()`` is now an instance method instead of a class
method.

* ``scrapy_poet.callback_for`` now accepts an item class alongside the usual
  page object classes, so it no longer raises a ``TypeError`` when its
  argument is not a subclass of ``web_poet.ItemPage`` (see the
  ``callback_for`` sketch after this list).
* ``scrapy_poet.overrides.OverridesRegistry`` has been overhauled:

  * It now subclasses ``web_poet.RulesRegistry``, giving direct access to its
    registry methods.
* It now allows retrieval of rules based on the returned item class.
* ``OverridesRegistry`` (alongside ``SCRAPY_POET_OVERRIDES``) won't
accept tuples as rules anymore. Only ``web_poet.ApplyRule``
instances are allowed.

* As a result, the following type aliases have been removed:
``scrapy_poet.overrides.RuleAsTuple`` and
    ``scrapy_poet.overrides.RuleFromUser``.
* These changes are backward incompatible.

* New exception: ``scrapy_poet.injection_errors.ProviderDependencyDeadlockError``.
  This is raised when it's not possible to create the dependencies due to a
  deadlock in their sub-dependencies, e.g. a circular dependency between page
  objects (see the deadlock sketch after this list).
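
To make the provider constructor change concrete, here is a minimal sketch of
a custom provider under the new API (``MyCustomType`` and ``MyProvider`` are
hypothetical names used only for illustration):

.. code-block:: python

    from scrapy_poet.page_input_providers import PageObjectInputProvider


    class MyCustomType:
        pass


    class MyProvider(PageObjectInputProvider):
        provided_classes = {MyCustomType}

        def __init__(self, injector):
            # The provider now receives the Injector instead of the Crawler.
            super().__init__(injector)
            # The crawler (settings, stats, etc.) remains reachable:
            self.crawler = injector.crawler

        def __call__(self, to_provide):
            # Build and return one instance per requested class.
            return [MyCustomType()]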
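
A sketch of ``callback_for`` with an item class, reusing the ``Product`` item
from the example at the top (the spider and URL are made up):

.. code-block:: python

    import scrapy

    from scrapy_poet import callback_for


    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        # The generated callback receives the already-built Product item
        # and yields it; no page object or .to_item() call in the spider.
        parse_product = callback_for(Product)

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/products/some-product", self.parse_product
            )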
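
And a minimal sketch of the kind of cycle that triggers the deadlock error,
mirroring the chicken-and-egg example from its docstring (both page objects
are hypothetical):

.. code-block:: python

    import attrs
    from web_poet import ItemPage


    @attrs.define
    class ChickenPage(ItemPage):
        # Requires EggPage, which in turn requires ChickenPage: neither
        # can be built first, so the injector raises
        # ProviderDependencyDeadlockError instead of recursing forever.
        egg_page: "EggPage"


    @attrs.define
    class EggPage(ItemPage):
        chicken_page: ChickenPage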

Other changes:

* Moved some of the utility functions from the test module into
``scrapy_poet.utils.testing``.
* Official support for Python 3.11

0.6.0 (2022-11-24)
------------------
31 changes: 6 additions & 25 deletions docs/intro/basic-tutorial.rst
@@ -432,11 +432,13 @@ are used for the domain

.. code-block:: python

from web_poet import ApplyRule

"SCRAPY_POET_OVERRIDES": [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage),
("bookpage.com", BPBookListPage, BookListPage),
("bookpage.com", BPBookPage, BookPage)
ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage),
ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage)
]

The spider is now ready to extract books from both sites 😀.
@@ -452,27 +454,6 @@ for a particular domain, but more complex URL patterns are also possible.
For example, the pattern ``books.toscrape.com/catalogue/category/``
is accepted and would restrict the override to category pages only.
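
A rule with such a pattern could look like this (a sketch reusing the page
objects defined earlier in this tutorial):

.. code-block:: python

    ApplyRule(
        "books.toscrape.com/catalogue/category/",
        use=BTSBookListPage,
        instead_of=BookListPage,
    )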

It is even possible to configure more complex patterns by using the
:py:class:`web_poet.rules.ApplyRule` class instead of a triplet in
the configuration. Another way of declaring the earlier config
for ``SCRAPY_POET_OVERRIDES`` would be the following:

.. code-block:: python

from url_matcher import Patterns
from web_poet import ApplyRule


SCRAPY_POET_OVERRIDES = [
ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
]

As you can see, this could get verbose. The earlier tuple config simply offers
a shortcut to be more concise.

.. note::

Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
4 changes: 2 additions & 2 deletions docs/overrides.rst
@@ -13,9 +13,9 @@ page.
Some real-world examples on this topic can be found in:

- `Example 1 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_01.py>`_:
rules using tuples
shorter example
- `Example 2 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_02.py>`_:
rules using tuples and :py:class:`web_poet.ApplyRule`
longer example
- `Example 3 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_03.py>`_:
rules using :py:func:`web_poet.handle_urls` decorator and retrieving them
via :py:meth:`web_poet.rules.RulesRegistry.get_rules`
6 changes: 3 additions & 3 deletions example/example/spiders/books_04_overrides_01.py
@@ -6,7 +6,7 @@
The default configured PO logic contains the logic for books.toscrape.com
"""
import scrapy
from web_poet import WebPage
from web_poet import ApplyRule, WebPage

from scrapy_poet import callback_for

@@ -51,8 +51,8 @@ class BooksSpider(scrapy.Spider):
    # Configuring different page objects for the bookpage.com domain
custom_settings = {
"SCRAPY_POET_OVERRIDES": [
("bookpage.com", BPBookListPage, BookListPage),
("bookpage.com", BPBookPage, BookPage),
ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage),
]
}

18 changes: 4 additions & 14 deletions example/example/spiders/books_04_overrides_02.py
@@ -7,7 +7,6 @@
at all is applied.
"""
import scrapy
from url_matcher import Patterns
from web_poet import WebPage
from web_poet.rules import ApplyRule

@@ -63,19 +62,10 @@ class BooksSpider(scrapy.Spider):
    # Configuring different page objects for different domains
custom_settings = {
"SCRAPY_POET_OVERRIDES": [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage),
# We could also use the long-form version if we want to.
ApplyRule(
for_patterns=Patterns(["bookpage.com"]),
use=BPBookListPage,
instead_of=BookListPage,
),
ApplyRule(
for_patterns=Patterns(["bookpage.com"]),
use=BPBookPage,
instead_of=BookPage,
),
ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage),
ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage),
]
}

1 change: 1 addition & 0 deletions pyproject.toml
@@ -9,6 +9,7 @@ multi_line_output = 3
module = [
"tests.test_cache.*",
"tests.test_downloader.*",
"tests.test_web_poet_rules.*",
"tests.test_scrapy_dependencies.*",
]
# Ignore this type of error since mypy expects an Iterable return
29 changes: 17 additions & 12 deletions scrapy_poet/api.py
@@ -29,8 +29,9 @@ def __init__(self, url: str, request=Optional[Request]):
super().__init__(url=url, request=request)


def callback_for(page_cls: Type[ItemPage]) -> Callable:
"""Create a callback for an :class:`web_poet.pages.ItemPage` subclass.
def callback_for(page_or_item_cls: Type) -> Callable:
"""Create a callback for an :class:`web_poet.pages.ItemPage` subclass or an
item class.

The generated callback returns the output of the
``ItemPage.to_item()`` method, i.e. extracts a single item
@@ -104,24 +105,28 @@ def parse(self, response):
disk queues, because in this case Scrapy is able to serialize
your request object.
"""
if not issubclass(page_cls, ItemPage):
raise TypeError(f"{page_cls.__name__} should be a subclass of ItemPage.")

# When the callback is used as an instance method of the spider, it expects
# to receive 'self' as its first argument. When used as a simple inline
# function, it expects to receive a response as its first argument.
#
# To avoid a TypeError, we need to receive a list of unnamed arguments and
# a dict of named arguments after our injectable.
def parse(*args, page: page_cls, **kwargs): # type: ignore
yield page.to_item() # type: ignore
if issubclass(page_or_item_cls, ItemPage):

def parse(*args, page: page_or_item_cls, **kwargs): # type: ignore
yield page.to_item() # type: ignore

async def async_parse(*args, page: page_or_item_cls, **kwargs): # type: ignore
yield await page.to_item() # type: ignore

if iscoroutinefunction(page_or_item_cls.to_item):
setattr(async_parse, _CALLBACK_FOR_MARKER, True)
return async_parse

async def async_parse(*args, page: page_cls, **kwargs): # type: ignore
yield await page.to_item() # type: ignore
else:

if iscoroutinefunction(page_cls.to_item):
setattr(async_parse, _CALLBACK_FOR_MARKER, True)
return async_parse
def parse(*args, item: page_or_item_cls, **kwargs): # type:ignore
yield item

setattr(parse, _CALLBACK_FOR_MARKER, True)
return parse
2 changes: 2 additions & 0 deletions scrapy_poet/downloadermiddlewares.py
@@ -17,6 +17,7 @@
from .page_input_providers import (
HttpClientProvider,
HttpResponseProvider,
ItemProvider,
PageParamsProvider,
RequestUrlProvider,
ResponseUrlProvider,
@@ -31,6 +32,7 @@
PageParamsProvider: 700,
RequestUrlProvider: 800,
ResponseUrlProvider: 900,
ItemProvider: 1000,
}

InjectionMiddlewareTV = TypeVar("InjectionMiddlewareTV", bound="InjectionMiddleware")
2 changes: 1 addition & 1 deletion scrapy_poet/injection.py
@@ -57,7 +57,7 @@ def load_providers(self, default_providers: Optional[Mapping] = None): # noqa:
}
provider_classes = build_component_list(providers_dict)
logger.info(f"Loading providers:\n {pprint.pformat(provider_classes)}")
self.providers = [load_object(cls)(self.crawler) for cls in provider_classes]
self.providers = [load_object(cls)(self) for cls in provider_classes]
check_all_providers_are_callable(self.providers)
# Caching whether each provider requires the scrapy response
self.is_provider_requiring_scrapy_response = {
12 changes: 12 additions & 0 deletions scrapy_poet/injection_errors.py
@@ -12,3 +12,15 @@ class UndeclaredProvidedTypeError(InjectionError):

class MalformedProvidedClassesError(InjectionError):
pass


class ProviderDependencyDeadlockError(InjectionError):
"""This is raised when it's not possible to create the dependencies due to
deadlock.

For example:
- Page object named "ChickenPage" require "EggPage" as a dependency.
- Page object named "EggPage" require "ChickenPage" as a dependency.
"""

pass