Merged

80 commits
a638aec
initial integration of to_return from web_poet
BurnzZ Oct 12, 2022
ee30808
fix tests regarding expectations for param in rule
BurnzZ Oct 13, 2022
0452173
warn the user when the same URL pattern is present in the rule
BurnzZ Oct 13, 2022
e51a63d
add test case for when 'instead_of' and 'to_return' are both present
BurnzZ Oct 19, 2022
6c55de0
simplify tests and assert injected dependencies in the callback
BurnzZ Oct 31, 2022
3117530
add test case focusing on URL presence in the rules
BurnzZ Nov 1, 2022
3a69c83
properly test UndeclaredProvidedTypeError
BurnzZ Nov 1, 2022
a38cb06
refactor solution to resolve item dependencies using providers
BurnzZ Nov 3, 2022
4134457
fix typing for callback_for()
BurnzZ Nov 3, 2022
213549a
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Nov 22, 2022
9a00b63
move test utilies into scrapy_poet/utils/
BurnzZ Nov 23, 2022
49136cb
create recursive dependency resolution
BurnzZ Nov 24, 2022
a2260d7
add more test cases
BurnzZ Nov 29, 2022
9816f42
update ItemProvider to dynamically handle its dependency signature
BurnzZ Nov 30, 2022
86b7a97
code cleanup and fix some tests
BurnzZ Nov 30, 2022
7b8c7f2
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Nov 30, 2022
20f51a6
detect and raise errors on deadlocks
BurnzZ Nov 30, 2022
4b60fa9
fix failing injector test
BurnzZ Nov 30, 2022
caa1be6
ensure that provider dependencies are cached
BurnzZ Nov 30, 2022
ae05e90
modify deadlock detection to a simple try-except
BurnzZ Dec 1, 2022
d6a33a4
fix failing test_injection.py tests
BurnzZ Dec 1, 2022
a4cff73
ensure that .to_item() methods are only called once
BurnzZ Dec 1, 2022
6bc839f
add a test with a deeper dependency tree
BurnzZ Dec 1, 2022
4aedf16
test duplicate dependencies
BurnzZ Dec 1, 2022
56028d7
fix missing tests and imports
BurnzZ Dec 1, 2022
41ff13e
deprecate passing tuples in SCRAPY_POET_OVERRIDES and the Registry wi…
BurnzZ Dec 2, 2022
2ec6414
refactor Injector to simplify recursive dependency resolution of items
BurnzZ Dec 5, 2022
f3fb32d
polish code and tests
BurnzZ Dec 6, 2022
544236f
fix failing mypy and polish code
BurnzZ Dec 6, 2022
29f40ab
update CHANGELOG with new item class support
BurnzZ Dec 6, 2022
66f0c90
fix typo in CHANGELOG
BurnzZ Dec 6, 2022
2697ab0
improve test_web_poet_rules.py
BurnzZ Dec 6, 2022
35b0c8d
polishing comments and typing
BurnzZ Dec 9, 2022
d2beaf8
mention backward incompatible changes in CHANGELOG
BurnzZ Dec 12, 2022
d046903
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Dec 16, 2022
8f1450a
deprecate some settings, modules, and parameters to be overrides-agno…
BurnzZ Dec 16, 2022
6f0d36e
update documentation in line with the new Item Return functionality
BurnzZ Dec 16, 2022
77cf77c
update tutorial with more explanation on how Item Return works
BurnzZ Dec 16, 2022
efbdb66
update CHANGELOG to mention other backward incompatible changes
BurnzZ Dec 21, 2022
9b4cd48
add and improve docstrings, typing, and warning msgs
BurnzZ Dec 21, 2022
5d2f0f9
move some functions to new scrapy_poet.utils.testing module
BurnzZ Dec 21, 2022
58577a8
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Dec 21, 2022
afc04e9
Apply improvements from code review
BurnzZ Dec 21, 2022
4141239
prioritize newer settings than deprecated ones
BurnzZ Dec 21, 2022
dae69d8
simplify to_return doc example
BurnzZ Dec 22, 2022
ccfa9ea
fix and improve docs
BurnzZ Dec 23, 2022
e9bb33d
use DummyResponse on some examples
BurnzZ Dec 23, 2022
3667cc3
remove obsolete test
BurnzZ Dec 23, 2022
22c959d
Polish CHANGELOG from review
BurnzZ Jan 3, 2023
545e8f1
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 3, 2023
83e0e84
fix missing imports in tests
BurnzZ Jan 3, 2023
47f213c
rename 'item type' → 'item class'
BurnzZ Jan 3, 2023
914a334
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 4, 2023
6af1061
Fix conflicts; Merge branch 'new-web-poet' of ssh://github.com/scrapi…
BurnzZ Jan 4, 2023
190e3a6
use web-poet's _create_deprecated_class
BurnzZ Jan 6, 2023
2611199
remove incorrect line in CHANGELOG
BurnzZ Jan 6, 2023
7bd6783
remove scrapy-poet registry in lieu of web-poet's registry
BurnzZ Jan 10, 2023
3c6fdae
avoid using RulesRegistry.search() since it's slow
BurnzZ Jan 10, 2023
ef01f11
add test to check higher priority of PO subclass
BurnzZ Jan 10, 2023
f41b5c2
Merge pull request #103 from scrapinghub/to-return-override-docs
BurnzZ Jan 10, 2023
c658317
use RulesRegistry.search() again after optimizing it
BurnzZ Jan 10, 2023
3e852d7
fix doc grammar
BurnzZ Jan 11, 2023
4d25d8c
mark tests as xfail if it raises UndeclaredProvidedTypeError
BurnzZ Jan 13, 2023
e184c6f
better tests for clashing rules due to independent page objects with …
BurnzZ Jan 13, 2023
bf9b7bf
fix misleading class names
BurnzZ Jan 13, 2023
33a0391
add more tests on deadlock detection
BurnzZ Jan 13, 2023
141c495
use new web-poet==0.7.0
BurnzZ Jan 18, 2023
3d464e6
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 18, 2023
8c410fe
fixed merge conflicts in CHANGELOG
BurnzZ Jan 18, 2023
00d5dd6
improve docs on settings
BurnzZ Jan 19, 2023
fd31c93
Merge branch 'master' into new-web-poet
BurnzZ Jan 19, 2023
199c46b
fix conflict in code
BurnzZ Jan 19, 2023
7c1f5f1
add test for checking deprecated SCRAPY_POET_OVERRIDES
BurnzZ Jan 19, 2023
44c6e60
add test when requesting an item but no page object
BurnzZ Jan 19, 2023
4791576
issue a warning when can't provide a page object or item for a given URL
BurnzZ Jan 19, 2023
e3b7a8e
remove support for custom registry via SCRAPY_POET_OVERRIDES_REGISTRY
BurnzZ Jan 19, 2023
0915b00
re-organize CHANGELOG
BurnzZ Jan 19, 2023
a46b1e2
fix some docs and comments for clarity
BurnzZ Jan 30, 2023
774619c
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 30, 2023
140239a
bump tool versions to fix CI failure
kmike Jan 30, 2023
1 change: 1 addition & 0 deletions pyproject.toml
@@ -8,6 +8,7 @@ multi_line_output = 3
[[tool.mypy.overrides]]
module = [
"tests.test_downloader.*",
"tests.test_web_poet_rules.*",
"tests.test_scrapy_dependencies.*",
]
# Ignore this type of error since mypy expects an Iterable return
46 changes: 25 additions & 21 deletions scrapy_poet/api.py
@@ -1,5 +1,5 @@
from inspect import iscoroutinefunction
from typing import Callable, Optional, Type
from typing import Any, Callable, Optional

from scrapy.http import Request, Response
from web_poet.pages import ItemPage
@@ -29,8 +29,9 @@ def __init__(self, url: str, request=Optional[Request]):
super().__init__(url=url, request=request)


def callback_for(page_cls: Type[ItemPage]) -> Callable:
"""Create a callback for an :class:`web_poet.pages.ItemPage` subclass.
def callback_for(page_or_item_cls: Any) -> Callable:
"""Create a callback for an :class:`web_poet.pages.ItemPage` subclass or an
item type.

The generated callback returns the output of the
``ItemPage.to_item()`` method, i.e. extracts a single item
@@ -104,24 +105,27 @@ def parse(self, response):
disk queues, because in this case Scrapy is able to serialize
your request object.
"""
if not issubclass(page_cls, ItemPage):
raise TypeError(f"{page_cls.__name__} should be a subclass of ItemPage.")

# When the callback is used as an instance method of the spider, it expects
# to receive 'self' as its first argument. When used as a simple inline
# function, it expects to receive a response as its first argument.
#
# To avoid a TypeError, we need to receive a list of unnamed arguments and
# a dict of named arguments after our injectable.
def parse(*args, page: page_cls, **kwargs): # type: ignore
yield page.to_item() # type: ignore

async def async_parse(*args, page: page_cls, **kwargs): # type: ignore
yield await page.to_item() # type: ignore

if iscoroutinefunction(page_cls.to_item):
setattr(async_parse, _CALLBACK_FOR_MARKER, True)
return async_parse
if issubclass(page_or_item_cls, ItemPage):
# When the callback is used as an instance method of the spider, it expects
# to receive 'self' as its first argument. When used as a simple inline
# function, it expects to receive a response as its first argument.
#
# To avoid a TypeError, we need to receive a list of unnamed arguments and
# a dict of named arguments after our injectable.
def parse(*args, page: page_or_item_cls, **kwargs): # type: ignore
yield page.to_item() # type: ignore

async def async_parse(*args, page: page_or_item_cls, **kwargs): # type: ignore
yield await page.to_item() # type: ignore

if iscoroutinefunction(page_or_item_cls.to_item):
setattr(async_parse, _CALLBACK_FOR_MARKER, True)
return async_parse

else:

def parse(*args, item: page_or_item_cls, **kwargs): # type:ignore
yield item

setattr(parse, _CALLBACK_FOR_MARKER, True)
return parse
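
For context on the api.py change above: callback_for() now also accepts an item class, in which case the generated callback simply yields the injected item instead of calling page.to_item(). A minimal usage sketch, not taken from this PR — MyItem and the product URL are hypothetical, and a registered page object declared to return MyItem for those URLs is assumed:

import attrs
import scrapy

from scrapy_poet import callback_for


@attrs.define
class MyItem:
    # Hypothetical item class; some registered page object is assumed to
    # declare that it returns MyItem for the crawled URLs.
    name: str


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["https://example.com/products/1"]

    # The generated callback receives the item through dependency injection
    # and yields it as-is (see the `else` branch added above).
    parse = callback_for(MyItem)
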
2 changes: 2 additions & 0 deletions scrapy_poet/downloadermiddlewares.py
@@ -17,6 +17,7 @@
from .page_input_providers import (
HttpClientProvider,
HttpResponseProvider,
ItemProvider,
PageParamsProvider,
RequestUrlProvider,
ResponseUrlProvider,
@@ -31,6 +32,7 @@
PageParamsProvider: 700,
RequestUrlProvider: 800,
ResponseUrlProvider: 900,
ItemProvider: 1000,
}

InjectionMiddlewareTV = TypeVar("InjectionMiddlewareTV", bound="InjectionMiddleware")
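
The ItemProvider registered above is what lets a callback depend on an item class directly. A rough sketch of the intended flow, under the assumption of web-poet >= 0.7 with its handle_urls registry; Product and ProductPage are hypothetical names, not part of this PR:

import attrs
import scrapy
from web_poet import WebPage, handle_urls


@attrs.define
class Product:
    # Hypothetical item class.
    name: str


@handle_urls("example.com")
class ProductPage(WebPage[Product]):
    # Hypothetical page object registered to return Product for example.com.
    def to_item(self) -> Product:
        return Product(name=self.css("h1::text").get(""))


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products/1"]

    def parse(self, response, item: Product):
        # ItemProvider consults the registry, builds ProductPage, calls its
        # to_item(), and injects the resulting Product instance here.
        yield item
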
78 changes: 69 additions & 9 deletions scrapy_poet/injection.py
@@ -57,7 +57,7 @@ def load_providers(self, default_providers: Optional[Mapping] = None): # noqa:
}
provider_classes = build_component_list(providers_dict)
logger.info(f"Loading providers:\n {pprint.pformat(provider_classes)}")
self.providers = [load_object(cls)(self.crawler) for cls in provider_classes]
self.providers = [load_object(cls)(self) for cls in provider_classes]
check_all_providers_are_callable(self.providers)
# Caching whether each provider requires the scrapy response
self.is_provider_requiring_scrapy_response = {
@@ -141,13 +141,68 @@ def build_plan(self, request: Request) -> andi.Plan:
overrides=self.overrides_registry.overrides_for(request).get,
)

def provider_requirements(self, request: Request, plan: andi.Plan) -> Set[Any]:
"""Return a set of classes which indicate any requirements needed by a
provider in order to successfully provide for the given ``request`` and
``plan``.
"""
provider_requirements = set()
for cls, _ in plan.dependencies:
for provider in self.providers:
if not provider.is_provided(cls):
continue
classes = provider.requirements_for(cls, request)
if classes:
provider_requirements.update(set(classes))
return provider_requirements

@inlineCallbacks
def build_provider_requirements(
self, request: Request, response: Response, plan: andi.Plan
):
"""This builds out any requirements that a provider might need before
calling them.

The instances that are built here would later be used in andi's
'externally_provided' parameter when calling the providers.
"""

provider_requirements = self.provider_requirements(request, plan)
provider_requirements_instances = (
yield from self.build_instances_from_providers(
request, response, provider_requirements
)
)

# TODO: recursive requirements resolution on POs that need
# items which are fulfilled by other POs.

for prov_req in provider_requirements:
for cls, kwargs_spec in andi.plan(prov_req, is_injectable=is_injectable):
if cls not in provider_requirements_instances.keys():
provider_requirements_instances[cls] = cls(
**kwargs_spec.kwargs(provider_requirements_instances)
)

return provider_requirements_instances

@inlineCallbacks
def build_instances(self, request: Request, response: Response, plan: andi.Plan):
"""Build the instances dict from a plan including external dependencies."""
# First we build the external dependencies using the providers
instances = yield from self.build_instances_from_providers(

provider_requirements_instances = yield self.build_provider_requirements(
request, response, plan
)

dependencies = {cls for cls, _ in plan.dependencies}

instances = yield from self.build_instances_from_providers(
request,
response,
dependencies,
externally_provided=provider_requirements_instances,
)

# All the remaining dependencies are internal so they can be built just
# following the andi plan.
for cls, kwargs_spec in plan.dependencies:
@@ -158,17 +213,22 @@ def build_instances(self, request: Request, response: Response, plan: andi.Plan)

@inlineCallbacks
def build_instances_from_providers(
self, request: Request, response: Response, plan: andi.Plan
self,
request: Request,
response: Response,
dependencies: Set,
externally_provided=None,
):
"""Build dependencies handled by registered providers"""
instances: Dict[Callable, Any] = {}
scrapy_provided_dependencies = self.available_dependencies_for_providers(
request, response
)
dependencies_set = {cls for cls, _ in plan.dependencies}
externally_provided = externally_provided or {}
externally_provided.update(scrapy_provided_dependencies)
for provider in self.providers:
provided_classes = {
cls for cls in dependencies_set if provider.is_provided(cls)
cls for cls in dependencies if provider.is_provided(cls)
}
provided_classes -= instances.keys() # ignore already provided types
if not provided_classes:
@@ -197,11 +257,11 @@ def build_instances_from_providers(

if not objs:
kwargs = andi.plan(
provider,
provider.dynamic_call_signature or provider,
is_injectable=is_injectable,
externally_provided=scrapy_provided_dependencies,
externally_provided=externally_provided,
full_final_kwargs=False,
).final_kwargs(scrapy_provided_dependencies)
).final_kwargs(externally_provided)
try:

# Invoke the provider to get the data
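
To make the intent of provider_requirements() and build_provider_requirements() above more concrete, here is a rough sketch of the kind of nested dependency the recursive resolution is meant to cover: a page object that itself needs an item produced by another page object. All class names are hypothetical and web-poet >= 0.7 field/Returns support is assumed; this is an illustration, not code from the PR:

import attrs
from web_poet import WebPage, field, handle_urls


@attrs.define
class Breadcrumbs:
    # Hypothetical item produced by one page object...
    categories: list


@attrs.define
class Product:
    # ...and consumed, indirectly, by another.
    name: str
    categories: list


@handle_urls("example.com")
class BreadcrumbsPage(WebPage[Breadcrumbs]):
    @field
    def categories(self):
        return self.css(".breadcrumbs a::text").getall()


@handle_urls("example.com")
@attrs.define
class ProductPage(WebPage[Product]):
    # Declaring an item dependency: the injector resolves it by building
    # BreadcrumbsPage first, calling its to_item(), and passing the result in.
    breadcrumbs: Breadcrumbs

    @field
    def name(self):
        return self.css("h1::text").get("")

    @field
    def categories(self):
        return self.breadcrumbs.categories
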
87 changes: 67 additions & 20 deletions scrapy_poet/overrides.py
@@ -1,12 +1,24 @@
import logging
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Mapping, Optional, Tuple, Type, Union
from typing import (
Any,
Callable,
Dict,
Iterable,
List,
Mapping,
Optional,
Tuple,
Type,
Union,
)
from warnings import warn

from scrapy import Request
from scrapy.crawler import Crawler
from url_matcher import Patterns, URLMatcher
from web_poet import ItemPage
from web_poet import ItemPage, RulesRegistry
from web_poet.rules import ApplyRule

logger = logging.getLogger(__name__)
@@ -26,7 +38,7 @@ def overrides_for(self, request: Request) -> Mapping[Callable, Callable]:
pass


class OverridesRegistry(OverridesRegistryBase):
class OverridesRegistry(OverridesRegistryBase, RulesRegistry):
"""
Overrides registry that reads the overrides from the ``SCRAPY_POET_OVERRIDES``
in the spider settings. It is a list and each rule can be a tuple or an
@@ -89,14 +101,18 @@ class OverridesRegistry(OverridesRegistryBase):
def from_crawler(cls, crawler: Crawler) -> Crawler:
return cls(crawler.settings.getlist("SCRAPY_POET_OVERRIDES", []))

def __init__(self, rules: Optional[Iterable[RuleFromUser]] = None) -> None:
self.rules: List[ApplyRule] = []
self.matcher: Dict[Type[ItemPage], URLMatcher] = defaultdict(URLMatcher)
for rule in rules or []:
self.add_rule(rule)
logger.debug(f"List of parsed ApplyRules:\n{self.rules}")
def __init__(self, rules: Optional[Iterable[ApplyRule]] = None) -> None:
super().__init__(rules=rules)
self.overrides_matcher: Dict[Type[ItemPage], URLMatcher] = defaultdict(
URLMatcher
)
self.item_matcher: Dict[Any, URLMatcher] = defaultdict(URLMatcher)
for rule_id, rule in enumerate(self._rules):
self.add_rule(rule_id, rule)
logger.debug(f"List of parsed ApplyRules:\n{self._rules}")

def add_rule(self, rule: RuleFromUser) -> None:
def add_rule(self, rule_id: int, rule: ApplyRule) -> None:
# TODO: deprecate this, alongside the tests and docs; Update CHANGELOG
if isinstance(rule, (tuple, list)):
if len(rule) != 3:
raise ValueError(
@@ -108,16 +124,47 @@ def add_rule(self, rule: RuleFromUser) -> None:
rule = ApplyRule(
for_patterns=Patterns([pattern]), use=use, instead_of=instead_of
)
self.rules.append(rule)
# FIXME: This key will change with the new rule.to_return
self.matcher[rule.instead_of].add_or_update( # type: ignore
len(self.rules) - 1, rule.for_patterns
)

def overrides_for(self, request: Request) -> Mapping[Callable, Callable]:
overrides: Dict[Callable, Callable] = {}
for instead_of, matcher in self.matcher.items():
# A common case when a PO subclasses another one with the same URL
# pattern. See the test_item_return_subclass() test case.
matched = self.item_matcher[rule.to_return]
if [
pattern
for pattern in matched.patterns.values()
if pattern == rule.for_patterns
]:
# TODO: It would be great to also list down the rules having the
# same URL pattern. But this would require some refactoring.
warn(
f"A similar URL pattern {list(matched.patterns.values())} has been "
f"declared earlier which uses to_return={rule.to_return}. When "
f"matching URLs against rules, the latest declared rule is used. "
f"Consider explicitly updating the priority of the rules containing "
f"the said URL pattern to easily match the expectations when reading "
f"the code."
)

if rule.instead_of:
self.overrides_matcher[rule.instead_of].add_or_update(
rule_id, rule.for_patterns
)
if rule.to_return:
self.item_matcher[rule.to_return].add_or_update(rule_id, rule.for_patterns)

# TODO: These URL matching functionalities could be moved to web-poet.

def _run_matcher(
self, request: Request, url_matcher
) -> Mapping[Callable, Callable]:
result: Dict[Callable, Callable] = {}
for target, matcher in url_matcher.items():
rule_id = matcher.match(request.url)
if rule_id is not None:
overrides[instead_of] = self.rules[rule_id].use
return overrides
result[target] = self._rules[rule_id].use
return result

def overrides_for(self, request: Request) -> Mapping[Callable, Callable]:
return self._run_matcher(request, self.overrides_matcher)

def page_object_for_item(self, request: Request) -> Mapping[Callable, Callable]:
return self._run_matcher(request, self.item_matcher)
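
Finally, a hedged sketch of SCRAPY_POET_OVERRIDES entries (normally placed in a project's Scrapy settings) that exercise both matchers above: `instead_of` rules feed overrides_matcher / overrides_for(), while `to_return` rules feed item_matcher / page_object_for_item(). The page object and item names are hypothetical:

import attrs
from url_matcher import Patterns
from web_poet import WebPage
from web_poet.rules import ApplyRule


@attrs.define
class Product:
    # Hypothetical item class.
    name: str


class GenericProductPage(WebPage[Product]):
    # Hypothetical spider-agnostic page object.
    ...


class ExampleProductPage(GenericProductPage):
    # Hypothetical site-specific page object.
    ...


SCRAPY_POET_OVERRIDES = [
    # Matched by overrides_for(): use ExampleProductPage wherever the spider
    # asks for GenericProductPage on example.com URLs.
    ApplyRule(
        for_patterns=Patterns(["example.com"]),
        use=ExampleProductPage,
        instead_of=GenericProductPage,
    ),
    # Matched by page_object_for_item(): when a callback asks for a Product
    # item on example.com URLs, build it through ExampleProductPage.
    ApplyRule(
        for_patterns=Patterns(["example.com"]),
        use=ExampleProductPage,
        to_return=Product,
    ),
]
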