Commit f582076

Add session management (#193)
1 parent 919bf42 commit f582076

21 files changed: +3724 -52 lines changed

.coveragerc

+12
@@ -0,0 +1,12 @@
1+
[run]
2+
branch = true
3+
include = scrapy_zyte_api/*
4+
omit =
5+
    tests/*
6+
disable_warnings = include-ignored
7+
8+
[report]
9+
# https://github.com/nedbat/coveragepy/issues/831#issuecomment-517778185
10+
exclude_lines =
11+
    pragma: no cover
12+
    if TYPE_CHECKING:

.github/workflows/test.yml

+3-2
@@ -58,8 +58,9 @@ jobs:
5858
run: |
5959
tox -e ${{ matrix.toxenv || 'py' }}
6060
- name: coverage
61-
if: ${{ success() }}
62-
run: bash <(curl -s https://codecov.io/bash)
61+
uses: codecov/codecov-action@v4
62+
with:
63+
token: ${{ secrets.CODECOV_TOKEN }}
6364

6465
check:
6566
runs-on: ubuntu-latest

CHANGES.rst

+10-1
@@ -1,6 +1,15 @@
11
Changes
22
=======
33

4+
N.N.N (YYYY-MM-DD)
5+
------------------
6+
7+
* The recommended position for ``ScrapyZyteAPIDownloaderMiddleware`` changed
8+
from 1000 to 633, to accommodate the new
9+
``ScrapyZyteAPISessionDownloaderMiddleware``, which needs to be after
10+
``ScrapyZyteAPIDownloaderMiddleware`` and before the Scrapy cookie downloader
11+
middleware (700).
12+
413
0.18.4 (2024-06-10)
514
-------------------
615

@@ -396,7 +405,7 @@ When upgrading, you should set the following in your Scrapy settings:
396405
.. code-block:: python
397406
398407
DOWNLOADER_MIDDLEWARES = {
399-
"scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
408+
"scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 633,
400409
}
401410
# only applicable for Scrapy 2.7+
402411
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"

docs/conf.py

+4
@@ -56,4 +56,8 @@
5656
"https://zyte-common-items.readthedocs.io/en/latest",
5757
None,
5858
),
59+
"zyte-spider-templates": (
60+
"https://zyte-spider-templates.readthedocs.io/en/latest",
61+
None,
62+
),
5963
}

docs/index.rst

+1
@@ -26,6 +26,7 @@ either :ref:`globally <transparent>` or :ref:`per request <automap>`, or
2626
usage/default
2727
usage/retry
2828
usage/scrapy-poet
29+
usage/session
2930
usage/stats
3031
usage/fingerprint
3132
usage/proxy

docs/reference/meta.rst

+43
@@ -86,3 +86,46 @@ string.
8686
<https://github.com/jd/tenacity/issues/147>`_.
8787

8888
See :ref:`retry`.
89+
90+
91+
.. reqmeta:: zyte_api_session_enabled
92+
93+
zyte_api_session_enabled
94+
========================
95+
96+
Default: :setting:`ZYTE_API_SESSION_ENABLED`
97+
98+
Whether to use :ref:`scrapy-zyte-api session management <session>` for the
99+
request (``True``) or not (``False``).
100+
101+
102+
.. reqmeta:: zyte_api_session_location
103+
104+
zyte_api_session_location
105+
=========================
106+
107+
Default: ``{}``
108+
109+
Address for ``setLocation``-based session initialization. See
110+
:setting:`ZYTE_API_SESSION_LOCATION` for details.
111+
112+
This request metadata key, if not empty, takes precedence over the
113+
:setting:`ZYTE_API_SESSION_LOCATION` setting, the
114+
:setting:`ZYTE_API_SESSION_PARAMS` setting, and the
115+
:reqmeta:`zyte_api_session_params` request metadata key.
116+
117+
118+
.. reqmeta:: zyte_api_session_params
119+
120+
zyte_api_session_params
121+
=======================
122+
123+
Default: ``{}``
124+
125+
Parameters to use for session initialization. See
126+
:setting:`ZYTE_API_SESSION_PARAMS` for details.
127+
128+
This request metadata key, if not empty, takes precedence over the
129+
:setting:`ZYTE_API_SESSION_PARAMS` setting, but it can be overridden
130+
by the :setting:`ZYTE_API_SESSION_LOCATION` setting or the
131+
:reqmeta:`zyte_api_session_location` request metadata key.

docs/reference/request.rst

+1-1
@@ -173,7 +173,7 @@ They will be mapped even if defined with their default value.
173173
Headers will also be mapped if set to a non-default value elsewhere, e.g. in a
174174
custom downloader middleware, as long as it is done before the scrapy-zyte-api
175175
downloader middleware, which is responsible for the mapping, processes the
176-
request. Here “before” means a lower value than ``1000`` in the
176+
request. Here “before” means a lower value than ``633`` in the
177177
:setting:`DOWNLOADER_MIDDLEWARES <scrapy:DOWNLOADER_MIDDLEWARES>` setting.
178178

179179
Similarly, you can add any of those headers to the

docs/reference/settings.rst

+257-1
@@ -198,7 +198,8 @@ ZYTE_API_MAX_REQUESTS
198198
Default: ``None``
199199

200200
When set to an integer value > 0, the spider will close when the number of Zyte
201-
API requests reaches it.
201+
API requests reaches it, with ``closespider_max_zapi_requests`` as the close
202+
reason.
202203

203204
Note that requests with error responses that cannot be retried or exceed their
204205
retry limit also count here.
@@ -246,6 +247,261 @@ subclass.
246247
See :ref:`retry`.
247248

248249

250+
.. setting:: ZYTE_API_SESSION_CHECKER
251+
252+
ZYTE_API_SESSION_CHECKER
253+
========================
254+
255+
Default: ``None``
256+
257+
A :ref:`Scrapy component <topics-components>` (or its import path as a string)
258+
that defines a ``check`` method.
259+
260+
If ``check`` returns ``True``, the response session is considered valid; if
261+
``check`` returns ``False``, the response session is considered invalid, and
262+
will be discarded. ``check`` can also raise a
263+
:exc:`~scrapy.exceptions.CloseSpider` exception to close the spider.
264+
265+
If defined, the ``check`` method is called on every response that is using a
266+
:ref:`session managed by scrapy-zyte-api <session>`. If not defined, the
267+
default implementation checks the outcome of the ``setLocation`` action if
268+
session initialization was location-based, as described in
269+
:ref:`session-check`.
270+
271+
Example:
272+
273+
.. code-block:: python
274+
    :caption: settings.py
275+
276+
    from scrapy import Request
277+
    from scrapy.http.response import Response
278+
279+
280+
    class MySessionChecker:
281+
282+
        def check(self, request: Request, response: Response) -> bool:
283+
            return bool(response.css(".is_valid"))
284+
285+
286+
    ZYTE_API_SESSION_CHECKER = MySessionChecker
287+
288+
Because the session checker is a Scrapy component, you can access the crawler
289+
object, for example to read settings:
290+
291+
.. code-block:: python
292+
    :caption: settings.py
293+
294+
    from scrapy import Request
295+
    from scrapy.http.response import Response
296+
297+
298+
    class MySessionChecker:
299+
300+
        @classmethod
301+
        def from_crawler(cls, crawler):
302+
            return cls(crawler)
303+
304+
        def __init__(self, crawler):
305+
            location = crawler.settings["ZYTE_API_SESSION_LOCATION"]
306+
            self.postal_code = location["postalCode"]
307+
308+
        def check(self, request: Request, response: Response) -> bool:
309+
            return response.css(".postal_code::text").get() == self.postal_code
310+
311+
312+
    ZYTE_API_SESSION_CHECKER = MySessionChecker
313+
314+
315+
.. setting:: ZYTE_API_SESSION_ENABLED
316+
317+
ZYTE_API_SESSION_ENABLED
318+
========================
319+
320+
Default: ``False``
321+
322+
Enables :ref:`scrapy-zyte-api session management <session>`.
323+
324+
325+
.. setting:: ZYTE_API_SESSION_LOCATION
326+
327+
ZYTE_API_SESSION_LOCATION
328+
=========================
329+
330+
Default: ``{}``
331+
332+
If defined, sessions are initialized using the ``setLocation``
333+
:http:`action <request:actions>`, and the value of this setting must be the
334+
target address :class:`dict`. For example:
335+
336+
.. code-block:: python
337+
:caption: settings.py
338+
339+
ZYTE_API_SESSION_LOCATION = {"postalCode": "10001"}
340+
341+
If the :setting:`ZYTE_API_SESSION_PARAMS` setting or the
342+
:reqmeta:`zyte_api_session_params` request metadata key set a ``"url"``, it
343+
will be used for session initialization as well. Otherwise, the URL of the
344+
request for which the session is being initialized will be used instead.
345+
346+
This setting, if not empty, takes precedence over the
347+
:setting:`ZYTE_API_SESSION_PARAMS` setting and the
348+
:reqmeta:`zyte_api_session_params` request metadata key, but it can be
349+
overridden by the :reqmeta:`zyte_api_session_location` request metadata key.
350+
351+
To disable the :setting:`ZYTE_API_SESSION_LOCATION` setting on a specific
352+
request, e.g. to use the :setting:`ZYTE_API_SESSION_PARAMS` setting or the
353+
:reqmeta:`zyte_api_session_params` request metadata key instead, set
354+
the :reqmeta:`zyte_api_session_location` request metadata key to ``{}``.
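
The precedence rules described for ``ZYTE_API_SESSION_LOCATION``, ``ZYTE_API_SESSION_PARAMS`` and their request metadata counterparts can be sketched in plain Python. This is only an illustration of the documented behavior, not the code from this commit: ``resolve_session_init_params`` is a hypothetical helper, and settings and request metadata are modeled as plain dictionaries.

```python
from typing import Any, Dict


def resolve_session_init_params(
    settings: Dict[str, Any], meta: Dict[str, Any], request_url: str
) -> Dict[str, Any]:
    """Hypothetical helper: pick the parameters for session initialization.

    It mirrors the documented precedence only; it is not the library's code.
    """
    # A location in request metadata wins over everything else. An explicit
    # empty dict there disables the location setting for this request.
    if "zyte_api_session_location" in meta:
        location = meta["zyte_api_session_location"]
    else:
        location = settings.get("ZYTE_API_SESSION_LOCATION", {})

    # Request metadata params, if not empty, win over the params setting.
    params = meta.get("zyte_api_session_params") or settings.get(
        "ZYTE_API_SESSION_PARAMS", {"browserHtml": True}
    )

    # A "url" in the params is honored in both modes; otherwise the URL of
    # the request that needs the session is used.
    url = params.get("url", request_url)

    if location:
        # Location-based initialization, written out as the equivalent
        # setLocation action.
        return {
            "url": url,
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": location}],
        }
    return {**params, "url": url}
```

Checking ``"zyte_api_session_location" in meta`` rather than its truthiness is what lets a request opt out of the location setting with ``{}``.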
355+
356+
357+
.. setting:: ZYTE_API_SESSION_MAX_BAD_INITS
358+
359+
ZYTE_API_SESSION_MAX_BAD_INITS
360+
==============================
361+
362+
Default: ``8``
363+
364+
The maximum number of consecutive :ref:`scrapy-zyte-api sessions <session>` per
365+
pool that may fail their session check right after creation. If the
366+
maximum is reached, the spider closes with ``bad_session_inits`` as the close
367+
reason.
368+
369+
To override this value for specific pools, use
370+
:setting:`ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL`.
371+
372+
373+
.. setting:: ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL
374+
375+
ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL
376+
=======================================
377+
378+
Default: ``{}``
379+
380+
:class:`dict` where keys are :ref:`pool <session-pools>` IDs and values are
381+
overrides of :setting:`ZYTE_API_SESSION_MAX_BAD_INITS` for those pools.
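
As a rough illustration of how ``ZYTE_API_SESSION_MAX_BAD_INITS`` and its per-pool overrides could interact, the sketch below counts consecutive failed initializations per pool. ``BadInitTracker`` is a hypothetical class, and resetting the streak after a successful initialization is an assumption drawn from the "in a row" wording, not something this commit confirms.

```python
from typing import Dict, Optional


class BadInitTracker:
    """Hypothetical tracker of consecutive failed session initializations."""

    def __init__(
        self, default_max: int = 8, per_pool: Optional[Dict[str, int]] = None
    ):
        self.default_max = default_max
        self.per_pool = per_pool or {}
        self._streaks: Dict[str, int] = {}

    def max_for(self, pool: str) -> int:
        # A per-pool override wins; other pools use the global default.
        return self.per_pool.get(pool, self.default_max)

    def record(self, pool: str, passed_check: bool) -> Optional[str]:
        """Return a close reason once the pool's limit is reached, else None."""
        if passed_check:
            # Assumption: a good initialization resets the streak.
            self._streaks[pool] = 0
            return None
        self._streaks[pool] = self._streaks.get(pool, 0) + 1
        if self._streaks[pool] >= self.max_for(pool):
            return "bad_session_inits"
        return None
```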
382+
383+
384+
.. setting:: ZYTE_API_SESSION_MAX_ERRORS
385+
386+
ZYTE_API_SESSION_MAX_ERRORS
387+
===========================
388+
389+
Default: ``1``
390+
391+
Maximum number of :ref:`unsuccessful responses
392+
<zyte-api-unsuccessful-responses>` allowed for any given session before
393+
discarding the session.
394+
395+
You might want to increase this number if you find that a session may continue
396+
to work even after an unsuccessful response. See :ref:`optimize-sessions`.
397+
398+
.. note:: This setting does not affect session checks
399+
(:setting:`ZYTE_API_SESSION_CHECKER`). A session is always discarded the
400+
first time it fails its session check.
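
The difference between unsuccessful responses and failed session checks can be sketched as follows. ``SessionErrorTracker`` is a hypothetical class; it reads the default of ``1`` as "discard on the first unsuccessful response", and it assumes errors accumulate without being reset by later successes. Neither point is confirmed by this commit.

```python
from collections import defaultdict
from typing import DefaultDict


class SessionErrorTracker:
    """Hypothetical bookkeeping for deciding when to discard a session."""

    def __init__(self, max_errors: int = 1):
        self.max_errors = max_errors
        self._errors: DefaultDict[str, int] = defaultdict(int)

    def should_discard(
        self, session_id: str, *, successful: bool, passed_check: bool = True
    ) -> bool:
        # A failed session check discards the session immediately,
        # regardless of the error allowance.
        if not passed_check:
            return True
        if not successful:
            self._errors[session_id] += 1
        # Assumption: errors accumulate and are never reset by successes.
        return self._errors[session_id] >= self.max_errors
```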
401+
402+
403+
.. setting:: ZYTE_API_SESSION_PARAMS
404+
405+
ZYTE_API_SESSION_PARAMS
406+
=======================
407+
408+
Default: ``{"browserHtml": True}``
409+
410+
Parameters to use for session initialization.
411+
412+
It works similarly to :http:`request:sessionContextParams` from
413+
:ref:`server-managed sessions <zyte-api-session-contexts>`, but it supports
414+
arbitrary Zyte API parameters instead of a specific subset.
415+
416+
If it does not define a ``"url"``, the URL of the request for which the session
417+
is being initialized will be used.
418+
419+
This setting can be overridden by the :setting:`ZYTE_API_SESSION_LOCATION`
420+
setting, the :reqmeta:`zyte_api_session_location` request metadata key, or the
421+
:reqmeta:`zyte_api_session_params` request metadata key.
422+
423+
Example:
424+
425+
.. code-block:: python
426+
    :caption: settings.py
427+
428+
    ZYTE_API_SESSION_PARAMS = {
429+
        "browserHtml": True,
430+
        "actions": [
431+
            {
432+
                "action": "setLocation",
433+
                "address": {"postalCode": "10001"},
434+
            }
435+
        ],
436+
    }
437+
438+
.. tip:: The example above is equivalent to setting
439+
:setting:`ZYTE_API_SESSION_LOCATION` to ``{"postalCode": "10001"}``.
440+
441+
442+
.. setting:: ZYTE_API_SESSION_POOL_SIZE
443+
444+
ZYTE_API_SESSION_POOL_SIZE
445+
==========================
446+
447+
Default: ``8``
448+
449+
The maximum number of active :ref:`scrapy-zyte-api sessions <session>` to keep
450+
per :ref:`pool <session-pools>`.
451+
452+
To override this value for specific pools, use
453+
:setting:`ZYTE_API_SESSION_POOL_SIZES`.
454+
455+
Increase this number to lower the frequency with which requests are sent
456+
through each session, which on some websites may increase the lifetime of each
457+
session. See :ref:`optimize-sessions`.
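
To see why a larger pool lowers the request frequency per session, the following sketch distributes a fixed number of requests across a pool. It assumes strict round-robin rotation, which is a simplification, and ``requests_per_session`` is a hypothetical helper.

```python
from collections import Counter
from itertools import cycle, islice


def requests_per_session(pool_size: int, total_requests: int) -> Counter:
    """Count how many requests each session gets under round-robin rotation."""
    sessions = [f"session-{i}" for i in range(pool_size)]
    return Counter(islice(cycle(sessions), total_requests))
```

Under that assumption, 24 requests give each session 3 requests with a pool of 8, and 2 requests with a pool of 12.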
458+
459+
460+
.. setting:: ZYTE_API_SESSION_POOL_SIZES
461+
462+
ZYTE_API_SESSION_POOL_SIZES
463+
===========================
464+
465+
Default: ``{}``
466+
467+
:class:`dict` where keys are :ref:`pool <session-pools>` IDs and values are
468+
overrides of :setting:`ZYTE_API_SESSION_POOL_SIZE` for those pools.
469+
470+
471+
.. setting:: ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS
472+
473+
ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS
474+
===================================
475+
476+
Default: ``60``
477+
478+
scrapy-zyte-api maintains a rotation queue of ready-to-use sessions per
479+
:ref:`pool <session-pools>`. At times, the queue might be empty for a
480+
given pool because all its sessions are in the process of being initialized or
481+
refreshed.
482+
483+
If the queue is empty when trying to assign a session to a request,
484+
scrapy-zyte-api will wait some time
485+
(:setting:`ZYTE_API_SESSION_QUEUE_WAIT_TIME`), and then try to get a session
486+
from the queue again.
487+
488+
Use this setting to configure the maximum number of attempts before giving up
489+
and raising a :exc:`RuntimeError` exception.
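
The wait-and-retry behavior described above can be sketched with ``asyncio``. ``get_session`` is a hypothetical function, the queue is modeled as a ``deque`` of session IDs, and putting a session back at the end of the queue is an assumption about how rotation works.

```python
import asyncio
from collections import deque
from typing import Deque


async def get_session(
    queue: Deque[str], max_attempts: int = 60, wait_time: float = 1.0
) -> str:
    """Take the next ready session, waiting between attempts if none is ready."""
    for attempt in range(1, max_attempts + 1):
        if queue:
            session_id = queue.popleft()
            queue.append(session_id)  # assumption: sessions rotate round-robin
            return session_id
        if attempt < max_attempts:
            # Corresponds to ZYTE_API_SESSION_QUEUE_WAIT_TIME.
            await asyncio.sleep(wait_time)
    # Corresponds to exhausting ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS.
    raise RuntimeError(f"No session available after {max_attempts} attempts")
```

With the defaults, a request would wait close to a minute before the ``RuntimeError`` is raised.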
490+
491+
492+
.. setting:: ZYTE_API_SESSION_QUEUE_WAIT_TIME
493+
494+
ZYTE_API_SESSION_QUEUE_WAIT_TIME
495+
================================
496+
497+
Default: ``1.0``
498+
499+
Number of seconds to wait between attempts to get a session from a rotation
500+
queue.
501+
502+
See :setting:`ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS` for details.
503+
504+
249505
.. setting:: ZYTE_API_SKIP_HEADERS
250506

251507
ZYTE_API_SKIP_HEADERS
