feat: Make timeout error log cleaner #1170

Merged
Pijukatel merged 15 commits into master from cleaner-error-logging on Apr 28, 2025

Conversation

@Pijukatel (Collaborator) commented Apr 23, 2025

Description

Make timeout-based error logs in the request handler more readable by filtering clutter from the stack trace.
Add a one-line warning, with the most important stack trace line when it can be extracted, for failed request handler calls.

Issues

Closes: Improve logging by suppressing irrelevant parts of stack trace

@Pijukatel added the enhancement (New feature or request.) and adhoc (Ad-hoc unplanned task added during the sprint.) labels on Apr 23, 2025
@github-actions bot added this to the 113rd sprint - Tooling team milestone on Apr 23, 2025
@github-actions bot added the t-tooling (Issues with this label are in the ownership of the tooling team.) and tested (Temporary label used only programmatically for some analytics.) labels on Apr 23, 2025
@Pijukatel requested a review from Copilot on April 23, 2025 10:49
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR makes timeout-based error logs in the BasicCrawler more readable by refining the logging messages and reducing traceback clutter. Key changes include:

  • Adding a new helper method (_get_message_from_error) to filter and format error messages; a sketch of the underlying idea follows this list.
  • Modifying _handle_failed_request to log errors using the new formatted messages.
  • Adding a new unit test (test_reduced_logs_from_timed_out_request_handler) to verify that the simplified logs meet expectations.
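
A minimal sketch of the filtering idea behind such a helper (illustrative only; the function name and the path-based heuristic below are assumptions, not the PR's exact implementation):

```python
# Illustrative sketch: drop traceback frames that originate in asyncio
# internals, keeping the user-facing frames plus the final error line.
import traceback


def format_error_without_asyncio_frames(error: BaseException) -> str:
    """Format an exception, keeping only frames outside asyncio internals."""
    frames = traceback.extract_tb(error.__traceback__)
    relevant = [frame for frame in frames if 'asyncio' not in frame.filename]
    lines = traceback.format_list(relevant)
    lines += traceback.format_exception_only(type(error), error)
    return ''.join(lines)
```

Filtering by file path like this keeps the user's handler line (the one marked # INJECTED TO PROVOKE TIMEOUT in the examples below) while dropping the asyncio plumbing.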

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Reviewed files:

tests/unit/crawlers/_basic/test_basic_crawler.py: Adds a unit test ensuring the log output is reduced and includes the custom log comments.
src/crawlee/crawlers/_basic/_basic_crawler.py: Refines error logging by introducing helper methods and updating error message formatting.

@Pijukatel (Collaborator, Author) commented Apr 23, 2025

Example log reduction:

Look for # INJECTED TO PROVOKE TIMEOUT in the logs; that is the relevant line that timed out.

Timeout in user defined request handler:

Before:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
          await final_context_consumer(cast('TCrawlingContext', crawling_context))
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
          return await self._default_handler(context)
        File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
          await asyncio.sleep(10)  # INJECTED TO PROVOKE TIMEOUT
        File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
          return await future
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
        File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1150, in __run_task_function
          await self._run_request_handler(context=context)
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1247, in _run_request_handler
          await wait_for(
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 39, in wait_for
          raise asyncio.TimeoutError(timeout_message) from ex
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

After:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
 Traceback (most recent call last):

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)

  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
    await asyncio.sleep(10)  # INJECTED TO PROVOKE TIMEOUT

asyncio.exceptions.CancelledError

asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

Timeout in some internal part of the crawlee pipeline:

Before:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
          result = await middleware_instance.__anext__()
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 811, in _check_url_after_redirects
          await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT
        File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
          return await future
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
        File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1151, in __run_task_function
          await self._run_request_handler(context=context)
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1248, in _run_request_handler
          await wait_for(
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 39, in wait_for
          raise asyncio.TimeoutError(timeout_message) from ex
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

After:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
 Traceback (most recent call last):

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
    result = await middleware_instance.__anext__()

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 811, in _check_url_after_redirects
    await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT

asyncio.exceptions.CancelledError

asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

Timeout in user defined pre-navigation hook:

Before:

[BeautifulSoupCrawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
          result = await middleware_instance.__anext__()
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py", line 112, in _execute_pre_navigation_hooks
          await hook(context)
        File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py", line 162, in pre_nav_hook
          await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT
        File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
          return await future
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
        File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1150, in __run_task_function
          await self._run_request_handler(context=context)
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1247, in _run_request_handler
          await wait_for(
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 39, in wait_for
          raise asyncio.TimeoutError(timeout_message) from ex
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

After:

[BeautifulSoupCrawler] ERROR Request failed and reached maximum retries
 Traceback (most recent call last):

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
    result = await middleware_instance.__anext__()

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py", line 112, in _execute_pre_navigation_hooks
    await hook(context)

  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py", line 162, in pre_nav_hook
    await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT

asyncio.exceptions.CancelledError

asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

@Pijukatel requested review from vdusek and B4nan on April 23, 2025 11:11
@Pijukatel marked this pull request as ready for review on April 23, 2025 11:12
@B4nan (Member) commented Apr 23, 2025

It's surely better, but it still feels long. I would just show the single line that points to the originating file, the one from userland, not from dependencies. Ideally, get rid of all the other clutter like "Traceback (most recent call last):" or "asyncio.exceptions.CancelledError" as well. I'd also drop all the blank lines.

Keep in mind we are talking about request retries, those should only produce a warning. The final failure can be bigger since that is what you need to spot easily, but for retries we shouldn't pollute the logs.

Looking at how it works in the JS version, we don't even show the stack trace for timeouts, and it's all one-liners:

(screenshot: one-line timeout logs from the JS version)

@Pijukatel (Collaborator, Author) commented Apr 23, 2025

> It's surely better, but it still feels long. I would just show the single line that points to the originating file, the one from userland, not from dependencies. Ideally, get rid of all the other clutter like "Traceback (most recent call last):" or "asyncio.exceptions.CancelledError" as well. I'd also drop all the blank lines.

Yes, I can safely drop the "Traceback (most recent call last):", "asyncio.exceptions.CancelledError", and blank lines, but I would not recommend going any further. The timeout error in this place can actually originate from our code as well (not only from user code), and at least a minimal traceback is necessary to debug it. So the final result would look like:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
   File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))
  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)
  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
    await asyncio.sleep(10)  # INJECTED TO PROVOKE TIMEOUT
asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

> Keep in mind we are talking about request retries, those should only produce a warning. The final failure can be bigger since that is what you need to spot easily, but for retries we shouldn't pollute the logs.

> Looking at how it works in the JS version, we don't even show the stack trace for timeouts, and it's all one-liners:

I am only focusing on the final error (ERROR Request failed and reached maximum retries). In the Python version, I do not think we even print warnings for failed retries; only when a request reaches the maximum number of retries do we log it, and then it is an error.

@B4nan (Member) commented Apr 23, 2025

> The timeout error in this place can actually originate from our code as well (not only from user code), and at least a minimal traceback is necessary to debug it.

My point is we can detect that and skip the irrelevant stuff based on that.
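
A rough sketch of how such detection might look, assuming frames are classified by whether their file path lies inside the installed crawlee package (the helper below is hypothetical, not necessarily what the PR implements):

```python
# Hypothetical userland-frame detection; CRAWLEE_ROOT and find_user_frame
# are illustrative names, and the PR may use a different heuristic.
import traceback
from pathlib import Path

import crawlee

CRAWLEE_ROOT = Path(crawlee.__file__).parent


def find_user_frame(error: BaseException) -> traceback.FrameSummary | None:
    """Return the innermost frame outside crawlee and asyncio, if any."""
    for frame in reversed(traceback.extract_tb(error.__traceback__)):
        if CRAWLEE_ROOT not in Path(frame.filename).parents and 'asyncio' not in frame.filename:
            return frame
    # Every frame is internal, so the timeout likely originated in crawlee
    # itself and the caller should fall back to a fuller traceback.
    return None
```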

> In the Python version, I do not think we even print warnings for failed retries; only when a request reaches the maximum number of retries do we log it, and then it is an error.

IMO we should print them; you should know when things get retried. But it should definitely be just a one-liner, since it only needs to inform you that things are retrying. If we skipped those, you could see nothing for a few minutes; we need to show the progress.

@B4nan (Member) commented Apr 23, 2025

Here is a better example of how the JS version behaves; all the request handler does is throw an error. We show warnings with a one-line stack trace pointing to the user code that triggered it, and for the final failure we show the full stack trace.

(screenshot: JS version showing one-line retry warnings and a full stack trace for the final failure)

@Pijukatel (Collaborator, Author) commented

> ...But it should definitely be just a one-liner, since it only needs to inform you that things are retrying...

I am actually quite skeptical about the possibility of reducing the stack trace to a single line in the general case while maintaining the required information value. I guess the best we can do for a one-line retry log is something like f"Retrying request to {request_url} due to {top_level_error_type} {optional_one_line_that_can_be_extracted_sometimes}".

I think we can reduce an exception to a meaningful one-liner only in some specific cases (let's start with timeouts and add more if possible); for the others, this retry log would simply not have the optional one-line part.
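
A sketch of how that retry one-liner could be assembled, assuming the innermost traceback frame stands in for the optionally extractable line (the names below are illustrative):

```python
# Illustrative composition of the proposed one-line retry warning.
import traceback


def format_retry_warning(request_url: str, error: BaseException) -> str:
    """Build a one-line warning: URL, error type, and an optional frame."""
    message = f'Retrying request to {request_url} due to {type(error).__name__}.'
    frames = traceback.extract_tb(error.__traceback__)
    if frames:
        # Treat the innermost frame as the "most important" line when present.
        frame = frames[-1]
        message += f' File "{frame.filename}", line {frame.lineno}, in {frame.name}, {frame.line}'
    return message
```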

Add one-line retry warning with extra stack trace line for asyncio timeouts
@Pijukatel (Collaborator, Author) commented

With the latest commit, it would produce, for example, this:

[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to TimeoutError. File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to TimeoutError. File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] ERROR Request to http://a.com/ failed and reached maximum retries
   File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))
  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)
  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
    await asyncio.sleep(10)  # INJECTED DELAY
TimeoutError: Request handler timed out after 1.0 seconds

@Pijukatel force-pushed the cleaner-error-logging branch from 3214eb2 to 3bbdcec on April 24, 2025 09:49
@vdusek (Collaborator) left a comment

Please update uv to the latest version to reduce the upload_time field-related changes in the lock file.

@Pijukatel requested a review from vdusek on April 25, 2025 12:30
@vdusek (Collaborator) left a comment

TBH, I would slightly prefer having a longer explanation (a longer part of the traceback) rather than just a single word (the name of the exception, TimeoutError) as the retry reason. I believe it might be valuable.

Or at least have an option to include a whole or shortened version of the traceback. We can also open a follow-up issue for this.

However, if we have consensus on this new state, I am OK with that.

@Pijukatel (Collaborator, Author) commented

Based on the feedback from the call, this is the example log now:

[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to: Request handler timed out after 1.0 seconds 
File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1336, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to: Request handler timed out after 1.0 seconds 
File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1336, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] ERROR Request to http://a.com/ failed and reached maximum retries
   File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))
  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)
  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1336, in handler
    await asyncio.sleep(10)  # INJECTED DELAY
TimeoutError: Request handler timed out after 1.0 seconds

@vdusek (Collaborator) left a comment

One more thing: since this is a user-facing change, I believe it should be included in the changelog, maybe as fix?

@Pijukatel changed the title from "chore: Cleaner timeout error logging" to "fix: Make timeout error log cleaner" on Apr 28, 2025
@Pijukatel changed the title from "fix: Make timeout error log cleaner" to "feat: Make timeout error log cleaner" on Apr 28, 2025
@Pijukatel (Collaborator, Author) commented

> One more thing: since this is a user-facing change, I believe it should be included in the changelog, maybe as fix?

I will use feat, since it was not broken before.

@Pijukatel merged commit 78ea9d2 into master on Apr 28, 2025
25 checks passed
@Pijukatel deleted the cleaner-error-logging branch on April 28, 2025 13:28