feat: Make timeout error log cleaner #1170

Merged
Pijukatel merged 15 commits into master from cleaner-error-logging on Apr 28, 2025

Conversation

@Pijukatel (Collaborator) commented Apr 23, 2025

Description

Make timeout-based error logs in the request handler more readable by filtering clutter from the stack trace.
Add a one-line warning, with the most important stack trace line when it can be extracted, for failed request handler calls.

Issues

Closes: Improve logging by suppressing irrelevant parts of stack trace

@Pijukatel added the enhancement (New feature or request.) and adhoc (Ad-hoc unplanned task added during the sprint.) labels on Apr 23, 2025
@github-actions bot added this to the 113rd sprint - Tooling team milestone on Apr 23, 2025
@github-actions bot added the t-tooling (Issues with this label are in the ownership of the tooling team.) and tested (Temporary label used only programmatically for some analytics.) labels on Apr 23, 2025
@Pijukatel requested a review from Copilot on April 23, 2025 10:49
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR makes timeout-based error logs in the BasicCrawler more readable by refining the logging messages and reducing traceback clutter. Key changes include:

  • Adding a new helper method (_get_message_from_error) to filter and format error messages; a sketch of the underlying idea follows this list.
  • Modifying _handle_failed_request to log errors using the new formatted messages.
  • Adding a new unit test (test_reduced_logs_from_timed_out_request_handler) to verify that the simplified logs meet expectations.
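
A minimal sketch of the filtering idea behind such a helper (illustrative only; the function name and the path-based heuristic below are assumptions, not the PR's exact implementation):

```python
# Illustrative sketch: drop traceback frames that originate in asyncio
# internals, keeping the user-facing frames plus the final error line.
import traceback


def format_error_without_asyncio_frames(error: BaseException) -> str:
    """Format an exception, keeping only frames outside asyncio internals."""
    frames = traceback.extract_tb(error.__traceback__)
    relevant = [frame for frame in frames if 'asyncio' not in frame.filename]
    lines = traceback.format_list(relevant)
    lines += traceback.format_exception_only(type(error), error)
    return ''.join(lines)
```

Filtering by file path like this keeps the user's handler line (the one marked # INJECTED TO PROVOKE TIMEOUT in the examples below) while dropping the asyncio plumbing.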

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Reviewed files:

tests/unit/crawlers/_basic/test_basic_crawler.py: Adds a unit test ensuring the log output is reduced and includes the custom log comments.
src/crawlee/crawlers/_basic/_basic_crawler.py: Refines error logging by introducing helper methods and updating error message formatting.

@Pijukatel (Collaborator, Author) commented Apr 23, 2025

Example log reduction:

Look for # INJECTED TO PROVOKE TIMEOUT in the logs; that is the relevant line that timed out.

Timeout in user defined request handler:

Before:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
          await final_context_consumer(cast('TCrawlingContext', crawling_context))
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
          return await self._default_handler(context)
        File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
          await asyncio.sleep(10)  # INJECTED TO PROVOKE TIMEOUT
        File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
          return await future
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
        File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1150, in __run_task_function
          await self._run_request_handler(context=context)
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1247, in _run_request_handler
          await wait_for(
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 39, in wait_for
          raise asyncio.TimeoutError(timeout_message) from ex
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

After:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
 Traceback (most recent call last):

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)

  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
    await asyncio.sleep(10)  # INJECTED TO PROVOKE TIMEOUT

asyncio.exceptions.CancelledError

asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

Timeout in some internal part of the crawlee pipeline:

Before:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
          result = await middleware_instance.__anext__()
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 811, in _check_url_after_redirects
          await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT
        File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
          return await future
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
        File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1151, in __run_task_function
          await self._run_request_handler(context=context)
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1248, in _run_request_handler
          await wait_for(
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 39, in wait_for
          raise asyncio.TimeoutError(timeout_message) from ex
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

After:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
 Traceback (most recent call last):

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
    result = await middleware_instance.__anext__()

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 811, in _check_url_after_redirects
    await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT

asyncio.exceptions.CancelledError

asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

Timeout in user defined pre-navigation hook:

Before:

[BeautifulSoupCrawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
          result = await middleware_instance.__anext__()
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py", line 112, in _execute_pre_navigation_hooks
          await hook(context)
        File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py", line 162, in pre_nav_hook
          await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT
        File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
          return await future
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
        File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1150, in __run_task_function
          await self._run_request_handler(context=context)
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1247, in _run_request_handler
          await wait_for(
        File "/home/pijukatel/repos/crawlee-python/src/crawlee/_utils/wait.py", line 39, in wait_for
          raise asyncio.TimeoutError(timeout_message) from ex
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

After:

[BeautifulSoupCrawler] ERROR Request failed and reached maximum retries
 Traceback (most recent call last):

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 68, in __call__
    result = await middleware_instance.__anext__()

  File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py", line 112, in _execute_pre_navigation_hooks
    await hook(context)

  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py", line 162, in pre_nav_hook
    await asyncio.sleep(10) # INJECTED TO PROVOKE TIMEOUT

asyncio.exceptions.CancelledError

asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

@Pijukatel requested review from vdusek and B4nan on April 23, 2025 11:11
@Pijukatel marked this pull request as ready for review on April 23, 2025 11:12
@B4nan (Member) commented Apr 23, 2025

It's surely better, but it still feels long. I would just show the single line that points to the originating file, the one from userland, not from dependencies. Ideally, get rid of all the other clutter like "Traceback (most recent call last):" or "asyncio.exceptions.CancelledError" as well. I'd also drop all the blank lines.

Keep in mind we are talking about request retries, those should only produce a warning. The final failure can be bigger since that is what you need to spot easily, but for retries we shouldn't pollute the logs.

Looking at how it works in the JS version, we don't even show the stack trace for timeouts, and it's all one-liners:

(screenshot: one-line timeout logs from the JS version)

@Pijukatel (Collaborator, Author) commented Apr 23, 2025

> It's surely better, but it still feels long. I would just show the single line that points to the originating file, the one from userland, not from dependencies. Ideally, get rid of all the other clutter like "Traceback (most recent call last):" or "asyncio.exceptions.CancelledError" as well. I'd also drop all the blank lines.

Yes, I can safely drop the "Traceback (most recent call last):", "asyncio.exceptions.CancelledError", and blank lines, but I would not recommend going any further. The timeout error in this place can actually originate from our code as well (not only from user code), and at least a minimal traceback is necessary to debug it. So the final result would look like:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
   File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))
  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)
  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
    await asyncio.sleep(10)  # INJECTED TO PROVOKE TIMEOUT
asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

> Keep in mind we are talking about request retries, those should only produce a warning. The final failure can be bigger since that is what you need to spot easily, but for retries we shouldn't pollute the logs.

> Looking at how it works in the JS version, we don't even show the stack trace for timeouts, and it's all one-liners:

I am only focusing on the final error (ERROR Request failed and reached maximum retries). In the Python version, I do not think we even print warnings for failed retries; only when a request reaches the maximum number of retries do we log it, and then it is an error.

@B4nan (Member) commented Apr 23, 2025

> The timeout error in this place can actually originate from our code as well (not only from user code), and at least a minimal traceback is necessary to debug it.

My point is we can detect that and skip the irrelevant stuff based on that.
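
A rough sketch of how such detection might look, assuming frames are classified by whether their file path lies inside the installed crawlee package (the helper below is hypothetical, not necessarily what the PR implements):

```python
# Hypothetical userland-frame detection; CRAWLEE_ROOT and find_user_frame
# are illustrative names, and the PR may use a different heuristic.
import traceback
from pathlib import Path

import crawlee

CRAWLEE_ROOT = Path(crawlee.__file__).parent


def find_user_frame(error: BaseException) -> traceback.FrameSummary | None:
    """Return the innermost frame outside crawlee and asyncio, if any."""
    for frame in reversed(traceback.extract_tb(error.__traceback__)):
        if CRAWLEE_ROOT not in Path(frame.filename).parents and 'asyncio' not in frame.filename:
            return frame
    # Every frame is internal, so the timeout likely originated in crawlee
    # itself and the caller should fall back to a fuller traceback.
    return None
```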

> In the Python version, I do not think we even print warnings for failed retries; only when a request reaches the maximum number of retries do we log it, and then it is an error.

IMO we should print them; you should know when things get retried. But it should definitely be just a one-liner, since it only needs to inform you that things are retrying. If we skipped those, you could see nothing for a few minutes; we need to show the progress.

@B4nan (Member) commented Apr 23, 2025

Here is a better example of how the JS version behaves; all the request handler does is throw an error. We show warnings with a one-line stack trace pointing to the user code that triggered it, and for the final failure we show the full stack trace.

(screenshot: JS version showing one-line retry warnings and a full stack trace for the final failure)

@Pijukatel (Collaborator, Author) commented

> ...But it should definitely be just a one-liner, since it only needs to inform you that things are retrying...

I am actually quite skeptical about the possibility of reducing the stack trace to a single line in the general case while maintaining the required information value. I guess the best we can do for a one-line retry log is something like f"Retrying request to {request_url} due to {top_level_error_type} {optional_one_line_that_can_be_extracted_sometimes}".

I think we can reduce an exception to a meaningful one-liner only in some specific cases (let's start with timeouts and add more if possible); for the others, this retry log would simply not have the optional one-line part.
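
A sketch of how that retry one-liner could be assembled, assuming the innermost traceback frame stands in for the optionally extractable line (the names below are illustrative):

```python
# Illustrative composition of the proposed one-line retry warning.
import traceback


def format_retry_warning(request_url: str, error: BaseException) -> str:
    """Build a one-line warning: URL, error type, and an optional frame."""
    message = f'Retrying request to {request_url} due to {type(error).__name__}.'
    frames = traceback.extract_tb(error.__traceback__)
    if frames:
        # Treat the innermost frame as the "most important" line when present.
        frame = frames[-1]
        message += f' File "{frame.filename}", line {frame.lineno}, in {frame.name}, {frame.line}'
    return message
```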

Add one-line retry warning with extra stack trace line for asyncio timeouts
@Pijukatel (Collaborator, Author) commented

With the latest commit, it would produce, for example, this:

[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to TimeoutError. File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to TimeoutError. File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] ERROR Request to http://a.com/ failed and reached maximum retries
   File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))
  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)
  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1304, in handler
    await asyncio.sleep(10)  # INJECTED DELAY
TimeoutError: Request handler timed out after 1.0 seconds

@Pijukatel force-pushed the cleaner-error-logging branch from 3214eb2 to 3bbdcec on April 24, 2025 09:49
@vdusek (Collaborator) left a comment

Please update uv to the latest version to reduce the upload_time field-related changes in the lock file.

@Pijukatel requested a review from vdusek on April 25, 2025 12:30
@vdusek (Collaborator) left a comment

TBH, I would slightly prefer having a longer explanation (a longer part of the traceback) rather than just a single word (the name of the exception, TimeoutError) as the retry reason. I believe it might be valuable.

Or at least have an option to include a whole or shortened version of the traceback. We can also open a follow-up issue for this.

However, if we have consensus on this new state, I am OK with that.

@Pijukatel (Collaborator, Author) commented

Based on the feedback from the call, this is the example log now:

[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to: Request handler timed out after 1.0 seconds 
File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1336, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to http://a.com/ due to: Request handler timed out after 1.0 seconds 
File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1336, in handler,     await asyncio.sleep(10)  # INJECTED DELAY
[crawlee.crawlers._basic._basic_crawler] ERROR Request to http://a.com/ failed and reached maximum retries
   File "/home/pijukatel/repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
    await final_context_consumer(cast('TCrawlingContext', crawling_context))
  File "/home/pijukatel/repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
    return await self._default_handler(context)
  File "/home/pijukatel/repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1336, in handler
    await asyncio.sleep(10)  # INJECTED DELAY
TimeoutError: Request handler timed out after 1.0 seconds

@vdusek (Collaborator) left a comment

One more thing: since this is a user-facing change, I believe it should be included in the changelog, maybe as fix?

@Pijukatel changed the title from "chore: Cleaner timeout error logging" to "fix: Make timeout error log cleaner" on Apr 28, 2025
@Pijukatel changed the title from "fix: Make timeout error log cleaner" to "feat: Make timeout error log cleaner" on Apr 28, 2025
@Pijukatel (Collaborator, Author) commented

> One more thing: since this is a user-facing change, I believe it should be included in the changelog, maybe as fix?

I will use feat, since it was not broken before.

@Pijukatel merged commit 78ea9d2 into master on Apr 28, 2025
25 checks passed
@Pijukatel deleted the cleaner-error-logging branch on April 28, 2025 13:28