-
Notifications
You must be signed in to change notification settings - Fork 376
feat: Make timeout error log cleaner #1170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR makes timeout-based error logs in the BasicCrawler more readable by refining the logging messages and reducing traceback clutter. Key changes include:
- Adding a new helper method (_get_message_from_error) to filter and format error messages.
- Modifying _handle_failed_request to log errors using the new formatted messages.
- Adding a new unit test (test_reduced_logs_from_timed_out_request_handler) to verify that the simplified logs meet expectations.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/unit/crawlers/_basic/test_basic_crawler.py | Adds a unit test ensuring the log output is reduced and includes custom log comments. |
src/crawlee/crawlers/_basic/_basic_crawler.py | Refines error logging by introducing helper methods and updating error message formatting. |
Example log reduction: Look for # INJECTED TO PROVOKE TIMEOUT in the logs as that is the relevant line that timed out. Timeout in user defined request handler:Before:
After:
Timeout in some internal part of the crawlee pipeline:Before:
After:
Timeout in user defined pre-navigation hook:Before:
After:
|
Yes, I can safely drop the "Traceback (most recent call last):", "asyncio.exceptions.CancelledError" and blank lines, but I would not recommend to go any further. The timeout error in this place can actually originate from our code as well(not only from user code) and at least minimal trace-back is necessary to debug it. So final result would be like:
I am only focusing on the final error ( ERROR Request failed and reached maximum retries). In Python version I do not think we even print those failed retries warnings unless it reaches maximum retry and then it is error. |
My point is we can detect that and skip the irrelevant stuff based on that.
IMO we should print them, you should know when things get retried. But it should definitely be just a one liner for those, since it should just inform you things are retrying. If we skip those, you could see nothing for a few minutes, we need to show the progress. |
I am actually quite skeptical about the possibility of reducing stack trace to single line in general case, while maintaining the required information value. I guess the best we can do about one line log for retries is something like f"Retrying request to {request_url} due to {top_level_error_type} {optional_one_line_that_can_be_extracted_sometimes}" I think we can reduce exception to meaningful one line only in some specific cases(lets start with timeout and add more if possible) and for others this retry log would not have the one line optional part. |
Add online retry warning with extra stacktrace line for Asyncio timeouts
With latest commit it would do for example this
|
3214eb2
to
3bbdcec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update uv
to the latest version to reduce the upload_time
field-related changes in the lock file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I would slightly prefer having a longer explanation (longer part of the traceback) than just a single word (name of the exception - TimeoutError
) for retry reasons. I believe it might be valuable.
Or at least have an option to include a whole/shortened version of the traceback. We can also open a follow-up issue to this.
However, if we have consensus with this new state, I am ok with that.
Based on the feedback from the call, this is the example log now:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing - since this is user-facing change, I believe it should be included in the changelog, maybe fix?
I will add feat, since it was not broken before. |
Description
Make timeout based error logs in request handler more readable by filtering clutter in stack trace.
Add one line warning with optional most important stack trace line for failed request handler calls.
Issues