feat: performance evaluation #61
base: feat/block-media
Thank you for implementing the changes. There was no need to start a new crawler for blocking media.
I'm not sure why the blockRequests function doesn't work, but the page.route function seems to be the way to go for this use case (outside of Crawlee).
LGTM 👍 And thank you for fixing this, I hadn't noticed that it spawns another crawler instance.
Let's test the perf a bit more
 * Only blocks resources if blockMedia is true.
 */
async function blockMediaResourcesHook({ page, request }: PlaywrightCrawlingContext<ContentCrawlerUserData>) {
    await page.route('**/*', async (route) => {
page.route disables the native browser cache, which is why blockRequests is normally recommended (it is a native Chromium CDP call). Disabling the cache only hurts if you make multiple requests to the same site. I would run a perf test on more URLs from the same site, and on more sites, because this could slow us down as well.
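For reference, a minimal sketch of what a page.route based handler for this could look like. The `RouteLike` and `RequestLike` interfaces are hypothetical stand-ins that mirror only the Playwright `Route` and `Request` methods used here, so the snippet runs without Playwright installed. The set of blocked resource types is an assumption, not taken from this PR.

```typescript
// Hypothetical structural types mirroring the parts of Playwright's
// Route/Request API used below (request(), resourceType(), abort(), continue()).
interface RequestLike {
    resourceType(): string;
}

interface RouteLike {
    request(): RequestLike;
    abort(): Promise<void>;
    continue(): Promise<void>;
}

// Assumption: which resource types count as "media" is a guess for illustration.
const BLOCKED_RESOURCE_TYPES = new Set<string>(['image', 'media', 'font']);

// Pure decision function: block only when blockMedia is on and the type matches.
function shouldBlock(resourceType: string, blockMedia: boolean): boolean {
    return blockMedia && BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Route handler: abort blocked resource types, let everything else through.
async function handleRoute(route: RouteLike, blockMedia: boolean): Promise<void> {
    if (shouldBlock(route.request().resourceType(), blockMedia)) {
        await route.abort();
    } else {
        await route.continue();
    }
}
```

With a real Playwright page this would presumably be wired up as `await page.route('**/*', (route) => handleRoute(route, blockMedia))`. The cache concern above still applies to this approach, since it goes through page.route.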
Based on @matyascimbulka's suggestion, I refactored the code and moved preNavigationHooks to a separate function so that selecting blockMedia: true/false does not create a new instance of the crawler.
There may be a better way to block media, but it didn't work for me. Perhaps @metalwarrior665 can help here?
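A rough sketch of the idea, assuming the flag travels with each request: the hook is registered once and decides per request whether to install the route, so a single crawler instance can serve both settings. `ContextLike`, `PageLike`, and the `userData.blockMedia` field are hypothetical names for illustration. The actual `ContentCrawlerUserData` shape in this PR may differ.

```typescript
// Hypothetical minimal shapes; the real hook receives a PlaywrightCrawlingContext.
interface HookRoute {
    request(): { resourceType(): string };
    abort(): Promise<void>;
    continue(): Promise<void>;
}

interface PageLike {
    route(pattern: string, handler: (route: HookRoute) => Promise<void>): Promise<void>;
}

interface ContextLike {
    page: PageLike;
    // Assumption: the per-request flag is stored in request.userData.
    request: { userData: { blockMedia?: boolean } };
}

// Assumption: illustrative list of resource types to drop.
const MEDIA_TYPES = ['image', 'media', 'font'];

// Registered once as a pre-navigation hook; does nothing unless the
// current request asks for media blocking.
async function blockMediaHook({ page, request }: ContextLike): Promise<void> {
    if (!request.userData.blockMedia) return;
    await page.route('**/*', async (route) => {
        if (MEDIA_TYPES.includes(route.request().resourceType())) {
            await route.abort();
        } else {
            await route.continue();
        }
    });
}
```

The hook would then be passed once, for example as `preNavigationHooks: [blockMediaHook]`, instead of building a second crawler for the blockMedia: true case.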
Another issue (#60) in standby mode causes multiple crawlers to be created without reason. I’ll leave this for a separate PR.
And some numbers: not as good as I had hoped, but still an improvement.
