Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: performance evaluation #61

Open
wants to merge 8 commits into
base: feat/block-media
Choose a base branch
from
Open

Conversation

jirispilka
Copy link
Collaborator

Based on @matyascimbulka's suggestion, I refactored the code and moved preNavigationHooks to a separate function so that selecting blockMedia: true/false does not create a new instance of the crawler.

There may be a better way to block media, but it didn’t work for me—perhaps @metalwarrior665 can help here?

preNavigationHooks: [
    async ({ blockRequests }) => {
        // Block all requests to URLs that include `adsbygoogle.js` and also all defaults.
        await blockRequests({
            extraUrlPatterns: ['adsbygoogle.js'],
        });
    },
],

Another issue (#60) in standby mode causes multiple crawlers to be created without reason. I’ll leave this for a separate PR.

And some number not as good as I hoped for but still it is an improvement
image

Copy link
Collaborator

@matyascimbulka matyascimbulka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing the changes. There was no need to start new crawler for blocking media.

I'm not sure why the blockRequests function doesn't work. But the page.route function seems to be the way to go for this use case (outside of Crawlee).

Copy link
Contributor

@MQ37 MQ37 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 And thank you for fixing this, I haven't noticed that it spawns another crawler instance.

Copy link
Member

@metalwarrior665 metalwarrior665 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's test the perf a bit more

* Only blocks resources if blockMedia is true.
*/
async function blockMediaResourcesHook({ page, request }: PlaywrightCrawlingContext<ContentCrawlerUserData>) {
await page.route('**/*', async (route) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

page.route disables native browser cache which is why blockRequests is normally recommended (that is a native Chromium CDP call). The cache disabling is only bad if you do more requests for the same site. I would do a perf test on more URLs of the same site and test more sites because this could slow us down as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants