
feat: Add Cheerio content crawler #52

Merged · matyascimbulka merged 13 commits into apify:master on Mar 7, 2025

Conversation

matyascimbulka
Collaborator

Based on #48, this PR implements the Cheerio crawler as a second option for crawling the target web pages. To enable this feature, a useCheerioCrawler input property and query parameter have been added. The default value for these options is false.

The Cheerio crawler works in both standby and normal mode. In standby mode, the Actor will pre-create all 3 crawlers (searchCrawler, playwrightContentCrawler, and cheerioContentCrawler).

@metalwarrior665 @jirispilka I seem to be unable to add anyone for code review. Could you add yourself?
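
For illustration, enabling the new option in standby mode should just be a matter of adding the query parameter to the request, roughly like this (the host and the query field are placeholders, not taken from this PR):

GET https://<actor-standby-host>/search?query=example&useCheerioCrawler=true

and in normal mode via the Actor input, where only useCheerioCrawler is the field introduced by this PR:

{
    "query": "example",
    "useCheerioCrawler": true
}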

Collaborator

@jirispilka left a comment


Thank you! A couple of minor comments below.

I think the major question is whether to introduce crawlerType instead of useCheerioCrawler. I would opt for crawlerType: it will simplify the code and is future-proof.

Comment on lines 139 to 144
"useCheerioCrawler": {
"title": "Use Cheerio Crawler",
"type": "boolean",
"description": "If enabled, the Actor uses the Cheerio Crawler to extract the target web page content.",
"default": false
},
Collaborator

@jirispilka commented on Mar 5, 2025

Should we rename the input attribute useCheerioCrawler?

I would make it similar to WCC: crawlerType: https://apify.com/apify/website-content-crawler/input-schema#crawlerType

This would also make it future-proof, e.g. when we need to add a new one. I think @metalwarrior665 posted about some new browser, specifically for an AI use case (I forgot the name).
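
For instance, a crawlerType field modeled on WCC could look roughly like this (a sketch only; the enum values and titles are illustrative, and the PR ultimately settled on a differently named field):

"crawlerType": {
    "title": "Crawler type",
    "type": "string",
    "editor": "select",
    "enum": ["playwright", "cheerio"],
    "enumTitles": ["Headless browser (Playwright)", "Raw HTTP (Cheerio)"],
    "default": "playwright"
},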

Collaborator Author

I agree, switching to a crawlerType input is a more future-proof and flexible solution.

Member

Agree on the dropdown. Not sure if we should call it Cheerio, since the users should not care what the crawler class is called in Crawlee. I would also not mention "crawler"; we are not even crawling much, it just scrapes a few pages from Google. I would define it as "browser" vs "plain HTTP". Not sure about the field name, maybe something like scrapingTool?

 * Get existing crawler based on crawlerOptions and scraperSettings, if not present -> create new
 */
-export const addPlaywrightCrawlRequest = async (
-    request: RequestOptions<PlaywrightCrawlerUserData>,
+export const addContentCrawlRequest = async (
Collaborator

Thanks! contentCrawler is indeed a way better name

src/main.ts Outdated
Comment on lines 81 to 84

await createAndStartAllCrawlers(
    searchCrawlerOptions,
    contentCrawlerOptions,
);
Collaborator

Suggested change
-await createAndStartAllCrawlers(
-    searchCrawlerOptions,
-    contentCrawlerOptions,
-);
+await createAndStartAllCrawlers(searchCrawlerOptions, contentCrawlerOptions);

Collaborator

This can fit on a single line, right?

Member

We are now experimenting with Prettier, so soon we might see more spreading in the code :/

src/main.ts Outdated
    });
} else {
    log.info('Actor is running in the NORMAL mode.');
    try {
-        await handleSearchNormalMode(input, cheerioCrawlerOptions, playwrightCrawlerOptions, playwrightScraperSettings);
+        await handleSearchNormalMode(input, searchCrawlerOptions, contentCrawlerOptions[0], contentScraperSettings);
Collaborator

You have to dig into the code to understand what’s in the first element of the contentCrawlerOptions[0] array, which isn’t ideal.

A more straightforward approach might be to introduce a crawlerType field. That way, contentCrawlerOptions could be set directly instead of using an array.

Collaborator Author

I agree that this isn't an ideal solution. The reason the processInput function returns an array of contentCrawlerOptions is the pre-creation of all crawlers in standby mode, where I need the options for all of the content crawlers.

To make this more understandable, I could create separate functions to process the input for standalone and standby modes. Both functions would internally use the already existing input-processing function, but they would return the data in a better format.

Member

This is bad; someone reordering the push can break it. I think it would be better to create distinct types for the normal vs standby start, so it is clear that one starts cheerio | playwright while the other starts both.
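
A minimal sketch of such distinct start types, assuming the option types exported by crawlee (the project may define its own wrappers, and all names here are illustrative):

import { CheerioCrawlerOptions, PlaywrightCrawlerOptions } from 'crawlee';

// Normal mode: the search crawler plus exactly one content crawler.
type NormalStartOptions = {
    searchCrawlerOptions: CheerioCrawlerOptions;
    contentCrawlerOptions: CheerioCrawlerOptions | PlaywrightCrawlerOptions;
};

// Standby mode: the search crawler plus both content crawlers, keyed explicitly
// so that reordering a push can no longer break anything.
type StandbyStartOptions = {
    searchCrawlerOptions: CheerioCrawlerOptions;
    contentCrawlerOptions: {
        cheerio: CheerioCrawlerOptions;
        playwright: PlaywrightCrawlerOptions;
    };
};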

src/utils.ts Outdated
@@ -38,16 +38,16 @@ export function randomId() {
 * The maxResults parameter is passed to the UserData object, when the request is handled it is used to limit
 * the number of search results without the created overhead.
 *
-* Also add the playwrightCrawlerKey to the UserData object to be able to identify the playwright crawler should
+* Also add the contentCrawlerKey to the UserData object to be able to identify the conten crawler should
Collaborator

Suggested change
-* Also add the contentCrawlerKey to the UserData object to be able to identify the conten crawler should
+* Also add the contentCrawlerKey to the UserData object to be able to identify the content crawler should


async function handleContent(
    $: CheerioCrawlingContext['$'],
    crawlerName: 'playwright' | 'cheerio',
Collaborator

If we have crawlerType defined, we can also use it here

src/search.ts Outdated
-    playwrightCrawlerOptions,
+const { contentCrawlerKey } = await createAndStartCrawlers(
+    searchCrawlerOptions,
+    contentCrawlerOptions[0],
Collaborator

The same here: when reading the code, one is not sure what is at element [0] without more context.

src/types.ts Outdated
| 'cheerio-request-end'
| 'cheerio-request-handler-start'
| 'cheerio-before-response-send'
| 'cheerio-failed-request'
Collaborator

cheerio-failed-request - duplicate declaration

Member

@metalwarrior665 left a comment

Thanks, just minor things. Could you also share some test runs?


"useCheerioCrawler": {
"title": "Use Cheerio Crawler",
"type": "boolean",
"description": "If enabled, the Actor uses the Cheerio Crawler to extract the target web page content.",
Member

The description should explain why to choose each option and give some performance examples (e.g., how long it takes with a browser vs plain HTTP).

Collaborator

Good point about the browser vs plain HTTP, and scrapingTool sounds good too.

src/crawlers.ts Outdated
* Creates and starts a Google search crawler and content crawlers for all provided configurations.
* A crawler won't be created if it already exists.
*/
export async function createAndStartAllCrawlers(
Member

Having one function called createAndStartCrawlers and another called createAndStartAllCrawlers is really confusing. I see the only difference is that the first starts only one content crawler while the other starts both? And that is probably because of how this works differently in the normal vs standby start? I would either get rid of this function and just inline the code, or make it a single function and pass in parameters to choose which crawlers to start.

Collaborator Author

Yes, the difference here is standalone vs standby mode. I'll remove the createAndStartAllCrawlers function and figure out how to use createAndStartCrawlers for both use cases.
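
A sketch of what that unified function might look like (the registry and all names are hypothetical, and the actual start/keep-alive logic is elided):

import { CheerioCrawler, CheerioCrawlerOptions, PlaywrightCrawler, PlaywrightCrawlerOptions } from 'crawlee';

type ContentCrawlerConfig =
    | { type: 'cheerio'; options: CheerioCrawlerOptions }
    | { type: 'playwright'; options: PlaywrightCrawlerOptions };

// Hypothetical registry so existing crawlers are reused rather than re-created.
const crawlers = new Map<string, CheerioCrawler | PlaywrightCrawler>();

// Normal mode passes a single content-crawler config; standby mode passes both.
async function createAndStartCrawlers(
    searchCrawlerOptions: CheerioCrawlerOptions,
    contentCrawlerConfigs: ContentCrawlerConfig[],
): Promise<void> {
    if (!crawlers.has('search')) {
        crawlers.set('search', new CheerioCrawler(searchCrawlerOptions));
    }
    for (const { type, options } of contentCrawlerConfigs) {
        if (crawlers.has(type)) continue; // a crawler won't be created if it already exists
        crawlers.set(type, type === 'cheerio'
            ? new CheerioCrawler(options)
            : new PlaywrightCrawler(options));
    }
    // Starting the crawlers (and keeping them alive in standby) is elided here.
}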

src/crawlers.ts Outdated
    failedRequestHandler: ({ request }, err) => failedRequestHandlerPlaywright(request, err),
});
// Typeguard to determine if we should use Playwright or Cheerio crawler
const usePlaywrightCrawler = 'browserPoolOptions' in crawlerOptions;
Member

The syntax 'string in object' doesn't typecheck (if you do 'garbage' in crawlerOptions, you will get no type warning, and it will just return false at runtime). This is bad because you rely on browserPoolOptions, which might get removed in the future.

You probably need to use a discriminated union, which means adding type: 'cheerio' | 'playwright' to the PlaywrightCrawlerOptions and CheerioCrawlerOptions. You could make a wrapper type like this:

// Assuming the option types exported by crawlee; the project may define its own wrappers.
import { CheerioCrawler, CheerioCrawlerOptions, PlaywrightCrawler, PlaywrightCrawlerOptions } from 'crawlee';

type CrawlerOptions =
    | { type: 'cheerio'; crawlerOptions: CheerioCrawlerOptions }
    | { type: 'playwright'; crawlerOptions: PlaywrightCrawlerOptions };

const fn = (options: CrawlerOptions): CheerioCrawler | PlaywrightCrawler => {
    const { type, crawlerOptions } = options;

    if (type === 'cheerio') {
        return new CheerioCrawler(crawlerOptions);
    }
    return new PlaywrightCrawler(crawlerOptions);
};

Another way would be to remove a layer of function calls so you don't have to pass this around as a union.
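
For illustration, a hypothetical call through such a factory (the request handler and URL are just examples):

const crawler = fn({
    type: 'cheerio',
    crawlerOptions: {
        // Handler shape follows crawlee's CheerioCrawlingContext.
        requestHandler: async ({ request, $ }) => {
            console.log(`${request.url}: ${$('title').text()}`);
        },
    },
});
await crawler.run(['https://crawlee.dev']);

Note that the destructuring inside fn still narrows correctly: since TypeScript 4.6, checking the destructured type also narrows the sibling crawlerOptions, so each branch typechecks without casts.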

Collaborator Author

Thank you for the advice. I'll rework this to not rely on internal properties of the crawler options.

@matyascimbulka
Collaborator Author

Thank you for the reviews. I have modified the code based on your comments. Regarding the input schema, I implemented the scrapingTool property with playwright and cheerio options available. These options are shown in the UI as Browser and Plain HTML, respectively.

Here is a test run of the standalone mode: https://console.apify.com/view/runs/hE1lzBbG0dmWlDSMD
And here is a test run of the standby mode: https://console.apify.com/view/runs/0JfPIEVGJUCmSnzIo

Member

@metalwarrior665 left a comment

Just a few touches on the schema and we can go. I think there are likely opportunities for refactoring; the managing of inputs/crawlers still seems a bit more complicated than needed, but we should not block this PR on that.

"description": "Choose what scraping tool to use for extracting the target web pages. The Browser tool is more powerful and can handle JavaScript heavy websites. While the Plain HTML tool is about two times faster.",
"editor": "select",
"default": "playwright",
"enum": ["playwright", "cheerio"],
Member

I wanted to propose syncing these with the enumTitles, because if they are different, it creates a mess with debugging and discussion later. But maybe it would be better to expand the enumTitles, e.g. Browser (uses Playwright)?

"editor": "select",
"default": "playwright",
"enum": ["playwright", "cheerio"],
"enumTitles": ["Browser", "Plain HTML"]
Member

I meant Plain HTTP, but honestly I'm not sure what the best term is. WCC uses Raw HTTP, so maybe let's use that too; we have the same/similar users.
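
Combining the two suggestions, the field could end up roughly like this (the enum values mirror the snippet above; only the expanded titles are new, and the exact wording was still being settled):

"enum": ["playwright", "cheerio"],
"enumTitles": ["Browser (uses Playwright)", "Raw HTTP (uses Cheerio)"]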

@@ -1,4 +1,4 @@
-import inputSchema from '../.actor/input_schema.json' assert { type: 'json' };
+import inputSchema from '../.actor/input_schema.json' with { type: 'json' };
Member

One more thing. I see this in the log: (node:21) ExperimentalWarning: Importing JSON modules is an experimental feature and might change at any time

Can you try to suppress this warning? If that doesn't work, then try increasing the Node.js version (but that could rarely break other things, so let's try suppressing it first).
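
One common way to do that, aside from running Node with --no-warnings (which hides all warnings), is to filter the process-level 'warning' event; this is a general Node.js technique, not necessarily what the PR ended up using:

// Remove Node's default warning printer and re-add a filtered one.
process.removeAllListeners('warning');
process.on('warning', (warning) => {
    // Swallow only the JSON-modules ExperimentalWarning; print everything else.
    if (warning.name === 'ExperimentalWarning' && warning.message.includes('JSON modules')) return;
    console.warn(warning.stack ?? warning.message);
});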

@matyascimbulka
Collaborator Author

I have changed the values in the input schema (and the corresponding constants in the code). I have also managed to suppress the experimental warning from Node.

Here is a run of this version: https://console.apify.com/view/runs/ZIHv6dXe1hmlly0aX

Member

@metalwarrior665 left a comment

Great job!

Collaborator

@jirispilka left a comment

Thank you!

I really like it: it adds new functionality, and the code looks better than before 🙇🏻
There was one typo; I added a suggestion.

I've prepared an update for the readme and input_schema.json (reordering fields to move scrapingTool closer to the top) in this branch. I can merge it once you merge this PR.

@matyascimbulka
Collaborator Author

matyascimbulka commented on Mar 7, 2025

Thanks for the feedback. I have fixed the typo. If you're happy with it, we can merge the PR. But I can't do it myself, since I don't have the merge button available.

@jirispilka
Collaborator

> Thanks for the feedback. I have fixed the typo. If you're happy with it, we can merge the PR. But I can't do it myself, since I don't have the merge button available.

@matyascimbulka I thought anyone at Apify could do it; I'm not sure what the correct setting is. I've added you, so you should be able to merge it now.

@matyascimbulka merged commit 7d940f8 into apify:master on Mar 7, 2025
1 check passed
@matyascimbulka
Collaborator Author

I have merged the PR. Do we need to build it in the Console, or is that automated?

@jirispilka
Collaborator

Thank you!

> I have merged the PR. Do we need to build it in the Console, or is that automated?

No, we need to build it in the Console. Let us first merge this PR and then build it. Will you please build it and test it briefly?
