
feat: Add Cheerio content crawler #52

Merged · matyascimbulka merged 13 commits into apify:master on Mar 7, 2025

Conversation

matyascimbulka
Collaborator

Based on #48, this PR implements the Cheerio crawler as a second option for crawling the target web pages. To enable this feature, a useCheerioCrawler input property and query parameter have been added. The default value for these options is false.

The Cheerio crawler works in both standby and normal mode. In standby mode, the Actor will pre-create all 3 crawlers (searchCrawler, playwrightContentCrawler, and cheerioContentCrawler).

@metalwarrior665 @jirispilka I seem to be unable to add anyone for code review. Could you add yourself?
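
For illustration, enabling the new option in standby mode should just be a matter of adding the query parameter to the request, roughly like this (the host and the query field are placeholders, not taken from this PR):

GET https://<actor-standby-host>/search?query=example&useCheerioCrawler=true

and in normal mode via the Actor input, where only useCheerioCrawler is the field introduced by this PR:

{
    "query": "example",
    "useCheerioCrawler": true
}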

Collaborator

@jirispilka left a comment


Thank you! A couple of minor comments below.

I think the major question is whether to introduce crawlerType instead of useCheerioCrawler. I would opt for crawlerType: it will simplify the code and is future-proof.

Comment on lines 139 to 144
"useCheerioCrawler": {
"title": "Use Cheerio Crawler",
"type": "boolean",
"description": "If enabled, the Actor uses the Cheerio Crawler to extract the target web page content.",
"default": false
},
Collaborator

@jirispilka commented on Mar 5, 2025

Should we rename the input attribute useCheerioCrawler?

I would make it similar to WCC: crawlerType: https://apify.com/apify/website-content-crawler/input-schema#crawlerType

This would also make it future-proof, e.g. when we need to add a new one. I think @metalwarrior665 posted about some new browser, specifically for an AI use case (I forgot the name).
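
For instance, a crawlerType field modeled on WCC could look roughly like this (a sketch only; the enum values and titles are illustrative, and the PR ultimately settled on a differently named field):

"crawlerType": {
    "title": "Crawler type",
    "type": "string",
    "editor": "select",
    "enum": ["playwright", "cheerio"],
    "enumTitles": ["Headless browser (Playwright)", "Raw HTTP (Cheerio)"],
    "default": "playwright"
},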

Collaborator Author

I agree, switching to a crawlerType input is a more future-proof and flexible solution.

Member

Agree on the dropdown. Not sure if we should call it Cheerio, since the users should not care what the crawler class is called in Crawlee. I would also not mention "crawler"; we are not even crawling much, it just scrapes a few pages from Google. I would define it as "browser" vs "plain HTTP". Not sure about the field name, maybe something like scrapingTool?

 * Get existing crawler based on crawlerOptions and scraperSettings, if not present -> create new
 */
-export const addPlaywrightCrawlRequest = async (
-    request: RequestOptions<PlaywrightCrawlerUserData>,
+export const addContentCrawlRequest = async (
Collaborator

Thanks! contentCrawler is indeed a way better name

src/main.ts Outdated
Comment on lines 81 to 84

await createAndStartAllCrawlers(
    searchCrawlerOptions,
    contentCrawlerOptions,
);
Collaborator

Suggested change
-await createAndStartAllCrawlers(
-    searchCrawlerOptions,
-    contentCrawlerOptions,
-);
+await createAndStartAllCrawlers(searchCrawlerOptions, contentCrawlerOptions);

Collaborator

This can fit on a single line, right?

Member

We are now experimenting with Prettier, so soon we might see more spreading in the code :/

src/main.ts Outdated
    });
} else {
    log.info('Actor is running in the NORMAL mode.');
    try {
-        await handleSearchNormalMode(input, cheerioCrawlerOptions, playwrightCrawlerOptions, playwrightScraperSettings);
+        await handleSearchNormalMode(input, searchCrawlerOptions, contentCrawlerOptions[0], contentScraperSettings);
Collaborator

You have to dig into the code to understand what’s in the first element of the contentCrawlerOptions[0] array, which isn’t ideal.

A more straightforward approach might be to introduce a crawlerType field. That way, contentCrawlerOptions could be set directly instead of using an array.

Collaborator Author

I agree that this isn't an ideal solution. The reason the processInput function returns an array of contentCrawlerOptions is the pre-creation of all crawlers in standby mode, where I need the options for all of the content crawlers.

To make this more understandable, I could create separate functions to process the input for standalone and standby modes. Both functions would internally use the already existing input-processing function, but they would return the data in a better format.

Member

This is bad; someone reordering the push can break it. I think it would be better to create distinct types for the normal vs standby start, so it is clear that one starts cheerio | playwright while the other starts both.
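
A minimal sketch of such distinct start types, assuming the option types exported by crawlee (the project may define its own wrappers, and all names here are illustrative):

import { CheerioCrawlerOptions, PlaywrightCrawlerOptions } from 'crawlee';

// Normal mode: the search crawler plus exactly one content crawler.
type NormalStartOptions = {
    searchCrawlerOptions: CheerioCrawlerOptions;
    contentCrawlerOptions: CheerioCrawlerOptions | PlaywrightCrawlerOptions;
};

// Standby mode: the search crawler plus both content crawlers, keyed explicitly
// so that reordering a push can no longer break anything.
type StandbyStartOptions = {
    searchCrawlerOptions: CheerioCrawlerOptions;
    contentCrawlerOptions: {
        cheerio: CheerioCrawlerOptions;
        playwright: PlaywrightCrawlerOptions;
    };
};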

src/utils.ts Outdated
@@ -38,16 +38,16 @@ export function randomId() {
 * The maxResults parameter is passed to the UserData object, when the request is handled it is used to limit
 * the number of search results without the created overhead.
 *
-* Also add the playwrightCrawlerKey to the UserData object to be able to identify the playwright crawler should
+* Also add the contentCrawlerKey to the UserData object to be able to identify the conten crawler should
Collaborator

Suggested change
-* Also add the contentCrawlerKey to the UserData object to be able to identify the conten crawler should
+* Also add the contentCrawlerKey to the UserData object to be able to identify the content crawler should


async function handleContent(
    $: CheerioCrawlingContext['$'],
    crawlerName: 'playwright' | 'cheerio',
Collaborator

If we have crawlerType defined, we can also use it here

src/search.ts Outdated
-    playwrightCrawlerOptions,
+const { contentCrawlerKey } = await createAndStartCrawlers(
+    searchCrawlerOptions,
+    contentCrawlerOptions[0],
Collaborator

The same here: when reading the code, one is not sure what is at element [0] without more context.

src/types.ts Outdated
| 'cheerio-request-end'
| 'cheerio-request-handler-start'
| 'cheerio-before-response-send'
| 'cheerio-failed-request'
Collaborator

cheerio-failed-request - duplicate declaration

Member

@metalwarrior665 left a comment

Thanks, just minor things. Could you also share some test runs?


"useCheerioCrawler": {
"title": "Use Cheerio Crawler",
"type": "boolean",
"description": "If enabled, the Actor uses the Cheerio Crawler to extract the target web page content.",
Member

The description should explain why to choose each option and give some performance examples (e.g., how long it takes with a browser vs plain HTTP).

Collaborator

Good point about the browser vs plain HTTP, and scrapingTool sounds good too.

src/crawlers.ts Outdated
* Creates and starts a Google search crawler and content crawlers for all provided configurations.
* A crawler won't be created if it already exists.
*/
export async function createAndStartAllCrawlers(
Member

Having one function called createAndStartCrawlers and another called createAndStartAllCrawlers is really confusing. I see the only difference is that the first starts only one content crawler while the other starts both? And that is probably because of how this works differently in the normal vs standby start? I would either get rid of this function and just inline the code, or make it a single function and pass in parameters to choose which crawlers to start.

Collaborator Author

Yes, the difference here is standalone vs standby mode. I'll remove the createAndStartAllCrawlers function and figure out how to use createAndStartCrawlers for both use cases.
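
A sketch of what that unified function might look like (the registry and all names are hypothetical, and the actual start/keep-alive logic is elided):

import { CheerioCrawler, CheerioCrawlerOptions, PlaywrightCrawler, PlaywrightCrawlerOptions } from 'crawlee';

type ContentCrawlerConfig =
    | { type: 'cheerio'; options: CheerioCrawlerOptions }
    | { type: 'playwright'; options: PlaywrightCrawlerOptions };

// Hypothetical registry so existing crawlers are reused rather than re-created.
const crawlers = new Map<string, CheerioCrawler | PlaywrightCrawler>();

// Normal mode passes a single content-crawler config; standby mode passes both.
async function createAndStartCrawlers(
    searchCrawlerOptions: CheerioCrawlerOptions,
    contentCrawlerConfigs: ContentCrawlerConfig[],
): Promise<void> {
    if (!crawlers.has('search')) {
        crawlers.set('search', new CheerioCrawler(searchCrawlerOptions));
    }
    for (const { type, options } of contentCrawlerConfigs) {
        if (crawlers.has(type)) continue; // a crawler won't be created if it already exists
        crawlers.set(type, type === 'cheerio'
            ? new CheerioCrawler(options)
            : new PlaywrightCrawler(options));
    }
    // Starting the crawlers (and keeping them alive in standby) is elided here.
}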

src/crawlers.ts Outdated
    failedRequestHandler: ({ request }, err) => failedRequestHandlerPlaywright(request, err),
});
// Typeguard to determine if we should use Playwright or Cheerio crawler
const usePlaywrightCrawler = 'browserPoolOptions' in crawlerOptions;
Member

The syntax 'string in object' doesn't typecheck (if you do 'garbage' in crawlerOptions, you will get no type warning, and it will just return false at runtime). This is bad because you rely on browserPoolOptions, which might get removed in the future.

You probably need to use a discriminated union, which means adding type: 'cheerio' | 'playwright' to the PlaywrightCrawlerOptions and CheerioCrawlerOptions. You could make a wrapper type like this:

// Assuming the option types exported by crawlee; the project may define its own wrappers.
import { CheerioCrawler, CheerioCrawlerOptions, PlaywrightCrawler, PlaywrightCrawlerOptions } from 'crawlee';

type CrawlerOptions =
    | { type: 'cheerio'; crawlerOptions: CheerioCrawlerOptions }
    | { type: 'playwright'; crawlerOptions: PlaywrightCrawlerOptions };

const fn = (options: CrawlerOptions): CheerioCrawler | PlaywrightCrawler => {
    const { type, crawlerOptions } = options;

    if (type === 'cheerio') {
        return new CheerioCrawler(crawlerOptions);
    }
    return new PlaywrightCrawler(crawlerOptions);
};

Another way would be to remove a layer of function calls so you don't have to pass this around as a union.
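
For illustration, a hypothetical call through such a factory (the request handler and URL are just examples):

const crawler = fn({
    type: 'cheerio',
    crawlerOptions: {
        // Handler shape follows crawlee's CheerioCrawlingContext.
        requestHandler: async ({ request, $ }) => {
            console.log(`${request.url}: ${$('title').text()}`);
        },
    },
});
await crawler.run(['https://crawlee.dev']);

Note that the destructuring inside fn still narrows correctly: since TypeScript 4.6, checking the destructured type also narrows the sibling crawlerOptions, so each branch typechecks without casts.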

Collaborator Author

Thank you for the advice. I'll rework this to not rely on internal properties of the crawler options.

@matyascimbulka
Collaborator Author

Thank you for the reviews. I have modified the code based on your comments. Regarding the input schema, I implemented the scrapingTool property with playwright and cheerio options available. These options are shown in the UI as Browser and Plain HTML, respectively.

Here is a test run of the standalone mode: https://console.apify.com/view/runs/hE1lzBbG0dmWlDSMD
And here is a test run of the standby mode: https://console.apify.com/view/runs/0JfPIEVGJUCmSnzIo

Member

@metalwarrior665 left a comment

Just a few touches on the schema and we can go. I think there are likely opportunities for refactoring; the managing of inputs/crawlers still seems a bit more complicated than needed, but we should not block this PR on that.

"description": "Choose what scraping tool to use for extracting the target web pages. The Browser tool is more powerful and can handle JavaScript heavy websites. While the Plain HTML tool is about two times faster.",
"editor": "select",
"default": "playwright",
"enum": ["playwright", "cheerio"],
Member

I wanted to propose syncing these with the enumTitles, because if they are different, it creates a mess with debugging and discussion later. But maybe it would be better to expand the enumTitles, e.g. Browser (uses Playwright)?

"editor": "select",
"default": "playwright",
"enum": ["playwright", "cheerio"],
"enumTitles": ["Browser", "Plain HTML"]
Member

I meant Plain HTTP, but honestly I'm not sure what the best term is. WCC uses Raw HTTP, so maybe let's use that too; we have the same/similar users.
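
Combining the two suggestions, the field could end up roughly like this (the enum values mirror the snippet above; only the expanded titles are new, and the exact wording was still being settled):

"enum": ["playwright", "cheerio"],
"enumTitles": ["Browser (uses Playwright)", "Raw HTTP (uses Cheerio)"]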

@@ -1,4 +1,4 @@
-import inputSchema from '../.actor/input_schema.json' assert { type: 'json' };
+import inputSchema from '../.actor/input_schema.json' with { type: 'json' };
Member

One more thing. I see this in the log: (node:21) ExperimentalWarning: Importing JSON modules is an experimental feature and might change at any time

Can you try to suppress this warning? If that doesn't work, then try increasing the Node.js version (but that could rarely break other things, so let's try suppressing it first).
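
One common way to do that, aside from running Node with --no-warnings (which hides all warnings), is to filter the process-level 'warning' event; this is a general Node.js technique, not necessarily what the PR ended up using:

// Remove Node's default warning printer and re-add a filtered one.
process.removeAllListeners('warning');
process.on('warning', (warning) => {
    // Swallow only the JSON-modules ExperimentalWarning; print everything else.
    if (warning.name === 'ExperimentalWarning' && warning.message.includes('JSON modules')) return;
    console.warn(warning.stack ?? warning.message);
});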

@matyascimbulka
Collaborator Author

I have changed the values in the input schema (and the corresponding constants in the code). I have also managed to suppress the experimental warning from Node.

Here is a run of this version: https://console.apify.com/view/runs/ZIHv6dXe1hmlly0aX

Member

@metalwarrior665 left a comment

Great job!

Collaborator

@jirispilka left a comment

Thank you!

I really like it: it adds new functionality, and the code looks better than before 🙇🏻
There was one typo; I added a suggestion.

I've prepared an update for the readme and input_schema.json (reordering fields to move scrapingTool closer to the top) in this branch. I can merge it once you merge this PR.

@matyascimbulka
Collaborator Author

matyascimbulka commented on Mar 7, 2025

Thanks for the feedback. I have fixed the typo. If you're happy with it, we can merge the PR. But I can't do it myself, since I don't have the merge button available.

@jirispilka
Collaborator

> Thanks for the feedback. I have fixed the typo. If you're happy with it, we can merge the PR. But I can't do it myself, since I don't have the merge button available.

@matyascimbulka I thought anyone at Apify could do it; I'm not sure what the correct setting is. I've added you, so you should be able to merge it now.

@matyascimbulka merged commit 7d940f8 into apify:master on Mar 7, 2025
1 check passed
@matyascimbulka
Collaborator Author

I have merged the PR. Do we need to build it in the Console, or is that automated?

@jirispilka
Collaborator

Thank you!

> I have merged the PR. Do we need to build it in the Console, or is that automated?

No, we need to build it in the Console. Let us first merge this PR and then build it. Will you please build it and test it briefly?
