feat: Add Cheerio content crawler #52
Conversation
This change simplifies the pre-creation of all crawlers during startup in STANDBY mode.
Thank you! A couple of minor comments below.
I think the major question is whether to introduce crawlerType instead of useCheerioCrawler. I would opt for crawlerType; it simplifies the code and is future-proof.
.actor/input_schema.json
Outdated
"useCheerioCrawler": {
    "title": "Use Cheerio Crawler",
    "type": "boolean",
    "description": "If enabled, the Actor uses the Cheerio Crawler to extract the target web page content.",
    "default": false
},
Should we rename the input attribute useCheerioCrawler? I would make it similar to WCC's crawlerType: https://apify.com/apify/website-content-crawler/input-schema#crawlerType
This would also make it future-proof, e.g. when we need to add a new one. I think @metalwarrior665 posted about some new browser, specifically for the AI use case (I forgot the name).
I agree, switching to a crawlerType input is a more future-proof and flexible solution.
Agree on the dropdown. Not sure if we should call it Cheerio, since users should not care what the crawler class is called in Crawlee. I would also not mention "crawler"; we are not even crawling much, it just scrapes a few pages from Google. I would define it as "browser" vs "plain HTTP". Not sure about the field name, maybe something like scrapingTool?
 * Get existing crawler based on crawlerOptions and scraperSettings, if not present -> create new
 */
-export const addPlaywrightCrawlRequest = async (
+export const addContentCrawlRequest = async (
     request: RequestOptions<PlaywrightCrawlerUserData>,
Thanks! contentCrawler is indeed a much better name.
src/main.ts
Outdated
await createAndStartAllCrawlers(
    searchCrawlerOptions,
    contentCrawlerOptions,
);
-await createAndStartAllCrawlers(
-    searchCrawlerOptions,
-    contentCrawlerOptions,
-);
+await createAndStartAllCrawlers(searchCrawlerOptions, contentCrawlerOptions);
This can fit on a single line, right?
We are now experimenting with Prettier so soon we might see more spreading in the code :/
src/main.ts
Outdated
});
} else {
    log.info('Actor is running in the NORMAL mode.');
    try {
-        await handleSearchNormalMode(input, cheerioCrawlerOptions, playwrightCrawlerOptions, playwrightScraperSettings);
+        await handleSearchNormalMode(input, searchCrawlerOptions, contentCrawlerOptions[0], contentScraperSettings);
You have to dig into the code to understand what's in contentCrawlerOptions[0], which isn't ideal.
A more straightforward approach might be to introduce a crawlerType field. That way, contentCrawlerOptions could be set directly instead of using an array.
I agree that this isn't an ideal solution. The reason why the processInput function returns an array of contentCrawlerOptions is the pre-creation of all browsers, where I need options for all of the content crawlers.
To make this more understandable I could create separate functions to process the input for the standalone and standby modes. Both functions would internally use the already existing input-processing function, but they would return the data in a better format.
This is fragile: someone reordering the push calls could break it. I think it would be better to create distinct types for the normal vs standby start, so it is clear that one starts cheerio | playwright while the other starts both.
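A minimal sketch of what such distinct start types could look like (all names here are invented for illustration, not taken from the PR):

```typescript
// Hypothetical sketch: separate configs for the normal vs standby start, so the
// compiler makes the difference explicit instead of relying on array order.
type ContentCrawlerOptions = { proxyConfiguration?: object };

// Normal mode starts exactly one content crawler.
type NormalStart = {
    mode: 'normal';
    contentCrawlerOptions: ContentCrawlerOptions;
};

// Standby mode pre-creates one content crawler per configuration.
type StandbyStart = {
    mode: 'standby';
    contentCrawlerOptions: ContentCrawlerOptions[];
};

type StartConfig = NormalStart | StandbyStart;

// Narrowing on `mode` removes the need to index into an array by position.
function crawlersToStart(config: StartConfig): number {
    return config.mode === 'normal' ? 1 : config.contentCrawlerOptions.length;
}
```

With this shape, reordering any push calls cannot silently change which crawler the normal mode starts.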
src/utils.ts
Outdated
@@ -38,16 +38,16 @@ export function randomId() {
  * The maxResults parameter is passed to the UserData object, when the request is handled it is used to limit
  * the number of search results without the created overhead.
  *
- * Also add the playwrightCrawlerKey to the UserData object to be able to identify the playwright crawler should
+ * Also add the contentCrawlerKey to the UserData object to be able to identify the conten crawler should
- * Also add the contentCrawlerKey to the UserData object to be able to identify the conten crawler should
+ * Also add the contentCrawlerKey to the UserData object to be able to identify the content crawler should
src/request-handler.ts
Outdated
|
async function handleContent(
    $: CheerioCrawlingContext['$'],
    crawlerName: 'playwright' | 'cheerio',
If we have crawlerType defined, we can also use it here.
src/search.ts
Outdated
-    playwrightCrawlerOptions,
+const { contentCrawlerKey } = await createAndStartCrawlers(
+    searchCrawlerOptions,
+    contentCrawlerOptions[0],
The same here: when reading the code, one is not sure what is at element [0] without context.
src/types.ts
Outdated
    | 'cheerio-request-end'
    | 'cheerio-request-handler-start'
    | 'cheerio-before-response-send'
    | 'cheerio-failed-request'
cheerio-failed-request: duplicate declaration.
Thanks, just minor things. Could you also share some test runs?
.actor/input_schema.json
Outdated
"useCheerioCrawler": {
    "title": "Use Cheerio Crawler",
    "type": "boolean",
    "description": "If enabled, the Actor uses the Cheerio Crawler to extract the target web page content.",
The description should explain why to choose it and give some performance examples (how long it takes with browser vs plain HTTP)
Good point with the browser vs plain HTTP, and scrapingTool sounds good too.
src/crawlers.ts
Outdated
 * Creates and starts a Google search crawler and content crawlers for all provided configurations.
 * A crawler won't be created if it already exists.
 */
export async function createAndStartAllCrawlers(
Having one function called createAndStartCrawlers and another called createAndStartAllCrawlers is really confusing. I see the only difference is that the first starts only one content crawler while the other starts both? And that is probably because this works differently in the normal vs standby start? I would either get rid of this function and inline the code, or make it a single function and pass in a parameter to choose which crawlers to start.
Yes, the difference here is standalone vs standby mode. I'll remove the createAndStartAllCrawlers function and figure out how to use createAndStartCrawlers for both use cases.
src/crawlers.ts
Outdated
    failedRequestHandler: ({ request }, err) => failedRequestHandlerPlaywright(request, err),
});
// Typeguard to determine if we should use Playwright or Cheerio crawler
const usePlaywrightCrawler = 'browserPoolOptions' in crawlerOptions;
The syntax 'string in object' doesn't typecheck: if you do 'garbage' in crawlerOptions, you get no type warning, and it returns false at runtime. This is bad because you rely on browserPoolOptions, which might get removed in the future.
You probably need a discriminated union, which means adding type: 'cheerio' | 'playwright' to PlaywrightCrawlerOptions and CheerioCrawlerOptions. You could make a wrapper type like this:
type CrawlerOptions =
    | { type: 'cheerio'; crawlerOptions: CheerioCrawlerOptions }
    | { type: 'playwright'; crawlerOptions: PlaywrightCrawlerOptions };

const fn = (options: CrawlerOptions): CheerioCrawler | PlaywrightCrawler => {
    const { type, crawlerOptions } = options;
    if (type === 'cheerio') {
        return new CheerioCrawler(crawlerOptions);
    }
    return new PlaywrightCrawler(crawlerOptions);
};
Another way would be to remove some layer of function call so you don't have to pass this around as union.
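As a runnable illustration of that narrowing, here is a self-contained sketch with stub classes and option types standing in for the real Crawlee ones (everything below is a placeholder, not the actual Crawlee API):

```typescript
// Stubs standing in for Crawlee's real classes; placeholders for illustration only.
type CheerioCrawlerOptions = { maxRequestRetries?: number };
type PlaywrightCrawlerOptions = { browserPoolOptions?: object };

class CheerioCrawler {
    constructor(public options: CheerioCrawlerOptions) {}
}
class PlaywrightCrawler {
    constructor(public options: PlaywrightCrawlerOptions) {}
}

// The `type` tag discriminates the union, so the compiler knows which options
// shape goes with which crawler; no `'prop' in obj` checks are needed.
type CrawlerOptions =
    | { type: 'cheerio'; crawlerOptions: CheerioCrawlerOptions }
    | { type: 'playwright'; crawlerOptions: PlaywrightCrawlerOptions };

const createCrawler = (options: CrawlerOptions): CheerioCrawler | PlaywrightCrawler => {
    const { type, crawlerOptions } = options;
    if (type === 'cheerio') {
        return new CheerioCrawler(crawlerOptions);
    }
    return new PlaywrightCrawler(crawlerOptions);
};

const crawler = createCrawler({ type: 'cheerio', crawlerOptions: { maxRequestRetries: 3 } });
```

Passing a mismatched pair, such as `{ type: 'cheerio', crawlerOptions: { browserPoolOptions: {} } }`, then fails at compile time instead of silently misbehaving at runtime.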
Thank you for the advice. I'll rework this to not rely on internal properties of the crawler options.
Thank you for the reviews. I have modified the code based on your comments. In regards to the input schema, I implemented the suggested scrapingTool select. Here is a test run of the standalone mode: https://console.apify.com/view/runs/hE1lzBbG0dmWlDSMD
Just a few touches on the schema and we can go. I think there are likely opportunities for refactoring; the managing of inputs/crawlers still seems a bit more complicated than needed, but we should not block this PR.
.actor/input_schema.json
Outdated
"description": "Choose what scraping tool to use for extracting the target web pages. The Browser tool is more powerful and can handle JavaScript heavy websites. While the Plain HTML tool is about two times faster.",
"editor": "select",
"default": "playwright",
"enum": ["playwright", "cheerio"],
I wanted to propose syncing these with the enumTitles, because if they are different it creates a mess with debugging and discussion later. But maybe it would be better to expand the enumTitles, as in Browser (uses Playwright)?
.actor/input_schema.json
Outdated
"editor": "select",
"default": "playwright",
"enum": ["playwright", "cheerio"],
"enumTitles": ["Browser", "Plain HTML"]
I meant Plain HTTP, but honestly I'm not sure what the best term is. WCC uses Raw HTTP, so maybe let's use that too; we have the same/similar users.
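Combining the suggestions above (sync the values with the titles, expand the titles, use "Raw HTTP"), the field might end up looking roughly like this. The exact enum values, default, and description wording here are guesses for illustration, not the final schema:

```json
"scrapingTool": {
    "title": "Scraping tool",
    "type": "string",
    "description": "Choose what scraping tool to use for extracting the target web pages. The browser tool is more powerful and can handle JavaScript-heavy websites, while the raw HTTP tool is about two times faster.",
    "editor": "select",
    "default": "browser-playwright",
    "enum": ["browser-playwright", "raw-http"],
    "enumTitles": ["Browser (uses Playwright)", "Raw HTTP"]
}
```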
@@ -1,4 +1,4 @@
-import inputSchema from '../.actor/input_schema.json' assert { type: 'json' };
+import inputSchema from '../.actor/input_schema.json' with { type: 'json' };
One more thing. I see this in the log: (node:21) ExperimentalWarning: Importing JSON modules is an experimental feature and might change at any time
Can you try to suppress this warning? If that does not work, then try increasing the Node.js version (but that could, rarely, break other things, so let's try suppressing first).
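One way to suppress it (a sketch, assuming no other code depends on Node's default 'warning' listener; this is not necessarily how the PR ended up doing it) is to replace the default handler with one that drops ExperimentalWarning:

```typescript
// Sketch: swallow only ExperimentalWarning, keep printing everything else.
// Assumes the default 'warning' listener is the only one attached so far.
process.removeAllListeners('warning');
process.on('warning', (warning: Error) => {
    if (warning.name === 'ExperimentalWarning') return;
    console.warn(`${warning.name}: ${warning.message}`);
});
```

Alternatively, starting the process with the node --no-warnings flag silences all process warnings, though that is a blunter instrument.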
I have changed the values in the input schema (and the corresponding constants in code). Also, I have managed to suppress the experimental warning from Node. Here is a run of this version: https://console.apify.com/view/runs/ZIHv6dXe1hmlly0aX
Great job!
Thank you!
I really like it; it adds new functionality and the code looks better than before 🙇🏻
There was one typo; I added a suggestion.
I've prepared an update for the readme and input_schema.json (reordering fields to move scrapingTool closer to the top) in this branch. I can merge it once you merge this.
Co-authored-by: Jiří Spilka <[email protected]>
Thanks for the feedback. I have fixed the typo. If you're happy with it we can merge the PR. But I can't do it, since I don't have the button available.
@matyascimbulka I thought anyone at Apify could do it. I'm not sure what the correct setting is. I added you, so you should be able to merge it now.
I have merged the PR. Do we need to build it in the console, or is it automated?
Thank you
No, we need to build it in the console. Let's first merge this PR and then build it. Will you please build it and test it briefly?
Based on #48, this PR implements the Cheerio crawler as a second option for crawling target web pages. To enable this feature, a useCheerioCrawler input property and query parameter have been added. The default value for these options is false. The Cheerio crawler works in both standby and normal mode. In standby mode the Actor will pre-create all 3 crawlers (searchCrawler, playwrightContentCrawler and cheerioContentCrawler).
@metalwarrior665 @jirispilka I seem to be unable to add anyone for code review. Could you add yourself?