feat: Add Cheerio content crawler #52
Merged: matyascimbulka merged 13 commits into apify:master from matyascimbulka:feat/cheerio-req-handler on Mar 7, 2025.
Commits (13, all by matyascimbulka):

- 22a5cc1 feat: Add cheerio crawler as option for content crawling
- 3d4f0b6 refactor: Consolidate req handlers into one file
- ca580d6 refactor: Rename options object for search crawler
- 71ca906 feat: Process input return array of content crawler options
- e7d3150 feat: Add function to start all crawlers
- 5ba88bd feat: Update README
- 63d100a fix: Fix run issues on platform
- c5c09aa feat: Add ContentCrawlerOptions for better type checking
- e2cf832 feat: Change input flag to select and remove `createAndStartCrawlers`
- 1c9f608 refactor: Add ContentCrawlerType enum
- ab242e7 feat: Update README
- 608c0e6 feat: Modify inputs and supress node warnings
- eb5260c refactor: Fix typo in main.ts
```diff
@@ -12,10 +12,11 @@ import {
     RequestOptions,
 } from 'crawlee';
 
+import { ContentCrawlerTypes } from './const.js';
 import { scrapeOrganicResults } from './google-search/google-extractors-urls.js';
-import { failedRequestHandlerPlaywright, requestHandlerPlaywright } from './playwright-req-handler.js';
+import { failedRequestHandler, requestHandlerCheerio, requestHandlerPlaywright } from './request-handler.js';
 import { addEmptyResultToResponse, sendResponseError } from './responses.js';
-import type { PlaywrightCrawlerUserData, SearchCrawlerUserData } from './types.js';
+import type { ContentCrawlerOptions, ContentCrawlerUserData, SearchCrawlerUserData } from './types.js';
 import { addTimeMeasureEvent, createRequest } from './utils.js';
 
 const crawlers = new Map<string, CheerioCrawler | PlaywrightCrawler>();
```
```diff
@@ -25,42 +26,22 @@ export function getCrawlerKey(crawlerOptions: CheerioCrawlerOptions | Playwright
     return JSON.stringify(crawlerOptions);
 }
 
-/**
- * Creates and starts a Google search crawler and Playwright content crawler with the provided configurations.
- * A crawler won't be created if it already exists.
- */
-export async function createAndStartCrawlers(
-    cheerioCrawlerOptions: CheerioCrawlerOptions,
-    playwrightCrawlerOptions: PlaywrightCrawlerOptions,
-    startCrawlers: boolean = true,
-) {
-    const { crawler: searchCrawler } = await createAndStartSearchCrawler(
-        cheerioCrawlerOptions,
-        startCrawlers,
-    );
-    const { key: playwrightCrawlerKey, crawler: playwrightCrawler } = await createAndStartCrawlerPlaywright(
-        playwrightCrawlerOptions,
-        startCrawlers,
-    );
-    return { searchCrawler, playwrightCrawler, playwrightCrawlerKey };
-}
-
 /**
  * Creates and starts a Google search crawler with the provided configuration.
  * A crawler won't be created if it already exists.
  */
-async function createAndStartSearchCrawler(
-    cheerioCrawlerOptions: CheerioCrawlerOptions,
+export async function createAndStartSearchCrawler(
+    searchCrawlerOptions: CheerioCrawlerOptions,
     startCrawler: boolean = true,
 ) {
-    const key = getCrawlerKey(cheerioCrawlerOptions);
+    const key = getCrawlerKey(searchCrawlerOptions);
     if (crawlers.has(key)) {
         return { key, crawler: crawlers.get(key) };
     }
 
     log.info(`Creating new cheerio crawler with key ${key}`);
     const crawler = new CheerioCrawler({
-        ...(cheerioCrawlerOptions as CheerioCrawlerOptions),
+        ...(searchCrawlerOptions as CheerioCrawlerOptions),
         requestQueue: await RequestQueue.open(key, { storageClient: client }),
         requestHandler: async ({ request, $: _$ }: CheerioCrawlingContext<SearchCrawlerUserData>) => {
             // NOTE: we need to cast this to fix `cheerio` type errors
```
```diff
@@ -92,10 +73,10 @@ async function createAndStartSearchCrawler(
                 request.userData.query,
                 result,
                 responseId,
-                request.userData.playwrightScraperSettings!,
+                request.userData.contentScraperSettings!,
                 request.userData.timeMeasures!,
             );
-            await addPlaywrightCrawlRequest(r, responseId, request.userData.playwrightCrawlerKey!);
+            await addContentCrawlRequest(r, responseId, request.userData.contentCrawlerKey!);
         }
     },
     failedRequestHandler: async ({ request }, err) => {
```
```diff
@@ -118,50 +99,78 @@ async function createAndStartSearchCrawler(
 }
 
 /**
- * Creates and starts a Playwright content crawler with the provided configuration.
+ * Creates and starts a content crawler with the provided configuration.
+ * Either Playwright or Cheerio crawler will be created based on the provided crawler options.
  * A crawler won't be created if it already exists.
  */
-async function createAndStartCrawlerPlaywright(
-    crawlerOptions: PlaywrightCrawlerOptions,
+export async function createAndStartContentCrawler(
+    contentCrawlerOptions: ContentCrawlerOptions,
     startCrawler: boolean = true,
 ) {
+    const { type: crawlerType, crawlerOptions } = contentCrawlerOptions;
+
     const key = getCrawlerKey(crawlerOptions);
     if (crawlers.has(key)) {
         return { key, crawler: crawlers.get(key) };
     }
 
-    log.info(`Creating new playwright crawler with key ${key}`);
-    const crawler = new PlaywrightCrawler({
-        ...(crawlerOptions as PlaywrightCrawlerOptions),
-        keepAlive: crawlerOptions.keepAlive,
-        requestQueue: await RequestQueue.open(key, { storageClient: client }),
-        requestHandler: async (context: PlaywrightCrawlingContext) => {
-            await requestHandlerPlaywright(context as unknown as PlaywrightCrawlingContext<PlaywrightCrawlerUserData>);
-        },
-        failedRequestHandler: ({ request }, err) => failedRequestHandlerPlaywright(request, err),
-    });
+    const crawler = crawlerType === 'playwright'
+        ? await createPlaywrightContentCrawler(crawlerOptions, key)
+        : await createCheerioContentCrawler(crawlerOptions, key);
 
     if (startCrawler) {
         crawler.run().then(
-            () => log.warning(`Crawler playwright has finished`),
+            () => log.warning(`Crawler ${crawlerType} has finished`),
             () => {},
         );
-        log.info('Crawler playwright has started 💪🏼');
+        log.info(`Crawler ${crawlerType} has started 💪🏼`);
     }
     crawlers.set(key, crawler);
     log.info(`Number of crawlers ${crawlers.size}`);
     return { key, crawler };
 }
 
+async function createPlaywrightContentCrawler(
+    crawlerOptions: PlaywrightCrawlerOptions,
+    key: string,
+): Promise<PlaywrightCrawler> {
+    log.info(`Creating new playwright crawler with key ${key}`);
+    return new PlaywrightCrawler({
+        ...crawlerOptions,
+        keepAlive: crawlerOptions.keepAlive,
+        requestQueue: await RequestQueue.open(key, { storageClient: client }),
+        requestHandler: async (context) => {
+            await requestHandlerPlaywright(context as unknown as PlaywrightCrawlingContext<ContentCrawlerUserData>);
+        },
+        failedRequestHandler: ({ request }, err) => failedRequestHandler(request, err, ContentCrawlerTypes.PLAYWRIGHT),
+    });
+}
+
+async function createCheerioContentCrawler(
+    crawlerOptions: CheerioCrawlerOptions,
+    key: string,
+): Promise<CheerioCrawler> {
+    log.info(`Creating new cheerio crawler with key ${key}`);
+    return new CheerioCrawler({
+        ...crawlerOptions,
+        keepAlive: crawlerOptions.keepAlive,
+        requestQueue: await RequestQueue.open(key, { storageClient: client }),
+        requestHandler: async (context) => {
+            await requestHandlerCheerio(context as unknown as CheerioCrawlingContext<ContentCrawlerUserData>);
+        },
+        failedRequestHandler: ({ request }, err) => failedRequestHandler(request, err, ContentCrawlerTypes.CHEERIO),
+    });
+}
+
 /**
  * Adds a search request to the Google search crawler.
  * Create a response for the request and set the desired number of results (maxResults).
  */
 export const addSearchRequest = async (
-    request: RequestOptions<PlaywrightCrawlerUserData>,
-    cheerioCrawlerOptions: CheerioCrawlerOptions,
+    request: RequestOptions<ContentCrawlerUserData>,
+    searchCrawlerOptions: CheerioCrawlerOptions,
 ) => {
-    const key = getCrawlerKey(cheerioCrawlerOptions);
+    const key = getCrawlerKey(searchCrawlerOptions);
     const crawler = crawlers.get(key);
 
     if (!crawler) {
```
```diff
@@ -174,26 +183,28 @@ export const addSearchRequest = async (
 };
 
 /**
- * Adds a content crawl request to the Playwright content crawler.
+ * Adds a content crawl request to selected content crawler.
  * Get existing crawler based on crawlerOptions and scraperSettings, if not present -> create new
  */
-export const addPlaywrightCrawlRequest = async (
-    request: RequestOptions<PlaywrightCrawlerUserData>,
+export const addContentCrawlRequest = async (
+    request: RequestOptions<ContentCrawlerUserData>,
     responseId: string,
-    playwrightCrawlerKey: string,
+    contentCrawlerKey: string,
 ) => {
-    const crawler = crawlers.get(playwrightCrawlerKey);
+    const crawler = crawlers.get(contentCrawlerKey);
+    const name = crawler instanceof PlaywrightCrawler ? 'playwright' : 'cheerio';
 
     if (!crawler) {
-        log.error(`Playwright crawler not found: key ${playwrightCrawlerKey}`);
+        log.error(`Content crawler not found: key ${contentCrawlerKey}`);
         return;
     }
     try {
         await crawler.requestQueue!.addRequest(request);
         // create an empty result in search request response
         // do not use request.uniqueKey as responseId as it is not id of a search request
         addEmptyResultToResponse(responseId, request);
-        log.info(`Added request to the playwright-content-crawler: ${request.url}`);
+        log.info(`Added request to the ${name}-content-crawler: ${request.url}`);
     } catch (err) {
-        log.error(`Error adding request to playwright-content-crawler: ${request.url}, error: ${err}`);
+        log.error(`Error adding request to ${name}-content-crawler: ${request.url}, error: ${err}`);
     }
 };
```
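The core of the PR is `createAndStartContentCrawler`: a `ContentCrawlerOptions` object pairs a crawler type with its options, and the factory branches on that `type` discriminator. A simplified sketch of the dispatch shape (the stub classes below are illustrative stand-ins for the real `PlaywrightCrawler`/`CheerioCrawler`, which need a browser or HTTP stack to run):

```typescript
// Type-discriminated crawler factory, mirroring the ternary in the diff's
// createAndStartContentCrawler. Stub classes replace the real crawlee classes.
enum ContentCrawlerTypes {
    PLAYWRIGHT = 'playwright',
    CHEERIO = 'cheerio',
}

interface ContentCrawlerOptions {
    type: ContentCrawlerTypes;
    crawlerOptions: { keepAlive?: boolean };
}

class StubPlaywrightCrawler {
    readonly kind = ContentCrawlerTypes.PLAYWRIGHT;
}
class StubCheerioCrawler {
    readonly kind = ContentCrawlerTypes.CHEERIO;
}

function createContentCrawler(
    { type }: ContentCrawlerOptions,
): StubPlaywrightCrawler | StubCheerioCrawler {
    // Same shape as the diff: pick the implementation from the discriminator.
    return type === ContentCrawlerTypes.PLAYWRIGHT
        ? new StubPlaywrightCrawler()
        : new StubCheerioCrawler();
}

const crawler = createContentCrawler({
    type: ContentCrawlerTypes.CHEERIO,
    crawlerOptions: { keepAlive: true },
});
console.log(crawler.kind); // "cheerio"
```

Bundling the type with the options (rather than passing a boolean flag) is what lets input processing return an array of `ContentCrawlerOptions` and start each crawler uniformly, as the earlier commits in this PR do.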
Review comment:

One more thing: I see this in the log:

```
(node:21) ExperimentalWarning: Importing JSON modules is an experimental feature and might change at any time
```

Can you try to suppress this warning? If that doesn't work, then try increasing the Node.js version (but that could rarely break other things, so let's try suppressing first).
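One common way to suppress only this warning (a sketch, not necessarily the fix the PR shipped in 608c0e6): Node routes process warnings through `process.emit('warning', ...)`, so wrapping `process.emit` can drop `ExperimentalWarning` events while other warnings still print. The blunt alternative, running with `--no-warnings`, hides every warning.

```typescript
// Swallow only ExperimentalWarning; pass everything else through untouched.
const originalEmit = process.emit.bind(process) as (...args: unknown[]) => boolean;

(process as unknown as { emit: (...args: unknown[]) => boolean }).emit = (
    event: unknown,
    ...args: unknown[]
): boolean => {
    const warning = args[0];
    if (event === 'warning' && warning instanceof Error && warning.name === 'ExperimentalWarning') {
        return false; // drop only experimental warnings
    }
    return originalEmit(event, ...args);
};

// With the filter installed, this no longer prints a warning banner:
process.emitWarning('Importing JSON modules is an experimental feature', 'ExperimentalWarning');
```

This must run before the import that triggers the warning; otherwise bumping the Node.js version or setting `NODE_OPTIONS=--no-warnings` are the usual fallbacks, with the trade-offs the reviewer notes.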