Commit

docs(academy-puppeteer): clarify disadvantages of browsers and unified cheerio parsing (#1442)

Co-authored-by: Honza Javorek <[email protected]>
metalwarrior665 and honzajavorek authored Feb 5, 2025
1 parent f62fbb5 commit 43bafa9
Showing 2 changed files with 13 additions and 1 deletion.
@@ -19,6 +19,12 @@ Now that we know how to execute scripts on a page, we're ready to learn a bit ab
1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`.
2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio)

:::tip Crawlee and parsing with Cheerio

If you are using Crawlee, we highly recommend the [parseWithCheerio](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#parseWithCheerio) function for unified data extraction syntax. This way, switching between browser and plain HTTP scraping is a breeze.

:::

## Setup

Here is the base setup for our code, upon which we'll be building off of in this lesson:
8 changes: 7 additions & 1 deletion sources/academy/webscraping/puppeteer_playwright/index.md
@@ -25,7 +25,13 @@ Both packages were developed by the same team and are very similar, which is why

When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc.

Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped).
Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped). Turn on the [headful mode](https://playwright.dev/docs/api/class-testoptions#test-options-headless) (`headless: false`) to see exactly what the browser is doing.

Browsers can also be effective for [overcoming anti-scraping measures](../anti_scraping/index.md), especially if the website is running [JavaScript browser challenges](../anti_scraping/techniques/browser_challenges.md).

## Disadvantages of headless browsers

Browsers are slow and expensive to run. In the follow-up courses, the Apify Academy will show you how to scrape websites without a browser. Every website can potentially be reverse-engineered into a series of quick and cheap HTTP calls, but it might require significant effort and specialized knowledge.

## Setup {#setup}
