
Newbie Guide To Scraping With Puppeteer

theohgr edited this page Aug 28, 2021 · 3 revisions

Newbie Guide To Puppeteer / Playwright

This is a brief guide intended to help new developers get up and running with Puppeteer and Puppeteer-Extra. Any good off-site guides can also be linked here as recommended reading.

Although learning a new language can be intimidating for a new developer, try not to get overwhelmed and tap out looking for a silver bullet. "Off the shelf" automation tools are notoriously poorly written, are unlikely to achieve your specific goals, and are much easier to detect than a custom solution.

Tip

Whenever it gets hard, think about this: when you're reflecting on your life 30 years from now, would you rather have learned an incredibly valuable skill, or spent that time playing video games / binge-watching TV?

🎭 Why Puppeteer / Playwright? (or "Why Selenium Won't Work")

Many developers new to Puppeteer come from a Python/Selenium background. Unfortunately, Selenium post-2019 is notoriously easy to detect (TODO: Add sources), and most operators facing off against advanced anti-bot vendors have migrated away from it.

🕵‍♂ Why am I being detected?

If you've found this repo, it's safe to assume your use-case for Puppeteer / automation is not for browser-testing your own website. This means you've already encountered anti-bot detection countermeasures, or are about to (lol).

This is a much longer topic than can be adequately covered here, but Antoine Vastel's blog gives very well-articulated descriptions of just how easy it is to detect un-hardened automation tools: https://antoinevastel.com

Although (for obvious reasons, given his position) he can't describe most current detection methods, I would consider absorbing every post as "required reading" as an introduction to understanding what you are up against.
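
To make the idea concrete, here is a minimal sketch of the kind of client-side check an un-hardened browser can fail. The `NavigatorLike` shape and the `collectAutomationSignals` function are hypothetical names made up for this example, and the two signals shown are only the best-known ones. Real anti-bot vendors look at far more than this.

```typescript
// Hypothetical subset of the browser's `navigator` object that a detection script might read.
interface NavigatorLike {
  webdriver?: boolean
  userAgent: string
}

// Returns human-readable reasons why this browser looks automated.
const collectAutomationSignals = (nav: NavigatorLike): string[] => {
  const signals: string[] = []

  // `navigator.webdriver` is defined by the WebDriver specification as a flag for automated browsers.
  if (nav.webdriver === true) {
    signals.push("navigator.webdriver is true")
  }

  // Headless Chrome has historically announced itself in its default user agent.
  if (/HeadlessChrome/.test(nav.userAgent)) {
    signals.push("user agent contains HeadlessChrome")
  }

  return signals
}

// Illustrative values only, not captured from a real browser.
const unhardened: NavigatorLike = {
  webdriver: true,
  userAgent: "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/92.0.0.0",
}
console.log(collectAutomationSignals(unhardened))
```

If two lines of JavaScript can flag a default setup, it is worth assuming that a dedicated vendor can do much better.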

👨‍🔧 Setting up your environment

Ok, enough talk, let's get started...

Installing Node

First up you'll need to install Node:

Choosing An IDE

Then you'll need a good TypeScript-friendly IDE:

Code Quality (or "Tidying Up Your Crappy Code with Marie Kondo")

You'll want to enforce some code style as well to avoid your code becoming a mess...

For this, we'll use ESLint and Prettier:

Installation
$ npm i -D typescript eslint prettier
$ npm i -D eslint-config-prettier eslint-plugin-prettier @typescript-eslint/parser @typescript-eslint/eslint-plugin

There are lots of ways to configure Prettier and ESLint, but here are a few example files to get you started. Create each of these files in the root of your project and configure your IDE to use them.

.eslintrc.json
{
  "parser": "@typescript-eslint/parser",
  "extends": [
    "plugin:@typescript-eslint/recommended",
    "plugin:prettier/recommended"
  ],
  "plugins": ["@typescript-eslint", "prettier"],
  "rules": {
    "semi": ["error", "never"],
    "no-debugger": "off",
    "no-console": "off"
  }
}

.prettierrc.json
{
  "semi": false
}

🧠 Understanding Modern JavaScript

If you've never worked with Node-based JavaScript, or your prior experience is with old-school web-based JS, there's a lot to learn. But remember: every hour you spend learning this part pays 10x dividends down the line.

Async/Await/Promises
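
Almost every Puppeteer call returns a Promise, so this is the concept to nail first. Here's a minimal sketch of the idea. The `sleep` and `fetchTitle` helpers are hypothetical stand-ins for slow operations such as page loads, and nothing here touches a real browser.

```typescript
// Wrap setTimeout in a Promise so it can be awaited.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms))

// A hypothetical slow operation standing in for something like a page load.
const fetchTitle = async (url: string): Promise<string> => {
  await sleep(10)
  return `Title of ${url}`
}

// Awaiting one call at a time runs them in sequence.
const fetchSequentially = async (urls: string[]): Promise<string[]> => {
  const titles: string[] = []
  for (const url of urls) {
    titles.push(await fetchTitle(url))
  }
  return titles
}

// Promise.all starts every call first, then waits for all of them together.
const fetchConcurrently = (urls: string[]): Promise<string[]> =>
  Promise.all(urls.map((url) => fetchTitle(url)))

fetchConcurrently(["a.example", "b.example"])
  .then((titles) => console.log(titles))
  .catch((error) => console.error(error))
```

Both versions return the titles in the same order as the input. The difference is only in how long you wait.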

Classes
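
Classes bundle some state together with the functions that work on it. The `RetryBudget` class below is a made-up example (it is not part of Puppeteer) showing a constructor, a private field, a getter and a method.

```typescript
// A tiny class that tracks how many attempts a task has left.
class RetryBudget {
  // `private` fields can only be touched from inside the class.
  private remaining: number

  constructor(maxAttempts: number) {
    this.remaining = maxAttempts
  }

  // A getter reads like a property: `budget.exhausted`.
  get exhausted(): boolean {
    return this.remaining <= 0
  }

  // Uses up one attempt and reports whether it was allowed.
  consume(): boolean {
    if (this.exhausted) {
      return false
    }
    this.remaining -= 1
    return true
  }
}

const budget = new RetryBudget(2)
// Prints: true true false
console.log(budget.consume(), budget.consume(), budget.consume())
```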

Arrow Functions
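
Arrow functions are the short `=>` syntax you'll see all over Puppeteer examples. A rough sketch of the two things worth knowing: they make callbacks compact, and they don't get their own `this`. The `ClickCounter` class is a hypothetical example, not a Puppeteer API.

```typescript
// A classic function expression and the equivalent arrow function.
const doubleClassic = function (n: number): number {
  return n * 2
}
const double = (n: number): number => n * 2

// Arrow functions shine as short callbacks.
const prices = [5, 12, 8]
const doubled = prices.map(double)
const expensive = prices.filter((price) => price > 6)

// Arrow functions do not get their own `this`; they use the one from the surrounding scope.
class ClickCounter {
  count = 0

  // Because this is an arrow function, `this` is always the ClickCounter instance,
  // even when the function is passed around as a bare callback.
  increment = (): void => {
    this.count += 1
  }
}

const counter = new ClickCounter()
const callbacks = [counter.increment, counter.increment]
callbacks.forEach((callback) => callback())

console.log(doubled, expensive, counter.count)
```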

🛡 TypeScript

Given the Puppeteer and Puppeteer-Extra projects are written in TypeScript, it's recommended that you develop in TypeScript from the start to become familiar with the patterns and avoid learning bad habits (which vanilla JavaScript is prone to, due to its lack of strong typing / rules).

Below are some good introductory guides which you should read before starting your first project:

👶 First Project

Use the guides above to develop a basic app that prints some "Hello World" type output to your console. Here's another one to get you started: https://code.visualstudio.com/docs/typescript/typescript-tutorial
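
If you want something to type in right away, a first file could look roughly like the sketch below. The `Target` interface and both function names are made up for this example. Assuming you save it as a hypothetical hello.ts, one way to run it is `npx ts-node hello.ts`, which assumes you have installed the ts-node package.

```typescript
// A typed function: TypeScript rejects calls such as greet(42) at compile time.
const greet = (who: string): string => `Hello, ${who}!`

// An interface describes the shape of an object.
interface Target {
  url: string
  retries: number
}

const describeTarget = (target: Target): string =>
  `${target.url} (up to ${target.retries} retries)`

console.log(greet("World"))
console.log(describeTarget({ url: "https://scrapethissite.com/", retries: 3 }))
```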

Tip

A lot of the information published about modern JavaScript is targeted at the browser. For automation, you need to focus entirely on Node guides for now. Don't bother making anything web-based: you won't be using that skill-set, so trying to absorb the additional context is just going to lead to confusion.

Now you've gotten your head around the basics of a Node-based script, start looking at very simple demo projects of Puppeteer.

Here's some resources to get you going:

Vanilla Puppeteer

Broadly speaking, your first code should look something like this (extra points for using ES6 syntax):

// We only need one package for this. Make sure you've installed it with `npm install puppeteer`.
import puppeteer, { Browser, Page } from "puppeteer"

// We'll start with a self-executing async function.
;(async () => {
  // First let's create a new Browser instance.
  const browser: Browser = await puppeteer.launch({ headless: false })
  try {
    // Then we need to instantiate a new Page.
    const page: Page = await browser.newPage()
    // How about we take a quick screenshot of Google?
    await page.setViewport({ width: 1280, height: 800 })
    await page.goto("https://www.google.com")
    await page.screenshot({ path: "myscreenshot.png", fullPage: true })
  } finally {
    // Always clean up your browser after use, even if something above threw.
    await browser.close()
  }
})()

Puppeteer Extra

Vanilla Puppeteer is nice and all, but how do I make it ...stealthy? 🕵‍♂

Here's a basic demo using Puppeteer Extra with the Stealth plugin. As an exercise, see if you can convert this vanilla ES6 JavaScript to TypeScript!

// Import our required tooling.
import puppeteer from "puppeteer-extra"
import stealth from "puppeteer-extra-plugin-stealth"

// Let's use some commonly used defaults to help us "hide" in the crowd...
const options = {
  headless: false,
  ignoreHTTPSErrors: true,
  args: [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-sync",
    "--ignore-certificate-errors",
    "--lang=en-US,en;q=0.9",
  ],
  defaultViewport: { width: 1366, height: 768 },
}

;(async () => {
  // Before we start, let's enable the Stealth plugin.
  // In practice, you'd generally enable a bunch of specific evasions you require, but for now the defaults will be fine.
  puppeteer.use(stealth())

  // Create a new browser and initialize a page.
  const browser = await puppeteer.launch(options)
  const page = await browser.newPage()

  // Let's scrape some data from a "scrape-friendly" site!
  // We'll generally want to wait for the page to fully load, which is achieved by waiting for the networkidle2 event.
  await page.goto("https://scrapethissite.com/", { waitUntil: `networkidle2` })

  // On the homepage, we want to find the button that links to the list of lessons.
  const exploreSandboxButton = await page.$(`section#hero a[href="/pages/"]`)

  // If we can't find the button, it means something went wrong!
  if (!exploreSandboxButton) {
    throw new Error(`Could not find sandbox button!`)
  }

  // Click the button with a delay between the mousedown and mouseup events. In practice you would randomise this.
  // The click triggers navigation to the next page, so we start waiting for that navigation before clicking.
  // Otherwise a fast navigation can finish before waitForNavigation starts listening.
  await Promise.all([
    page.waitForNavigation({ waitUntil: "networkidle2" }),
    exploreSandboxButton.click({ delay: 5 }),
  ])

  // Unfortunately the way Puppeteer (currently) scrolls is very easy to detect, so we'll need to send a raw CDP command.
  // This is just a bit of fun for the demo. In practice you would loop this with random distances until you reach your target Y position.
  // We open a CDP session through the public API rather than relying on the private page._client property.
  const cdp = await page.target().createCDPSession()
  await cdp.send("Input.synthesizeScrollGesture", {
    x: 0,
    y: 0,
    xDistance: 0,
    yDistance: -100,
  })

  // If the element we are looking for doesn't exist, there's no point continuing.
  const pagesDiv = await page.$("div#pages")
  if (!pagesDiv) {
    throw new Error(`Could not find pages container!`)
  }

  // Normally we'd avoid executing on-page JS where possible, but this is a good way to demonstrate how to execute a script within the page.
  const lessonList = await page.evaluate(() => {
    const results = []

    // Get all the divs with the class "page", this is our list of Lessons.
    const lessons = document.querySelectorAll("div.page")

    // Parse the important information from each div.
    lessons.forEach((lesson) => {
      results.push({
        title: lesson.querySelector("h3.page-title").innerText,
        description: lesson.querySelector("p.lead").innerText,
      })
    })

    // Return the results we have scraped.
    return results
  })

  // If everything worked on-page, we should have a collection of lessons!
  console.log("Lessons: ", lessonList)

  // Clean up. Closing the browser also closes every page it owns.
  await browser.close()

  // Exit!
  process.exit()
})()
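
The demo comments mention randomising the click delay and looping the scroll with random distances. Below is one possible sketch of the arithmetic behind that, kept free of Puppeteer so it can be tested on its own. The helper names and the ranges (40 to 120 ms, 80 to 160 px) are assumptions picked for illustration, not recommended values.

```typescript
// Random integer between min and max, inclusive.
const randomInt = (min: number, max: number): number =>
  Math.floor(Math.random() * (max - min + 1)) + min

// Split a total scroll distance into randomly sized steps that add up exactly.
// The final step may be smaller than minStep so that we never overshoot.
const planScrollSteps = (totalDistance: number, minStep: number, maxStep: number): number[] => {
  if (minStep < 1 || maxStep < minStep) {
    throw new Error("Expected 1 <= minStep <= maxStep")
  }
  const steps: number[] = []
  let remaining = totalDistance
  while (remaining > 0) {
    const step = Math.min(randomInt(minStep, maxStep), remaining)
    steps.push(step)
    remaining -= step
  }
  return steps
}

// Hypothetical ranges; tune them for your own target.
const clickDelayMs = randomInt(40, 120)
const scrollSteps = planScrollSteps(900, 80, 160)
console.log({ clickDelayMs, scrollSteps })
```

In the demo, `clickDelayMs` would be passed as the `delay` option of `click()`, and each entry of `scrollSteps` would become one `Input.synthesizeScrollGesture` call with `yDistance` set to the negated step.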

🌿 Expanding Your Project

As your project grows, you'll need to start splitting components out of the single demo.ts file you've been working on. Otherwise you'll end up with a huge file that is impossible to follow.

There isn't really any common project structure enforced by default, but you will typically have an app.ts or index.ts file that bootstraps your app and responds to commands.
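
As a rough sketch of what "bootstraps your app and responds to commands" could mean, here's a tiny command dispatcher. The command names, the `runCommand` function and the src/commands/ layout mentioned in the comments are all hypothetical. This is just one way to do it, not an established convention.

```typescript
// Each command is an async function that receives the leftover CLI arguments.
type Command = (args: string[]) => Promise<string>

// Hypothetical commands. In a real project each would live in its own file
// (for example src/commands/scrape.ts) and be imported here.
const commands: Record<string, Command> = {
  scrape: async (args) => `scraping ${args[0] ?? "(no url given)"}`,
  version: async () => "demo 0.0.1",
}

// Look the command up by name and run it, or fail loudly if it is unknown.
const runCommand = async (commandName: string, args: string[]): Promise<string> => {
  const command = commands[commandName]
  if (!command) {
    throw new Error(`Unknown command: ${commandName}`)
  }
  return command(args)
}

// In index.ts you would pass process.argv.slice(2) here instead of a fixed list.
const [commandName, ...commandArgs] = ["scrape", "https://scrapethissite.com/"]
runCommand(commandName, commandArgs)
  .then((result) => console.log(result))
  .catch((error) => console.error(error))
```

Keeping the entry file this thin means the browser logic in each command can grow without index.ts getting harder to read.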

// TODO: Would be great if someone can describe a typical Puppeteer project structure in more detail here

More to come!

Feel free to contribute / edit!