MinglunZhu/crrri_usage_eg

crrri_usage_eg

Description

Example usage for the crrri package for R for headless Chrome web scraping.

Please note: this is not a working demo; it is example code mainly for you to read through. The sections below explain the main parts of the code.

Installing crrri

crrri enables web scraping with headless Chrome. The main advantages of using a headless browser are that it is less likely to be detected as a bot and that it can scrape content rendered by JavaScript.

To install crrri in R, run:

devtools::install_github('rlesur/crrri')

Code Explanation

Importing Packages

  • The promises package is used to create promises for the asynchronous JavaScript operations.
  • The crrri package is the main package we use to drive headless Chrome.
  • The rvest package is used to parse HTML. We could also parse in JavaScript, but R code is easier to work with here since it runs synchronously.
  • The dplyr package is a must-have package I use in all my code.

Constants

Then we define constants, including the URLs needed for scraping, the element IDs of the elements we want to scrape, and the column names for the result dataframe.
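As a rough sketch of what such a constants block can look like: every URL, element ID, and column name below is a made-up placeholder, not a value from this repository.

```r
# All values are hypothetical placeholders for illustration only.
LOGIN_URL  <- "https://example.com/login"
SEARCH_URL <- "https://example.com/search?q="

# Element IDs to scrape, named by the column each one fills
ELEMENT_IDS <- c(price = "priceValue", stock = "stockLevel")

# Column names of the result dataframe
RESULT_COLS <- c("keyword", names(ELEMENT_IDS))

# An empty result dataframe with one column per name
results <- setNames(
  data.frame(matrix(ncol = length(RESULT_COLS), nrow = 0)),
  RESULT_COLS
)
```

Keeping the element IDs in a named vector means the column names can be derived from it instead of being maintained separately.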

Start Headless Chrome

Next, in the code above the hash symbol separator, we start headless Chrome and navigate to the login page of the target site. Before doing so, we point crrri to our local Chrome installation with the Sys.setenv(...) call.

Once the environment variable is set, the code should start Chrome in headless mode and open the target site's login page.
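The start-up step can be sketched roughly as below. This is not the repository's exact code: the function name start_browser and both of its arguments are hypothetical, and it assumes crrri's documented HEADLESS_CHROME environment variable, Chrome$new(), and connect(callback = ...) interface.

```r
# Sketch only: start_browser, chrome_path and login_url are illustrative names.
start_browser <- function(chrome_path, login_url) {
  # crrri looks up the Chrome binary through this environment variable
  Sys.setenv(HEADLESS_CHROME = chrome_path)

  chrome <- crrri::Chrome$new()  # launches Chrome
  client <- chrome$connect(callback = function(client) {
    Page <- client$Page
    # Enable the Page domain, then open the login page
    promises::then(Page$enable(), function(value) {
      Page$navigate(url = login_url)
    })
  })

  # Keep both handles: client for later commands, chrome for closing
  list(chrome = chrome, client = client)
}
```

Nothing runs until start_browser() is called with a real Chrome path, so the definition itself is safe to source.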

Now you should switch to the headless Chrome and log in to the website.

Parsing HTML

The code below the hash symbol separator defines a couple of functions that parse the resulting HTML of the 2 target webpages. I won't go into detail here since they are page specific. Depending on the page you are scraping and the elements and data you are targeting, you will design these functions differently.

The gist of the first function is that it navigates to the target URL, waits for the page to load, retrieves the resulting HTML using a JavaScript expression, and then parses that HTML using rvest.
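The rvest half of that flow can be tried without a browser. In the sketch below, a hard-coded HTML string stands in for what the JavaScript expression would return; the element IDs and the helper get_text_by_id are hypothetical, and html_element() and html_text2() assume rvest 1.0 or later.

```r
library(rvest)

# Hypothetical HTML standing in for what the browser would return
html_string <- '<html><body>
  <span id="priceValue">12.50</span>
  <span id="stockLevel">In stock</span>
</body></html>'

# Extract the text of one element, selected by its ID
get_text_by_id <- function(doc, id) {
  node <- html_element(doc, css = paste0("#", id))
  html_text2(node)
}

doc   <- read_html(html_string)
price <- get_text_by_id(doc, "priceValue")
stock <- get_text_by_id(doc, "stockLevel")
```

Because this part is plain synchronous R, it can be developed and debugged against saved HTML before wiring it into the promise chain.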

The gist of the second function is that it navigates to the target URL, waits for the page to load, and then performs some further navigation by dispatching click events from JavaScript. This is needed because the target content sits in an iFrame which must be loaded from within the page. The JavaScript then creates a promise for the load event of that iFrame. Once the iFrame is loaded, the JavaScript performs a search from within the iFrame and waits, again in the form of a promise, for the iFrame to refresh itself with the search result. Finally, it sets a variable to true to indicate that the search result is available.

Then we create a promise in R and use the later package to check every 1 second whether the result is available, fulfilling the promise only once it is. Apparently there is a better way to do this using some crrri operations, but I haven't figured that out yet.
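A minimal sketch of this polling pattern is shown below. The helper name wait_until is hypothetical, and for simplicity it assumes a synchronous is_ready() check; in the real code the check would ask the browser whether the JavaScript flag has been set.

```r
library(promises)
library(later)

# Returns a promise that is fulfilled once is_ready() returns TRUE,
# re-checking every `interval` seconds until then.
wait_until <- function(is_ready, interval = 1) {
  promise(function(resolve, reject) {
    check <- function() {
      if (isTRUE(is_ready())) {
        resolve(TRUE)
      } else {
        later(check, delay = interval)  # schedule the next check
      }
    }
    check()
  })
}

# Usage with a fake condition that becomes TRUE on the third check
counter <- 0
done <- FALSE
p <- wait_until(function() {
  counter <<- counter + 1
  counter >= 3
}, interval = 0.1)
then(p, function(value) done <<- TRUE)

# Drive the event loop by hand (only needed in a non-interactive script)
for (i in 1:100) {
  if (done) break
  run_now(timeoutSecs = 1)
}
```

The sketch never rejects, so in practice a maximum number of attempts would be a sensible addition.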

Once result is available, we retrieve the iFrame HTML using JavaScript and proceed to parse it with rvest.

Save Result in DataFrame

The third function uses the first 2 functions to get the parsed data and saves it as a single dataframe row.
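One way to build such a row, assuming each parsing function returns a named list of values (the helper make_row and the field names are hypothetical):

```r
# Combine the keyword and the two parsed results into a one-row dataframe
make_row <- function(keyword, page1, page2) {
  as.data.frame(
    c(list(keyword = keyword), page1, page2),
    stringsAsFactors = FALSE
  )
}

row <- make_row(
  "widget",
  list(price = "12.50"),    # stand-in for the first function's output
  list(stock = "In stock")  # stand-in for the second function's output
)
```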

Chaining Scrape

The above operations scrape 1 keyword and output 1 row of data. If we wanted to scrape multiple keywords at the same time, we'd have to open multiple headless Chromes. Since the website requires a login, multiple simultaneous logins may not be allowed. So, instead, we chain the scraping of the keywords 1 after another.

Since scraping for 1 keyword is a promise, we just need to go through all keywords we want to scrape and create a promise for each and chain them together by using Reduce().

We only use chaining if there is more than 1 keyword to scrape; a single keyword is scraped directly.
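The chaining idea can be sketched as follows. Here scrape_keyword is a hypothetical stand-in that resolves after a short delay instead of driving Chrome, and the sketch starts the chain from an already resolved promise, which is a simplification of the approach described above: it makes the single-keyword case work without a separate check.

```r
library(promises)
library(later)

# Hypothetical stand-in for the real scraper: a promise of one dataframe row
scrape_keyword <- function(keyword) {
  promise(function(resolve, reject) {
    later(function() {
      resolve(data.frame(keyword = keyword, n_chars = nchar(keyword),
                         stringsAsFactors = FALSE))
    }, delay = 0.05)
  })
}

# Scrape the keywords strictly one after another, collecting rows as we go
scrape_all <- function(keywords) {
  Reduce(function(chain, keyword) {
    then(chain, function(rows) {
      # Start the next scrape only after the previous one has finished
      then(scrape_keyword(keyword), function(row) rbind(rows, row))
    })
  }, keywords, promise_resolve(NULL))
}

# Usage: collect the final dataframe once the whole chain has completed
result <- NULL
then(scrape_all(c("alpha", "be", "gam")), function(rows) result <<- rows)
for (i in 1:100) {
  if (!is.null(result)) break
  run_now(timeoutSecs = 1)
}
```

Each step returns a promise from inside then(), so the next keyword is not requested until the previous row has been appended.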

Post Processing

When the chain of promises completes, we close the headless Chrome, clean up the data, and write it to a csv file.
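A small sketch of the clean-up and output step, using made-up data and a temporary file; the actual cleaning depends entirely on the scraped pages.

```r
# Hypothetical raw result as it might come out of the promise chain
raw <- data.frame(
  keyword = c("a", "b", "a"),
  price   = c(" 12.50", "n/a", " 12.50"),
  stringsAsFactors = FALSE
)

# In the real script the browser would be closed first, e.g. chrome$close()

clean <- unique(raw)  # drop duplicated rows
# Trim whitespace and convert to numeric; unparseable values become NA
clean$price <- suppressWarnings(as.numeric(trimws(clean$price)))

out_file <- file.path(tempdir(), "results.csv")
write.csv(clean, out_file, row.names = FALSE)
```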

License

GNU GPLv3
