Rust is rapidly gaining attention as a programming language that offers performance on par with C/C++, which makes it an appealing choice for web scraping. However, unlike Python, which is relatively easy to learn (often at the cost of performance), Rust can be tricky to figure out.
That doesn't mean scraping with Rust is impossible or extremely hard. It is only challenging if you don't know how to begin.
This article will give you an overview of the process of writing a fast and efficient web scraper in Rust.
For a detailed explanation, see this blog post.
Download rustup from the https://www.rust-lang.org/tools/install page. On Windows, download RUSTUP-INIT and run it to install Rust.
On macOS and Linux, the same page will show you the command to install rustup. The command will be similar to the following:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Run this command from the terminal and follow the prompts to install Rust.
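Once the installation completes, you can verify that the Rust toolchain is available (the exact version printed will vary):
$ cargo --version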
Open the terminal and run the following command to initialize an empty project:
$ cargo new book_scraper
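cargo new creates a directory named after the project, so switch into it before continuing:
$ cd book_scraper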
Now, open the Cargo.toml file and add the following lines:
[dependencies]
reqwest = {version = "0.11", features = ["blocking"]}
scraper = "0.13.0"
The reqwest crate handles HTTP requests, with the blocking feature enabling a simple synchronous API that needs no async runtime, and the scraper crate parses HTML and queries it with CSS selectors.
Then, replace the contents of src/main.rs with the following code, which downloads the page and prints the raw HTML:
// main.rs
fn main() {
    let url = "https://books.toscrape.com/";
    // Send a blocking GET request; panic with a readable message on failure.
    let response = reqwest::blocking::get(url).expect("Could not load url.");
    // Read the entire response body into a string.
    let body = response.text().expect("No response body found.");
    println!("{}", body);
}
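At this point, you can already run the project to confirm that the download works; it should print the HTML of the page to the terminal:
$ cargo run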
Open https://books.toscrape.com/ in Chrome and examine the HTML markup of the web page. You will notice that every book on the page is wrapped in an article element with the class product_pod.
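A simplified sketch of the markup for a single book looks roughly like this (the elided parts and abbreviations are ours; inspect the live page for the exact structure):
<article class="product_pod">
  ...
  <h3><a href="catalogue/..." title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    ...
  </div>
</article>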
Back in main.rs, add the scraper imports at the top of the file:
use scraper::{Html, Selector};
Then, inside main, parse the downloaded HTML into a document and create a selector that matches those article elements:
// Parse the raw HTML string into a queryable document tree.
let document = Html::parse_document(&body);
// CSS selector matching the container element of each book.
let book_selector = Selector::parse("article.product_pod").expect("Could not create selector.");
Now the selector is ready to be used. Add the following lines to the main function:
for element in document.select(&book_selector) {
    // more code here
}
Extracting the book name and price
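The loop body needs two more selectors: one for the link inside each book's h3 heading, whose title attribute holds the full book name, and one for the price element. Define them next to book_selector, outside the loop:
let book_name_selector = Selector::parse("h3 a").expect("Could not create selector.");
let book_price_selector = Selector::parse(".price_color").expect("Could not create selector.");
With these in place, extract the name and price inside the loop: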
for element in document.select(&book_selector) {
    // The full book name is stored in the title attribute of the link.
    let book_name_element = element.select(&book_name_selector).next().expect("Could not select book name.");
    let book_name = book_name_element.value().attr("title").expect("Could not find title attribute.");
    // The price is the text content of the .price_color element.
    let price_element = element.select(&book_price_selector).next().expect("Could not find price.");
    let price = price_element.text().collect::<String>();
    println!("{:?} - {:?}", book_name, price);
}
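Run cargo run again and you should see output along these lines (titles and prices come from the live site and may change):
"A Light in the Attic" - "£51.77"
"Tipping the Velvet" - "£53.74"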
Extracting product links
Within the for loop, add the following lines. The same link element that carries the book name also carries the URL, so book_name_selector can be reused:
let book_link_element = element.select(&book_name_selector).next().expect("Could not find book link element.");
let book_link = book_link_element.value().attr("href").expect("Could not find href attribute.");
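Note that the href values on this site are relative (they start with catalogue/). If you need absolute URLs, one simple option is to prepend the base URL; the absolute_link variable below is only an illustration and is not used in the final listing:
// Illustrative only: build an absolute URL from the base url and the relative href.
let absolute_link = format!("{}{}", url, book_link);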
To save the scraped data to a CSV file, first add the csv crate to the [dependencies] section of Cargo.toml:
csv = "1.1"
Then update main.rs as follows:
// main.rs
use scraper::{Html, Selector};

fn main() {
    let url = "https://books.toscrape.com/";
    // Download the page.
    let response = reqwest::blocking::get(url).expect("Could not load url.");
    let body = response.text().expect("No response body found.");

    // Parse the HTML and prepare the selectors.
    let document = Html::parse_document(&body);
    let book_selector = Selector::parse("article.product_pod").expect("Could not create selector.");
    let book_name_selector = Selector::parse("h3 a").expect("Could not create selector.");
    let book_price_selector = Selector::parse(".price_color").expect("Could not create selector.");

    // Create the CSV file and write the header row.
    let mut wtr = csv::Writer::from_path("books.csv").expect("Could not create file.");
    wtr.write_record(&["Book Name", "Price", "Link"])
        .expect("Could not write header.");

    for element in document.select(&book_selector) {
        let book_name_element = element
            .select(&book_name_selector)
            .next()
            .expect("Could not select book name.");
        let book_name = book_name_element
            .value()
            .attr("title")
            .expect("Could not find title attribute.");
        let price_element = element
            .select(&book_price_selector)
            .next()
            .expect("Could not find price.");
        let price = price_element.text().collect::<String>();
        let book_link_element = element
            .select(&book_name_selector)
            .next()
            .expect("Could not find book link element.");
        let book_link = book_link_element
            .value()
            .attr("href")
            .expect("Could not find href attribute.");
        // Write one row per book.
        wtr.write_record([book_name, &price, book_link])
            .expect("Could not write record.");
    }
    wtr.flush().expect("Could not flush file.");
    println!("Done");
}
Enter the following command to run the code:
$ cargo run
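When the program finishes, books.csv in the project directory should contain one row per book, similar to the following (values are taken from the live site and may change):
Book Name,Price,Link
A Light in the Attic,£51.77,catalogue/a-light-in-the-attic_1000/index.html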
If you wish to find out more about web scraping with Rust, see our blog post.