Fix formatting suggested by Michal
Co-authored-by: Michał Olender <[email protected]>
metalwarrior665 and TC-MO authored Feb 5, 2025
1 parent ad77d26 commit ed1e9e0
Showing 3 changed files with 18 additions and 18 deletions.
@@ -7,31 +7,31 @@ slug: /advanced-web-scraping/crawling/sitemaps-vs-search

The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.

-Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
+Unfortunately, _most modern websites restrict pagination_ only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.

There are two main approaches to solving this problem:

-- Extracting all page URLs from the website's **sitemap**.
+- Extracting all page URLs from the website's _sitemap_.
- Using **categories, search and filters** to split the website so we get under the pagination limit.

-Both of these approaches have their pros and cons so the best solution is to **use both and combine the results**. Here we will learn why.
+Both of these approaches have their pros and cons so the best solution is to _use both and combine the results_. Here we will learn why.

## Pros and cons of sitemaps

A sitemap is usually a simple XML file that contains a list of all pages on the website. Sitemaps are created and maintained mainly for search engines like Google, to help ensure that the website gets fully indexed there. They are commonly located at URLs like `https://example.com/sitemap.xml` or `https://example.com/sitemap.xml.gz`. We will get to work with sitemaps in the next lesson.
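As a rough illustration, here is a minimal sketch of how such a sitemap could be downloaded and its URLs extracted. It assumes Node.js 18+ (global `fetch`) and an uncompressed XML sitemap; the helper name and URL are made up for the example:

```ts
// Minimal sketch: download a sitemap and pull out every <loc> URL.
// Assumes Node.js 18+ (global fetch) and a plain, uncompressed XML sitemap.
async function getSitemapUrls(sitemapUrl: string): Promise<string[]> {
    const response = await fetch(sitemapUrl);
    const xml = await response.text();
    // Both sitemap indexes and regular sitemaps list their entries in <loc> elements.
    return [...xml.matchAll(/<loc>\s*(.*?)\s*<\/loc>/g)].map((match) => match[1]);
}

// Example usage with a placeholder URL: prints how many URLs the sitemap lists.
getSitemapUrls('https://example.com/sitemap.xml')
    .then((urls) => console.log(`Found ${urls.length} URLs`))
    .catch((error) => console.error(error));
```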

### Pros

-- **Quick to set up** - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code.
-- **Fast to run** - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds.
-- **Usually complete** - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website.
+- _Quick to set up_ - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code.
+- _Fast to run_ - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds.
+- _Usually complete_ - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website.

### Cons

-- **Does not directly reflect the website** - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs.
-- **Updated in intervals** - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
-- **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
-- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework.
+- _Does not directly reflect the website_ - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs.
+- _Updated in intervals_ - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
+- _Hard to find or unavailable_ - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
+- _Streamed, compressed, and archived_ - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework.
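The compression mentioned in the last con usually takes only a few extra lines to handle. Here is a rough sketch, assuming Node.js 18+ and the built-in `node:zlib` module; the function name and the detection heuristic are illustrative, not a prescribed approach:

```ts
import { gunzipSync } from 'node:zlib';

// Sketch: fetch a gzip-compressed sitemap (e.g. sitemap.xml.gz) and decompress it
// manually instead of relying on the HTTP client's default decoding.
async function getCompressedSitemapXml(sitemapUrl: string): Promise<string> {
    const response = await fetch(sitemapUrl);
    const body = Buffer.from(await response.arrayBuffer());
    // If the body already starts with '<', the server (or fetch) has decoded it for us;
    // otherwise decompress the gzip payload ourselves.
    return body[0] === 0x3c ? body.toString('utf-8') : gunzipSync(body).toString('utf-8');
}
```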

## Pros and cons of categories, search, and filters

@@ -41,15 +41,15 @@ The pros and cons of this approach are pretty much the opposite of relying on si

### Pros

-- **Directly reflects the website** - With most scraping use-cases, we want to analyze the website as the regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users.
-- **Updated in real-time** - The website is updated in real-time so we can be sure that we are getting all pages.
-- **Often contain detailed data** - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc, especially if available via JSON API. This means that we can sometimes get all the data we need without going to the detail pages.
+- _Directly reflects the website_ - With most scraping use-cases, we want to analyze the website as the regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users.
+- _Updated in real-time_ - The website is updated in real-time so we can be sure that we are getting all pages.
+- _Often contain detailed data_ - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc, especially if available via JSON API. This means that we can sometimes get all the data we need without going to the detail pages.
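To make the last pro more concrete, here is a sketch of reading product data straight from a category listing's JSON API. The endpoint shape, field names, and helper are entirely hypothetical; real sites expose different structures:

```ts
// Hypothetical shape of one item returned by a category listing API.
interface ProductListing {
    name: string;
    price: number;
    category: string;
    url: string;
}

// Sketch: a single request to a (made-up) category endpoint can already return
// the fields we need, so visiting every detail page may not be necessary.
async function getCategoryProducts(categoryApiUrl: string): Promise<ProductListing[]> {
    const response = await fetch(categoryApiUrl);
    const data = await response.json() as { items: ProductListing[] };
    return data.items;
}
```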

### Cons

-- **Complex to set up** - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons.
-- **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found.
-- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this.
+- _Complex to set up_ - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons.
+- _Slow to run_ - The traversing can require a lot of requests. Some filters or categories will have products we already found.
+- _Not always complete_ - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this.

## Do we know how many products there are?

sources/academy/webscraping/advanced_web_scraping/index.md (2 changes: 1 addition & 1 deletion)
@@ -8,7 +8,7 @@ slug: /advanced-web-scraping

# Advanced web scraping

-In the [**Web scraping for beginners**](/academy/web-scraping-for-beginners) course, we learned the basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face.
+In the [Web scraping for beginners](/academy/web-scraping-for-beginners) course, we learned the basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face.

In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

@@ -196,7 +196,7 @@ Here's what the output of this code looks like:
105
```
-## Final note {#final-note}
+## Final note
Sometimes, APIs have limited pagination. That means they either limit the total number of results that can appear across all pages, or they limit the number of pages themselves. To learn how to handle these cases, take a look at the [Crawling with search](/academy/advanced-web-scraping/crawling/crawling-with-search) article.
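As a rough sketch of what such a limit looks like in practice, here is an offset-based pagination loop that stops at a hard cap. The endpoint, parameters, and the 10,000-result cap are hypothetical:

```ts
// Sketch: offset/limit pagination against a hypothetical API that caps
// the reachable offset at 10,000 results, no matter how many items exist.
async function collectPaginatedResults(baseUrl: string): Promise<unknown[]> {
    const limit = 100;
    const maxOffset = 10_000; // hard cap imposed by the (hypothetical) API
    const results: unknown[] = [];

    for (let offset = 0; offset < maxOffset; offset += limit) {
        const response = await fetch(`${baseUrl}?offset=${offset}&limit=${limit}`);
        const { items } = await response.json() as { items: unknown[] };
        if (items.length === 0) break; // no more results on this page
        results.push(...items);
    }
    // Anything beyond the cap has to be reached by splitting the site with
    // categories, search, or filters, as described in the linked article.
    return results;
}
```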
