Commit 4276746
apply the rest of non-suggestion feedback
metalwarrior665 committed Feb 5, 2025
1 parent ed1e9e0 commit 4276746
Showing 4 changed files with 18 additions and 20 deletions.
@@ -28,26 +28,26 @@ If the website has a `robots.txt` file, it often contains sitemap URLs.
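For example, a minimal sketch that pulls those sitemap URLs out of `robots.txt` (assuming Node.js 18+ with the global `fetch`; the helper name is ours):

```ts
// Fetch robots.txt and collect the URLs from its "Sitemap:" directives.
async function sitemapsFromRobots(baseUrl: string): Promise<string[]> {
    const response = await fetch(new URL('/robots.txt', baseUrl));
    if (!response.ok) return [];
    const text = await response.text();
    return text
        .split('\n')
        // The directive looks like "Sitemap: https://example.com/sitemap.xml" and is case-insensitive.
        .filter((line) => line.trim().toLowerCase().startsWith('sitemap:'))
        .map((line) => line.trim().slice('sitemap:'.length).trim());
}

// Usage: const urls = await sitemapsFromRobots('https://example.com');
```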

### Common URL paths

You can try to iterate over common URL paths like:

- /sitemap.xml
- /product_index.xml
- /product_template.xml
- /sitemap_index.xml
- /sitemaps/sitemap_index.xml
- /sitemap/product_index.xml
- /media/sitemap.xml
- /media/sitemap/sitemap.xml
- /media/sitemap/index.xml
You can check some common URL paths, such as the following:

/sitemap.xml
/product_index.xml
/product_template.xml
/sitemap_index.xml
/sitemaps/sitemap_index.xml
/sitemap/product_index.xml
/media/sitemap.xml
/media/sitemap/sitemap.xml
/media/sitemap/index.xml

Also make sure to test the list with `.gz`, `.tar.gz`, and `.tgz` extensions, as well as with capitalized variants (e.g. `/Sitemap_index.xml.tar.gz`).
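A hedged sketch of such a check (Node.js 18+ with the global `fetch`; the helper names are ours, and `paths` should be extended with the full list above):

```ts
// Probe common sitemap paths plus compressed and capitalized variants.
const paths = ['/sitemap.xml', '/sitemap_index.xml', '/sitemaps/sitemap_index.xml']; // extend with the full list
const extensions = ['', '.gz', '.tar.gz', '.tgz'];

// '/sitemap_index.xml' -> '/Sitemap_index.xml'
const capitalize = (path: string) =>
    path.replace(/^\/(\w)/, (_m: string, c: string) => `/${c.toUpperCase()}`);

async function findSitemaps(baseUrl: string): Promise<string[]> {
    const found: string[] = [];
    for (const path of paths) {
        for (const variant of new Set([path, capitalize(path)])) {
            for (const ext of extensions) {
                const url = new URL(`${variant}${ext}`, baseUrl).toString();
                // fetch() only rejects on network errors; a 404 resolves with ok === false.
                const response = await fetch(url, { method: 'HEAD' }).catch(() => null);
                if (response?.ok) found.push(url);
            }
        }
    }
    return found;
}
```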

Some websites also provide an HTML version of their sitemap to help indexing bots find new content. Those include:

- /sitemap
- /category-sitemap
- /sitemap.html
- /sitemap_index
/sitemap
/category-sitemap
/sitemap.html
/sitemap_index

Apify provides the [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), an open-source actor that scans these URL variations for you automatically, so you don't have to check them manually.
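If you would rather trigger it from code, a minimal sketch using the [apify-client](https://docs.apify.com/api/client/js/) package could look like this (the `websiteUrl` input field is an assumption for illustration; check the actor's input schema for the real field names):

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run the actor and wait for it to finish. The input key below is an
// assumption for illustration - consult the actor's input schema.
const run = await client.actor('vaclavrut/sitemap-sniffer').call({
    websiteUrl: 'https://example.com',
});

// A run's results are typically stored in its default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```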

@@ -9,7 +9,7 @@ slug: /advanced-web-scraping/crawling/crawling-with-search

In this lesson, we will start with a simpler example of scraping HTML-based websites with limited pagination.

Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.
Limiting pagination is a common practice on e-commerce sites. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.
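One common technique, sketched below under assumed names, is to keep splitting a filter range (such as price) into halves until every slice returns fewer results than the cap; `countResults` stands in for a site-specific call that reads the result count for a filtered search:

```ts
// Assumed cap: 200 pages x 24 results per page. Adjust to the target site.
const MAX_RESULTS = 200 * 24;

type Range = [min: number, max: number];

// Recursively split [min, max] until each slice returns few enough results
// to be fully paginated. `countResults` is a site-specific call you provide.
async function collectRanges(
    countResults: (min: number, max: number) => Promise<number>,
    min: number,
    max: number,
): Promise<Range[]> {
    const count = await countResults(min, max);
    if (count <= MAX_RESULTS || min >= max) return [[min, max]];
    const mid = Math.floor((min + max) / 2);
    return [
        ...(await collectRanges(countResults, min, mid)),
        ...(await collectRanges(countResults, mid + 1, max)),
    ];
}
```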

![Pagination on a Google search results page](./images/pagination.png)

@@ -5,7 +5,7 @@
slug: /advanced-web-scraping/crawling/sitemaps-vs-search
---

The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course.
The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the [Web Scraping for Beginners course](/academy/web-scraping-for-beginners).

Unfortunately, _most modern websites restrict pagination_ only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first, but there are multiple hurdles that we will explore in this lesson.

sources/academy/webscraping/advanced_web_scraping/index.md (4 changes: 1 addition & 3 deletions)
@@ -6,8 +6,6 @@ category: web scraping & automation
slug: /advanced-web-scraping
---

# Advanced web scraping

In the [Web scraping for beginners](/academy/web-scraping-for-beginners) course, we learned the basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us solve most of the problems we will face.

In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.
@@ -16,7 +14,7 @@

To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages, and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are.

<!--
<!-- WIP: We want to split this into crawling and data extraction
The following sections will cover the core concepts that will ensure that your scraper is production-ready:
- The advanced crawling section will cover how to ensure we find all pages or products on the website.
- The advanced data extraction section will cover how to efficiently extract data from a particular page or API.
