HTML parser

The JSON, XML and HTML parsers have the most complex source files after the dynamic parser. The source file is a JSON file, and it cannot run with only a URL; a lot of configuration is needed.

You must use tools such as the browser's inspect element tool to find the CSS selectors or the class names. CSS selectors are recommended over class names because they pinpoint the desired element more accurately.

Scrape

`container`

The identifier of the articles' container in the HTML code. It can be a class, a CSS selector, an XPath or a JS path.

The parser then iterates over each child of the selected element and applies the options inside `article` to it.

For example, for this HTML code the container will be .the-container >.

```html
<div class="the-container">
    <!-- It will iterate all the divs inside container -->
    <div id="article1">...</div>
    <div id="article2">...</div>
    <div id="article3">...</div>
    ...
</div>
```
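
The iteration over the container's children can be pictured with a minimal Python sketch. It is only an illustration: it uses `xml.etree.ElementTree` from the standard library as a stand-in for the parser's own HTML handling, and the markup is a simplified, well-formed copy of the example with sample text in place of the elided content.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the example markup. The sample text inside
# each article is a placeholder chosen for this sketch.
HTML = """
<div class="the-container">
    <div id="article1">First</div>
    <div id="article2">Second</div>
    <div id="article3">Third</div>
</div>
"""

container = ET.fromstring(HTML)

# Iterating an element visits its direct children, one article at a time.
article_ids = [child.get("id") for child in container]
print(article_ids)
```

Each child visited this way corresponds to one article, which is where the options inside `article` would be applied.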

`scriptingEnabled`

Default: true

From the Cheerio documentation: if set to true, the content of `noscript` elements will be parsed as text.

`article`

It will contain all the configuration needed to fill each article field.

The fields and their options are in a key-value relationship.

```json
{
    "title": {/* options */},
    "content": {/* options */},
    // ...
}
```

The following standard keys will be assigned as the root fields in the article: title, content, pubDate, link, attachments, categories and thumbnail. All the other fields mentioned here will be moved to the extras field in a key-value relationship.
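
The split between root fields and `extras` could look roughly like the following Python sketch. The function name `build_article` and the `author` field are hypothetical and exist only for illustration; the list of standard keys is the one given above.

```python
# Standard keys listed in the documentation above.
STANDARD_KEYS = (
    "title", "content", "pubDate", "link",
    "attachments", "categories", "thumbnail",
)


def build_article(fields):
    """Place standard keys at the root and every other key under extras."""
    article = {"extras": {}}
    for key, value in fields.items():
        if key in STANDARD_KEYS:
            article[key] = value
        else:
            article["extras"][key] = value
    return article


# "author" is a hypothetical non-standard field.
article = build_article({
    "title": "Title of the article",
    "link": "article-url",
    "author": "Jane Doe",
})
print(article)
```

In this sketch `title` and `link` stay at the root, while `author` ends up under `extras`.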

`class`

The identifier of the article in the HTML code. It can be a class, a CSS selector, an XPath or a JS path.

If the HTML tag cannot be identified by the above, this field can be omitted.

For example, for this HTML code the class for the title will be .title.title2.

```html
<div class="the-container">
    <div id="article1">
        <a class="title title2" href="article-url">Title of the article</a>
    </div>
</div>
```

`find`

How deep to go in the HTML code, step by step. Each step can be a class, a CSS selector, an XPath or a JS path.

When `multiple` is set to true, it will take the last item in the list and iterate over the matching elements.

For example, for this HTML code the `find` option for the title will be `["h5", "a"]`.

```html
<div class="the-container">
    <div id="article1">
        <h5>
            <a href="article-url">Title of the article</a>
        </h5>
    </div>
</div>
```
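
The step-by-step descent can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser's implementation: the helper `descend` is hypothetical, it relies on `xml.etree.ElementTree`, and it only understands plain tag names rather than full CSS selectors.

```python
import xml.etree.ElementTree as ET


def descend(element, steps):
    """Follow each step in order, narrowing the search to the previous match."""
    current = element
    for step in steps:
        # ".//tag" looks for the first descendant with that tag name.
        current = current.find(f".//{step}")
        if current is None:
            return None
    return current


ARTICLE = ET.fromstring(
    '<div id="article1"><h5><a href="article-url">Title of the article</a></h5></div>'
)

title = descend(ARTICLE, ["h5", "a"])
print(title.text)
```

If any step fails to match, the sketch returns `None` instead of raising an error.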

`attributes`

Use this option when the value of one or more attributes is needed.

For example, the link of the article may be the `href` attribute of an `a` tag that contains the title.

```html
<a href="link-of-the-article">Title of the article</a>
```

Instead of text, it will return an array of objects, each containing the requested attribute, the value of that attribute and the text of the tag.

```json
{
    "attribute": "href",
    "value": "link-of-the-article",
    "text": "Title of the article"
}
```
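
A rough Python sketch of how such objects could be assembled is shown below. The helper `read_attributes` is hypothetical, and skipping attributes that are missing from the tag is an assumption of this sketch rather than documented behavior.

```python
import xml.etree.ElementTree as ET


def read_attributes(element, names):
    """Return one object per requested attribute that exists on the element."""
    text = (element.text or "").strip()
    return [
        {"attribute": name, "value": element.get(name), "text": text}
        for name in names
        # Assumption: attributes that are absent from the tag are skipped.
        if element.get(name) is not None
    ]


link = ET.fromstring('<a href="link-of-the-article">Title of the article</a>')
result = read_attributes(link, ["href"])
print(result)
```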

`multiple`

Default value: false

If set to true, it will retrieve all matching elements instead of only the first one. It iterates over the last element of the `find` option.

For example, if you want to get all the tags inside `ul`:

```html
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
    ...
</ul>
```

The options will look like this:

```json
{
    "find": ["ul", "li"],
    "multiple": true
}
```

Note that the standard keys are assigned only the first element of the array; the other elements are skipped.
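
The difference between single and multiple retrieval can be sketched in Python. The helper `select` is hypothetical and, as an assumption of this sketch, handles only plain tag names through `xml.etree.ElementTree`; it always returns a list so that both modes are easy to compare.

```python
import xml.etree.ElementTree as ET


def select(element, steps, multiple=False):
    """Descend through all steps but the last, then collect matches of the last."""
    *parents, last = steps
    current = element
    for step in parents:
        current = current.find(f".//{step}")
        if current is None:
            return []
    texts = [(match.text or "").strip() for match in current.findall(f".//{last}")]
    # Without multiple, only the first match is kept.
    return texts if multiple else texts[:1]


ARTICLE = ET.fromstring("<div><ul><li>1</li><li>2</li><li>3</li></ul></div>")

all_items = select(ARTICLE, ["ul", "li"], multiple=True)
first_item = select(ARTICLE, ["ul", "li"])
print(all_items, first_item)
```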

`parent`

There are cases where the needed information is displayed in multiple places.

This option appends the current result to the end of the result of the field named in `parent`. If multiple options have the same parent, their results are appended in the order in which they were defined.

This HTML code:

```html
<div id="article1">
    <a class="first">First part of title</a>
    <p class="second">Second part of title</p>
    <p class="third">Third part of title</p>
</div>
```

will have the following options:

```json
{
    // ...
    "article": {
        "title": {
            "class": "first"
        },
        "title-second": {
            "class": "second",
            "parent": "title"
        },
        "title-third": {
            "class": "third",
            "parent": "title"
        }
        // ...
    }
}
```

The extra options title-second and title-third will also be added to the extras field of the article.
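
The appending order can be illustrated with a small Python sketch. The helper `merge_parents` is hypothetical, and joining the parts with a single space is an assumption made here; the documentation does not state which separator, if any, the parser uses.

```python
def merge_parents(options, results, separator=" "):
    """Append each child's result to its parent, in definition order."""
    merged = dict(results)
    for key, option in options.items():
        parent = option.get("parent")
        if parent is None or key not in results:
            continue
        parts = [merged[parent]] if merged.get(parent) else []
        parts.append(results[key])
        # Assumption: parts are joined with a single space.
        merged[parent] = separator.join(parts)
    return merged


options = {
    "title": {"class": "first"},
    "title-second": {"class": "second", "parent": "title"},
    "title-third": {"class": "third", "parent": "title"},
}
results = {
    "title": "First part of title",
    "title-second": "Second part of title",
    "title-third": "Third part of title",
}

merged = merge_parents(options, results)
print(merged["title"])
```

The child results are left in place in this sketch, which mirrors the note that `title-second` and `title-third` also remain available as extra fields.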

`static`

Assign a static string to the specified field. All the other options are omitted.

`skip`

Use this option to skip an element inside `container` that matches some criteria.

You can skip an element based on:

1. the position of that element inside the document
2. a selector that matches that element
3. that element's text
4. both cases 2 and 3

Based on position

Position always starts from zero.

```json
{
    // ...
    "skip": [
        {
            "position": 3 // skip the fourth element
        }
    ]
}
```
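
Zero-based position skipping can be pictured with this short Python sketch. The helper `skip_by_position` and the element names are hypothetical; only the `position` key comes from the options above.

```python
def skip_by_position(elements, rules):
    """Drop every element whose zero-based index appears in a position rule."""
    skipped = {rule["position"] for rule in rules if "position" in rule}
    return [el for index, el in enumerate(elements) if index not in skipped]


# Five placeholder articles; position 3 refers to the fourth one.
kept = skip_by_position(["a0", "a1", "a2", "a3", "a4"], [{"position": 3}])
print(kept)
```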

Based on selector

The selector can be a class, a CSS selector, an XPath or a JS path.

```json
{
    // ...
    "skip": [
        {
            "selector": "div.class1 > div.class2"
        }
    ]
}
```

Based on text

The element's text is sanitized according to these rules:

- HTML decoding (`&gt;` becomes `>`).
- Removal of all consecutive `\n`, `\t` and spaces.

```json
{
    // ...
    "skip": [
        {
            "text": "Text to skip",
            "type": "exact" // or "contains"
        }
    ]
}
```
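
The sanitization and the two match types can be approximated in Python as follows. This is a sketch under assumptions: the helpers `sanitize` and `matches_text` are hypothetical, runs of `\n`, `\t` and spaces are collapsed into a single space, and a missing `type` is treated as `"exact"`. The parser's actual behavior may differ on these points.

```python
import html
import re


def sanitize(text):
    """Decode HTML entities, then collapse runs of newlines, tabs and spaces."""
    decoded = html.unescape(text)
    # Assumption: consecutive whitespace becomes one space, and the ends are trimmed.
    return re.sub(r"[ \t\n]+", " ", decoded).strip()


def matches_text(element_text, rule):
    """Compare the sanitized text against an 'exact' or 'contains' rule."""
    clean = sanitize(element_text)
    if rule.get("type") == "contains":
        return rule["text"] in clean
    return clean == rule["text"]


print(sanitize("a &gt;\n\t  b"))
print(matches_text("  Text to\n skip ", {"text": "Text to skip", "type": "exact"}))
```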

Based on selector and text

This is used to match the text of a specific child element inside the element.

The element's text is sanitized according to these rules:

- HTML decoding (`&gt;` becomes `>`).
- Removal of all consecutive `\n`, `\t` and spaces.

```json
{
    // ...
    "skip": [
        {
            // Skip all articles that have in their title the text "Discovery on Mars"
            "selector": "div.title",
            "text": "Discovery on Mars",
            "type": "contains"
        }
    ]
}
```
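
Combining a selector with a text rule could work roughly as in the Python sketch below. It is illustrative only: `should_skip` is a hypothetical helper, the sample articles are invented, and because `xml.etree.ElementTree` does not understand CSS selectors, the XPath `.//div[@class='title']` stands in for `div.title` (it matches only when the class attribute is exactly `title`).

```python
import xml.etree.ElementTree as ET


def should_skip(article, xpath, text):
    """Skip the article when the selected child's text contains the given text."""
    child = article.find(xpath)
    if child is None:
        return False
    # Gather all nested text and normalize whitespace before comparing.
    child_text = " ".join("".join(child.itertext()).split())
    return text in child_text


# Invented sample articles for this sketch.
ARTICLES = ET.fromstring("""
<div class="the-container">
    <div id="article1"><div class="title">New Discovery on Mars today</div></div>
    <div id="article2"><div class="title">Weather report</div></div>
</div>
""")

kept = [
    article.get("id")
    for article in ARTICLES
    if not should_skip(article, ".//div[@class='title']", "Discovery on Mars")
]
print(kept)
```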
