Alternatives to html5-parser

Installing html5-parser correctly is a bit of a pain, and it has caused problems for some users of this package (see #196 for making it optional). Much of the code here was originally written almost a decade ago, when html5-parser seemed to be the only solid and fast HTML5 parser in Python. Moving from lxml to an HTML5-compliant parser was a huge and noticeable improvement for analysts at EDGI, so it’s important that we maintain that. Are there other viable options now?

The two mature(ish) options seem to be [Selectolax](https://selectolax.readthedocs.io) (based on Lexbor, written in C and Cython) and [Markupever](https://awolverp.github.io/markupever/) (based on html5ever, Servo’s parser, written in Rust). I hacked up html5-parser’s benchmark to do a quick and simple comparison: https://gist.github.com/Mr0grog/63667aec279f2035c036501bbd876d1a

- **Selectolax** is 3-5 times faster than html5-parser! That’s pretty incredible.
- **Markupever** seems to parse about twice as fast as html5-parser, but walking the DOM is only about half as fast. Overall, the two are in a similar ballpark.
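
For anyone who wants to sanity-check numbers like these locally, here is a rough sketch of a parse-only timing harness. It is not the benchmark from the gist: `time_parser` and `available_parsers` are hypothetical helper names, the synthetic document is made up, and it only measures parsing (not walking the DOM). Parsers that are not installed are skipped.

```python
import time


def time_parser(parse, html, repeat=5):
    """Return the best wall-clock time (seconds) over `repeat` calls of parse(html)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        parse(html)
        best = min(best, time.perf_counter() - start)
    return best


def available_parsers():
    """Map a label to a parse callable for each parser that can be imported."""
    parsers = {}
    try:
        from html5_parser import parse as html5_parse
        parsers["html5-parser"] = html5_parse
    except ImportError:
        pass
    try:
        # Selectolax's Lexbor backend; constructing the object parses the document.
        from selectolax.lexbor import LexborHTMLParser
        parsers["selectolax (lexbor)"] = LexborHTMLParser
    except ImportError:
        pass
    return parsers


if __name__ == "__main__":
    # Synthetic input (an assumption); real pages will behave differently.
    html = "<!doctype html><title>t</title>" + "<p>Hello <b>world</b></p>" * 10_000
    for name, parse in available_parsers().items():
        print(f"{name}: {time_parser(parse, html) * 1000:.1f} ms")
```

Taking the best of several runs reduces noise from other processes, but a real comparison should use representative pages rather than a repeated snippet.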

The big downside with both of these is that, unlike html5-parser, they return their own, entirely non-standard tree representations. html5-parser returns Python’s standard ElementTree format, which is IMO terrible (it’s why so many people use BeautifulSoup), but it does mean the output is interchangeable with other tools, e.g. html5lib. Working with the trees from Selectolax or Markupever requires different, custom code for each. There is a project called [domselect](https://github.com/lorien/domselect) that provides a standard interface for both lxml (and therefore, I *think*, ElementTree) and Selectolax, but it doesn’t support Markupever. It also probably brings some of its own overhead.
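
To make the "different, custom code" point concrete, here is a small sketch of the same task (collecting link targets and text) written once against an ElementTree-style tree and once against Selectolax. The function names are hypothetical and not from this package; Markupever is left out because I'm not confident enough of its API to write it from memory. The optional parsers are imported lazily, so the ElementTree version runs with only the standard library.

```python
import xml.etree.ElementTree as ET

# Well-formed sample so the stdlib XML parser can stand in for an HTML parser here.
SAMPLE = '<html><body><a href="/one">One</a><p><a href="/two">Two</a></p></body></html>'


def links_from_etree(root):
    """Works on any ElementTree-style root (stdlib, lxml, or html5-parser's output)."""
    return [(a.get("href"), "".join(a.itertext())) for a in root.iter("a")]


def links_from_html5_parser(html):
    """html5-parser returns an lxml tree by default, so the etree code is reused as-is."""
    from html5_parser import parse  # optional dependency, imported lazily
    return links_from_etree(parse(html))


def links_from_selectolax(html):
    """Selectolax has its own node type, so the traversal has to be rewritten."""
    from selectolax.parser import HTMLParser  # optional dependency, imported lazily
    tree = HTMLParser(html)
    return [(a.attributes.get("href"), a.text()) for a in tree.css("a")]


if __name__ == "__main__":
    print(links_from_etree(ET.fromstring(SAMPLE)))
```

Any code in this package that walks the tree directly would need a rewrite like `links_from_selectolax`, or a wrapper layer such as domselect, before switching parsers.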

Both have been around for several years, but are pre-1.0. That also describes html5-parser, though, so I don’t think it’s a big deal. They’ve been around long enough that they seem likely to be at least as well supported as html5-parser.

Alternatives to html5-parser #222
