Description
Installing html5-parser correctly is a bit of a pain, and it has caused problems for some users of this package (see #196 for making it optional). Much of the code here was originally written almost a decade ago, when html5-parser seemed to be the only solid and fast HTML5 parser in Python. Moving from lxml to an HTML5-compliant parser was a huge and noticeable improvement for analysts at EDGI, so it’s important that we maintain that. Are there other viable options now?
The two mature(ish) options seem to be Selectolax (based on Lexbor, written in C and Cython) and Markupever (based on html5ever, written in Rust — this is Servo’s parser). I hacked up html5-parser’s benchmark to do a quick and simple comparison: https://gist.github.com/Mr0grog/63667aec279f2035c036501bbd876d1a
- Selectolax is 3-5 times faster than html5-parser! That’s pretty incredible.
- Markupever seems to parse about twice as fast as html5-parser, but walking the DOM is only about half as fast. Overall, they are in a similar ballpark.
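For anyone who wants to sanity-check numbers like these without the full benchmark, here is a minimal sketch of a relative-speed comparison. It is not the code from the gist: `best_time`, `compare`, and `stdlib_parse` are hypothetical names, and the standard library's `html.parser` is used only as a stand-in so the example runs without any third-party packages. In practice you would swap in callables that wrap whichever parsers you are comparing.

```python
import time
from html.parser import HTMLParser
from typing import Callable, Dict


def best_time(func: Callable[[str], object], html: str, repeats: int = 5) -> float:
    """Return the fastest wall-clock time (in seconds) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(html)
        best = min(best, time.perf_counter() - start)
    return best


def compare(
    parsers: Dict[str, Callable[[str], object]], html: str, baseline: str
) -> Dict[str, float]:
    """Return each parser's speed relative to `baseline` (2.0 means twice as fast)."""
    times = {name: best_time(func, html) for name, func in parsers.items()}
    return {name: times[baseline] / elapsed for name, elapsed in times.items()}


def stdlib_parse(html: str) -> None:
    # Stand-in parser (assumption): replace with a wrapper around the real
    # parser under test, e.g. a function that calls that library's parse API.
    parser = HTMLParser()
    parser.feed(html)
    parser.close()
```

Calling `compare({"baseline": stdlib_parse, "candidate": other_parse}, html, baseline="baseline")` returns a ratio per parser, so a value of 3.0 for `candidate` would correspond to "3 times faster" in the sense used above. Taking the best of several runs reduces noise, but this measures parsing only, not walking the resulting tree.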
The big downside with both of these is that, unlike html5-parser, they return their own, entirely non-standard tree representations. html5-parser returns Python’s standard ElementTree format, which is IMO terrible (that’s why so many people use BeautifulSoup), but it does mean it’s interchangeable with other tools, e.g. html5lib. Working with the trees from Selectolax or Markupever each requires different, custom code. There is a project called domselect that provides a standard interface for both lxml (and therefore, I think, ElementTree) and Selectolax, but it doesn’t support Markupever. It also probably brings some of its own overhead.
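To make the "custom code per tree type" cost concrete, here is a rough sketch of what a thin adapter layer could look like. This is not domselect's API or anything that exists in this package: `NodeAdapter`, `ElementTreeNode`, and `collect_tags` are hypothetical names, and only the ElementTree side is shown because it can be written against the standard library. A Selectolax or Markupever backend would need its own adapter class exposing the same three methods.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Protocol


class NodeAdapter(Protocol):
    """Hypothetical minimal interface that the rest of the code would depend on."""

    def tag(self) -> str: ...
    def text(self) -> str: ...
    def children(self) -> Iterator["NodeAdapter"]: ...


class ElementTreeNode:
    """Adapter for ElementTree-style elements (what html5-parser-style trees need)."""

    def __init__(self, element: ET.Element) -> None:
        self._element = element

    def tag(self) -> str:
        return self._element.tag

    def text(self) -> str:
        # Concatenate all text inside this element, including descendants.
        return "".join(self._element.itertext())

    def children(self) -> Iterator["ElementTreeNode"]:
        return (ElementTreeNode(child) for child in self._element)


def collect_tags(node: NodeAdapter) -> List[str]:
    """Depth-first list of tag names, written only against the adapter interface."""
    tags = [node.tag()]
    for child in node.children():
        tags.extend(collect_tags(child))
    return tags
```

Code like `collect_tags` would then work unchanged across backends, at the price of an extra wrapper object per node, which is the kind of overhead an abstraction layer like this is likely to add.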
Both have been around for several years, but are pre-1.0. That also describes html5-parser, though, so I don’t think it’s a big deal. They’ve been around long enough that they seem like they’ll be at least as well supported.