@benjaminweb

  • fixes all (temporarily) broken tests
  • fixes (temporarily) missing type annotation for scrapeURL
  • includes a time and memory benchmark that allows comparison with the previous version 0.6.2.2 on a real-world example
  • @fimad might require renaming of any function with Tag to Token -- would that be a breaking change?

benjaminweb changed the title from "switch to html-parse for faster tokenisation (1/10th of time and peak memory allocated)" to "Switching to html-parse for faster tokenisation (1/10th of time and peak memory allocated)" on May 26, 2025
Owner

@fimad fimad left a comment

@fimad might require renaming of any function with Tag to Token -- would that be a breaking change?

Let's not change function names. If we were starting from scratch, maybe it would make sense to call things tokens. But I think "tag" is a reasonable enough term for HTML, even outside the context of TagSoup, that it isn't worth a breaking change just to follow our underlying HTML parser's terminology.
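
For illustration, here is a minimal sketch of how "Tag"-based public names can stay stable while the backing type comes from a tokenizer. The `Token` type below is a simplified, hypothetical stand-in and not html-parse's actual definition; `isTagOpenName` and `tagTexts` are also made-up helper names, not functions from this library.

```haskell
module Main where

-- Simplified stand-in for a tokenizer's output type. This is NOT the real
-- html-parse definition; it is a hypothetical reduction for illustration.
data Token
  = TagOpen String [(String, String)]  -- element name and attributes
  | TagClose String                    -- element name
  | ContentText String                 -- text between tags
  deriving (Eq, Show)

-- The public name stays "Tag" even though the backing type is a token.
type Tag = Token

-- Hypothetical public helper that keeps its "Tag"-based name.
isTagOpenName :: String -> Tag -> Bool
isTagOpenName name (TagOpen n _) = n == name
isTagOpenName _ _ = False

-- Collect the text of every content token in a list of tags.
tagTexts :: [Tag] -> [String]
tagTexts tags = [t | ContentText t <- tags]

main :: IO ()
main = do
  let tags = [TagOpen "a" [("href", "/")], ContentText "home", TagClose "a"]
  print (filter (isTagOpenName "a") tags)
  print (tagTexts tags)
```

With an alias like this, callers keep compiling against the old names, and only the documentation needs to mention the token type of the underlying parser.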

-- | A value of 'Scraper' @a@ defines a web scraper that is capable of consuming
-- a list of 'TagSoup.Tag's and optionally producing a value of type @a@.
type Scraper str = ScraperT str Identity
-- a list of 'HP.Tag's and optionally producing a value of type @a@.
Owner

Should 'HP.Tag' be 'HP.Token'? I see a couple of instances throughout.

Author

Resolved with 69b57d4.
(The remaining matches are entities from HP, that is, Text.HTML.Parser itself.)

@benjaminweb
Author

benjaminweb commented Jun 3, 2025 via email

@fimad
Owner

fimad commented Aug 2, 2025

The CI is currently failing because stack.yaml is missing the html-parse dependency:

extra-deps:
- html-parse-0.2.2.0@sha256:9d5b4069e2b04894918a07fb7d88b0bf609d5aa918e020e2ce2fcb2d957ff487,4480

After adding it, it looks like there are still some build errors in the examples.

@benjaminweb
Author

Added html-parse as a stack extra-dep with c9edff2.
I can't reproduce the build errors using cabal with GHC 9.12.2, running cabal run example-from-documentation from the examples directory. I tested each example, and each one passes with my configuration, so it doesn't look like the code itself is at fault.
How should we proceed?

benjaminweb requested a review from fimad on October 23, 2025, 14:45