6 changes: 6 additions & 0 deletions README.md
@@ -9,6 +9,12 @@ and Perl's [Web::Scraper](http://search.cpan.org/~miyagawa/Web-Scraper-0.38/),
and provides a declarative, monadic interface on top of the robust
HTML parsing library [TagSoup](http://hackage.haskell.org/package/tagsoup)

Performance
-----------

Scraping all `h2` elements from the 1.6 MiB Wikipedia article `New York City`:

- scalpel 0.6.2.2: 448 ms, peak memory: 467 MB
- Rust's scraper 0.23.1: 6.5 ms, peak memory: 3 MB

Quickstart
----------

22 changes: 19 additions & 3 deletions scalpel-core/benchmarks/Main.hs
@@ -2,11 +2,13 @@

import Text.HTML.Scalpel.Core

import Control.Applicative ((<$>))
import Control.Monad (replicateM_)
import Criterion.Main (bgroup, bench, defaultMain, nf)
import Data.Foldable (foldr')
import Criterion.Measurement
import Criterion.Measurement.Types (Measured(..))
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Text.HTML.TagSoup as TagSoup


@@ -15,8 +17,14 @@ main = do
let nested100 = makeNested 100
let nested1000 = makeNested 1000
let nested10000 = makeNested 10000
-- permalink: https://en.wikipedia.org/w/index.php?title=New_York_City&oldid=1292263955
wikipediaArticle <- T.readFile "benchmarks/wikipedia-new-york-city.html"
measureMemory wikipediaArticle
defaultMain [
bgroup "nested" [
bgroup "all h2 on 1.6 MiB Wikipedia article `New York City`" [
bench "timings" (nf (scrapeStringLike wikipediaArticle) (texts "h2"))
]
, bgroup "nested" [
bench "100" $ nf sumListTags nested100
, bench "1000" $ nf sumListTags nested1000
, bench "10000" $ nf sumListTags nested10000
@@ -31,7 +39,7 @@ main = do
, bench "100" $ nf (manySelectNodes 100) nested1000
, bench "1000" $ nf (manySelectNodes 1000) nested1000
]
]
]

makeNested :: Int -> [TagSoup.Tag T.Text]
makeNested i = TagSoup.parseTags
@@ -55,3 +63,11 @@ manySelectNodes i testData = flip scrape testData
$ text
$ foldr' (//) (tagSelector "tag")
$ replicate (i - 1) (tagSelector "tag")

measureMemory :: T.Text -> IO ()
measureMemory t = do
m <- measure (nf (scrapeStringLike t) (texts "h2")) 1
let pma = (show . measPeakMbAllocated . fst) m
Owner:

Looks like this was introduced in criterion 1.6.

Could you try adding that to extra-deps in stack.yaml? If that doesn't work, let's just submit as is and I'll fiddle with the GitHub Actions workflow to exclude benchmarks on older resolvers.

Author:


Both measPeakMbAllocated and measure come from criterion-measurement: https://hackage.haskell.org/package/criterion-measurement-0.2.3.0/docs/Criterion-Measurement-Types.html
I've already added that dependency under the bench section of scalpel-core.cabal.

Did I miss something? Let me know if…

Owner:


The CI is failing on older stack resolvers that have a version of criterion-measurement before measPeakMbAllocated was introduced.

I think you can fix those by adding an extra-dep with a more recent version, e.g. 0.2.3.0, similar to here (I don't think the sha is actually necessary).
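For reference, a minimal sketch of what that extra-dep entry could look like in stack.yaml, assuming the version suggested above and omitting the sha256 annotation the owner notes is probably unnecessary:

```yaml
# stack.yaml (sketch) -- pin a newer criterion-measurement so that
# measPeakMbAllocated is available on older resolvers
extra-deps:
  - criterion-measurement-0.2.3.0
```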

putStrLn "running memory test"
putStrLn " scraping all h2 on 1.6 MiB Wikipedia article `New York City`"
putStrLn $ " peak memory allocated: " ++ pma ++ " MiB"