
evalPrimOp is a bottleneck due to linear search #8

@sgraf812


While the implementation of, e.g., ...ByteArray.evalPrimOp is rather direct and elegant at the moment

type Name = BS8.ByteString

evalPrimOp :: PrimOpEval -> Name -> [Atom] -> Type -> Maybe TyCon -> M [Atom]
evalPrimOp fallback op args t tc = case (op, args) of

  -- newByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 False
    pure [MutableByteArray baIdx]


  -- newPinnedByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newPinnedByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 True
    pure [MutableByteArray baIdx]

it yields rather slow code. That is because GHC doesn't do the "obvious" trie-based match optimisation, which means we end up with a linear chain of comparisons

if op == "newByteArray#" then ...
else if op == "newPinnedByteArray#" then ...
...

in Core.

I took a profile (on nofib's bernoulli, if that matters) and ...ByteArray.evalPrimOp takes about 10% of time and allocation. Here is an excerpt of the profile:

COST CENTRE                          MODULE                            SRC                                                       %time %alloc

evalPrimOp                           Stg.Interpreter.PrimOp.ByteArray  lib/Stg/Interpreter/PrimOp/ByteArray.hs:(71,1)-(884,28)     9.4   10.9
evalPrimOp                           Stg.Interpreter.PrimOp.Addr       lib/Stg/Interpreter/PrimOp/Addr.hs:(23,1)-(357,28)          6.5    8.2
evalStackContinuation.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(355,74)-(394,35)                    4.4    5.8
lookupEnvSO                          Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(630,1)-(648,21)                4.2    2.2
builtinStgEval                       Stg.Interpreter                   lib/Stg/Interpreter.hs:(154,1)-(201,103)                    3.2    2.9
evalPrimOp                           Stg.Interpreter.PrimOp.Float      lib/Stg/Interpreter/PrimOp/Float.hs:(17,1)-(120,28)         2.8    2.9
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(497,45)-(502,23)                    2.8    3.4
evalPrimOp                           Stg.Interpreter.PrimOp.Double     lib/Stg/Interpreter/PrimOp/Double.hs:(17,1)-(127,28)        2.7    3.0
evalExpr                             Stg.Interpreter                   lib/Stg/Interpreter.hs:(423,1)-(533,93)                     2.5    0.6
compare                              Stg.Syntax                        lib/Stg/Syntax.hs:(30,3)-(32,12)                            2.4    0.0
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(504,37)-(510,27)                    2.4    1.2
setInsert                            Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(776,1)-(778,36)                2.1    0.0
evalPrimOp                           Stg.Interpreter.PrimOp.Int        lib/Stg/Interpreter/PrimOp/Int.hs:(25,1)-(149,28)           2.0    2.1
evalPrimOp                           Stg.Interpreter.PrimOp.SmallArray lib/Stg/Interpreter/PrimOp/SmallArray.hs:(27,1)-(157,28)    1.9    1.9

I think a bit of focus on optimising evalPrimOp may well speed up the interpreter by 50%. One way to do so would be to use a HashMap or a trie for the lookup instead of the linear chain.
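A minimal sketch of what a table-driven dispatch could look like (toy types and names like primOpTable are illustrative, not the project's actual API; HM is Data.HashMap.Strict from unordered-containers):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BS8
import qualified Data.HashMap.Strict as HM  -- unordered-containers

type Name = BS8.ByteString

-- Simplified stand-ins for the interpreter's Atom type.
data Atom = IntV Int | MutableByteArray Int deriving Show

-- Built once as a top-level CAF, so every primop call costs one hash
-- plus a bucket probe instead of O(n) ByteString comparisons.
primOpTable :: HM.HashMap Name ([Atom] -> Maybe [Atom])
primOpTable = HM.fromList
  [ ("newByteArray#",       \args -> case args of
        [IntV size, _s] -> Just [MutableByteArray size]
        _               -> Nothing)
  , ("newPinnedByteArray#", \args -> case args of
        [IntV size, _s] -> Just [MutableByteArray size]
        _               -> Nothing)
  ]

evalPrimOp :: Name -> [Atom] -> Maybe [Atom]
evalPrimOp op args = HM.lookup op primOpTable >>= ($ args)
```

The per-op handlers would of course run in M and take the fallback, type and TyCon arguments as in the current signature; the point is only that the dispatch on the Name becomes a single lookup.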

Why optimise anyway? Because at the moment a single run of nofib's bernoulli benchmark takes about half an hour, while the compiled program takes just 0.1s. That's quite a deal-breaker for an exhaustive benchmark run of all 11* benchmarks.
