Starting multiple Thunks effectively #333

Closed
DrChainsaw opened this issue Feb 22, 2022 · 2 comments · Fixed by #334
@DrChainsaw (Contributor):

As asked for in #331.

In FileTrees you start with a FileTree where some nodes in the tree are Thunks, and we'd like to compute them in parallel. The current approach is to collect them in an array and splat them into a single delayed call, but this is not so good for performance, as the tree can have a lot of values.

# Note using 0.14.1 because 0.14.3 is a bit different
(Dagger) pkg> status
     Project Dagger v0.14.1
      Status `E:\Programs\julia\.julia\dev\Dagger\Project.toml`

using Dagger, Distributed

addprocs(; exeflags="--project");

@everywhere using Dagger, Distributed

function testfun(n)
    @time daggerres = collect(delayed(vcat)([delayed(+)(i, 1) for i in 1:n]...))
    @time distres = pmap(i -> i + 1, 1:n)
    return daggerres, distres
end

# First call we get compilation overhead for both Dagger and pmap
testfun(4)
# 16.981486 seconds (18.41 M allocations: 997.734 MiB, 3.62% gc time, 60.71% compilation time)
#  2.639290 seconds (918.68 k allocations: 49.862 MiB, 0.77% gc time, 40.18% compilation time)
# ([2, 3, 4, 5], [2, 3, 4, 5])

# Now there is no compilation time
testfun(4)
#  0.034817 seconds (14.17 k allocations: 834.211 KiB)
#  0.855902 seconds (695 allocations: 34.938 KiB)
# ([2, 3, 4, 5], [2, 3, 4, 5])

# But if we change the size the dagger version needs to be recompiled :(
testfun(5)
#  0.583187 seconds (731.57 k allocations: 40.110 MiB, 5.19% gc time, 92.81% compilation time)
#  0.024653 seconds (415 allocations: 22.359 KiB)
#([2, 3, 4, 5, 6], [2, 3, 4, 5, 6])

# And it scales kinda badly
testfun(1000);
# 14.335327 seconds (9.30 M allocations: 502.337 MiB, 1.45% gc time, 51.53% compilation time)
#  0.342966 seconds (62.32 k allocations: 2.515 MiB)

# The operation itself is really fast, of course; not sure why @time reported ~50% compilation time above
testfun(1000);
#  0.804885 seconds (2.19 M allocations: 103.171 MiB, 7.19% gc time, 12.20% compilation time)
#  0.237457 seconds (62.56 k allocations: 2.455 MiB)

testfun(2000);
# 76.037216 seconds (18.18 M allocations: 1019.668 MiB, 0.68% gc time, 97.74% compilation time)
# 0.464703 seconds (124.33 k allocations: 4.741 MiB)

testfun(2000);
#  1.996002 seconds (4.40 M allocations: 215.335 MiB, 13.17% gc time)
#  0.875414 seconds (147.93 k allocations: 5.685 MiB)

# So close, yet so far :)
testfun(2001);
# 78.877148 seconds (18.27 M allocations: 1.001 GiB, 0.64% gc time, 97.54% compilation time)
#  0.436966 seconds (160.21 k allocations: 6.884 MiB, 14.65% gc time)

testfun(2001);
#  1.649533 seconds (4.36 M allocations: 206.494 MiB, 3.84% gc time)
#  0.957735 seconds (147.96 k allocations: 5.424 MiB)
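One way to sidestep the wide splat (a hypothetical workaround sketch, not what FileTrees currently does) is to combine results pairwise, so every call site has a fixed arity of two and only one specialization needs compiling regardless of n. With Dagger, the combiner `f` would be something like `delayed(vcat)`; shown here on plain values:

```julia
# Pairwise tree reduction: the combiner `f` is always called with exactly
# two arguments, so the call arity never changes with the input length.
function treereduce(f, xs)
    length(xs) == 1 && return xs[1]
    mid = length(xs) ÷ 2
    return f(treereduce(f, xs[1:mid]), treereduce(f, xs[mid+1:end]))
end

treereduce(vcat, [[i + 1] for i in 1:4])  # == [2, 3, 4, 5]
```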
@DrChainsaw (Contributor, Author):

Part 2: Here is the 0.14.3 performance.

The difference was much bigger on the other machine I tried (the one used in shashi/FileTrees.jl#63 (comment)), so perhaps this is not something worth bothering about:

testfun(4);
# 19.902917 seconds (19.41 M allocations: 1.029 GiB, 2.37% gc time, 54.35% compilation time)
#  2.539799 seconds (919.27 k allocations: 49.906 MiB, 1.09% gc time, 34.90% compilation time)

testfun(4)
#  3.941361 seconds (39.94 k allocations: 2.120 MiB, 1.43% compilation time)
#  0.805246 seconds (535 allocations: 29.375 KiB)


testfun(1000);
# 11.674108 seconds (9.43 M allocations: 532.586 MiB, 1.67% gc time, 50.98% compilation time)
#  0.743769 seconds (51.65 k allocations: 2.042 MiB)

testfun(1000);
#  1.773273 seconds (1.70 M allocations: 99.279 MiB, 3.36% gc time)
#  0.352158 seconds (61.14 k allocations: 2.329 MiB)

testfun(2000);
# 71.023676 seconds (17.19 M allocations: 1011.429 MiB, 0.89% gc time, 96.29% compilation time)
#  0.521013 seconds (113.83 k allocations: 4.296 MiB)

testfun(2000);
#  3.602684 seconds (3.44 M allocations: 206.856 MiB, 7.15% gc time)
#  0.498256 seconds (118.14 k allocations: 4.507 MiB)

@jpsamaroo (Member):

Thanks to the glorious magic of SnoopCompile, I've found a ton of expensive inference triggers that are trivial to fix (mostly due to calling into broadcast). I'll try to crush as many as possible with this MWE, and then will post a PR. Thanks for the report and excellent reproducer!
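For reference, the kind of SnoopCompile workflow that surfaces such inference triggers looks roughly like this (a sketch; exact results depend on the Julia and SnoopCompile versions, and `testfun` is the MWE from above):

```julia
# Record inference activity while running the workload once.
using SnoopCompileCore
tinf = @snoopi_deep testfun(1000)

# Analyze the recording: list the calls that forced fresh inference,
# then group them by the method that triggered them.
using SnoopCompile
itrigs = inference_triggers(tinf)
mtrigs = accumulate_by_source(Method, itrigs)
```

Each trigger points at a call site where inference had to start over at runtime; fixing the worst offenders (e.g. type-unstable broadcasts) is what removes the per-arity recompilation cost.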
