Starting multiple Thunks effectively #333

Closed
DrChainsaw opened this issue Feb 22, 2022 · 2 comments · Fixed by #334
@DrChainsaw (Contributor):

As asked for in #331.

In FileTrees you start with a FileTree where some nodes in the tree are Thunks, and we'd like to compute them in parallel. The current approach is to collect them in an array and splat them into a single delayed call, but this is not so good for performance, as the tree can have a lot of values.

# Note using 0.14.1 because 0.14.3 is a bit different
(Dagger) pkg> status
     Project Dagger v0.14.1
      Status `E:\Programs\julia\.julia\dev\Dagger\Project.toml`

using Dagger, Distributed

addprocs(; exeflags="--project");

@everywhere using Dagger, Distributed

function testfun(n)
    @time daggerres = collect(delayed(vcat)([delayed(+)(i, 1) for i in 1:n]...))
    @time distres = pmap(i -> i + 1, 1:n)
    return daggerres, distres
end

# First call we get compilation overhead for both Dagger and pmap
testfun(4)
# 16.981486 seconds (18.41 M allocations: 997.734 MiB, 3.62% gc time, 60.71% compilation time)
#  2.639290 seconds (918.68 k allocations: 49.862 MiB, 0.77% gc time, 40.18% compilation time)
# ([2, 3, 4, 5], [2, 3, 4, 5])

# Now there is no compilation time
testfun(4)
#  0.034817 seconds (14.17 k allocations: 834.211 KiB)
#  0.855902 seconds (695 allocations: 34.938 KiB)
# ([2, 3, 4, 5], [2, 3, 4, 5])

# But if we change the size the dagger version needs to be recompiled :(
testfun(5)
#  0.583187 seconds (731.57 k allocations: 40.110 MiB, 5.19% gc time, 92.81% compilation time)
#  0.024653 seconds (415 allocations: 22.359 KiB)
#([2, 3, 4, 5, 6], [2, 3, 4, 5, 6])

# And it scales kinda badly
testfun(1000);
# 14.335327 seconds (9.30 M allocations: 502.337 MiB, 1.45% gc time, 51.53% compilation time)
#  0.342966 seconds (62.32 k allocations: 2.515 MiB)

# The operation itself is really fast, of course; not sure why @time reported ~50% compilation time above
testfun(1000);
#  0.804885 seconds (2.19 M allocations: 103.171 MiB, 7.19% gc time, 12.20% compilation time)
#  0.237457 seconds (62.56 k allocations: 2.455 MiB)

testfun(2000);
# 76.037216 seconds (18.18 M allocations: 1019.668 MiB, 0.68% gc time, 97.74% compilation time)
# 0.464703 seconds (124.33 k allocations: 4.741 MiB)

testfun(2000);
#  1.996002 seconds (4.40 M allocations: 215.335 MiB, 13.17% gc time)
#  0.875414 seconds (147.93 k allocations: 5.685 MiB)

# So close, yet so far :)
testfun(2001);
# 78.877148 seconds (18.27 M allocations: 1.001 GiB, 0.64% gc time, 97.54% compilation time)
#  0.436966 seconds (160.21 k allocations: 6.884 MiB, 14.65% gc time)

testfun(2001);
#  1.649533 seconds (4.36 M allocations: 206.494 MiB, 3.84% gc time)
#  0.957735 seconds (147.96 k allocations: 5.424 MiB)
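One way to sidestep the wide splat (a hypothetical workaround sketch, not what FileTrees currently does) is to combine results pairwise, so every call site has a fixed arity of two and only one specialization needs compiling regardless of n. With Dagger, the combiner `f` would be something like `delayed(vcat)`; shown here on plain values:

```julia
# Pairwise tree reduction: the combiner `f` is always called with exactly
# two arguments, so the call arity never changes with the input length.
function treereduce(f, xs)
    length(xs) == 1 && return xs[1]
    mid = length(xs) ÷ 2
    return f(treereduce(f, xs[1:mid]), treereduce(f, xs[mid+1:end]))
end

treereduce(vcat, [[i + 1] for i in 1:4])  # == [2, 3, 4, 5]
```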
@DrChainsaw (Contributor, Author):

Part 2: Here is the 0.14.3 performance.

The difference was much bigger on the other machine I tried (the one used in shashi/FileTrees.jl#63 (comment)), so perhaps this is not something worth bothering about:

testfun(4);
# 19.902917 seconds (19.41 M allocations: 1.029 GiB, 2.37% gc time, 54.35% compilation time)
#  2.539799 seconds (919.27 k allocations: 49.906 MiB, 1.09% gc time, 34.90% compilation time)

testfun(4)
#  3.941361 seconds (39.94 k allocations: 2.120 MiB, 1.43% compilation time)
#  0.805246 seconds (535 allocations: 29.375 KiB)


testfun(1000);
# 11.674108 seconds (9.43 M allocations: 532.586 MiB, 1.67% gc time, 50.98% compilation time)
#  0.743769 seconds (51.65 k allocations: 2.042 MiB)

testfun(1000);
#  1.773273 seconds (1.70 M allocations: 99.279 MiB, 3.36% gc time)
#  0.352158 seconds (61.14 k allocations: 2.329 MiB)

testfun(2000);
# 71.023676 seconds (17.19 M allocations: 1011.429 MiB, 0.89% gc time, 96.29% compilation time)
#  0.521013 seconds (113.83 k allocations: 4.296 MiB)

testfun(2000);
#  3.602684 seconds (3.44 M allocations: 206.856 MiB, 7.15% gc time)
#  0.498256 seconds (118.14 k allocations: 4.507 MiB)

@jpsamaroo (Member):

Thanks to the glorious magic of SnoopCompile, I've found a ton of expensive inference triggers that are trivial to fix (mostly due to calling into broadcast). I'll try to crush as many as possible with this MWE, and then will post a PR. Thanks for the report and excellent reproducer!
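For reference, the kind of SnoopCompile workflow that surfaces such inference triggers looks roughly like this (a sketch; exact results depend on the Julia and SnoopCompile versions, and `testfun` is the MWE from above):

```julia
# Record inference activity while running the workload once.
using SnoopCompileCore
tinf = @snoopi_deep testfun(1000)

# Analyze the recording: list the calls that forced fresh inference,
# then group them by the method that triggered them.
using SnoopCompile
itrigs = inference_triggers(tinf)
mtrigs = accumulate_by_source(Method, itrigs)
```

Each trigger points at a call site where inference had to start over at runtime; fixing the worst offenders (e.g. type-unstable broadcasts) is what removes the per-arity recompilation cost.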
