Very slow parallel performance on large file trees #63
I can reproduce (although I get about half the times). Seems like something is cursed here:

```julia
tree = maketree("root" => string.(1:3600));

addprocs(exeflags=["--threads=1", "--project"])
# 4-element Vector{Int64}:
#  2
#  3
#  4
#  5

@everywhere using FileTrees, Distributed
using FileTrees.Dagger

function dtest(thunks)
    @time chunks = compute(delayed((xs...) -> [xs...]; meta=true)(thunks...))
    @time asyncmap(collect, chunks)
end
```
```julia
# First runs excluded as I moved things around and cba to restart.
# First run took about 300 seconds for me.

# What we want to compute
thunks_okish = [delayed(+)(2,2) for _ in 1:3600];

# Not great, but better than below
@time dtest(thunks_okish);
#12.019530 seconds (2.65 M allocations: 187.081 MiB, 0.79% gc time, 1.34% compilation time)
#0.174506 seconds (294.31 k allocations: 12.355 MiB)
#12.221474 seconds (3.01 M allocations: 203.461 MiB, 0.77% gc time, 1.54% compilation time)

# Same thing from the FileTree
thunks_notok = values(FileTrees.load(f -> 2+2, tree, lazy=true));
@time dtest(thunks_notok);
#47.549520 seconds (80.41 M allocations: 2.002 GiB, 1.34% gc time, 0.05% compilation time)
#0.215092 seconds (267.86 k allocations: 11.501 MiB)
#47.764741 seconds (80.68 M allocations: 2.013 GiB, 1.33% gc time, 0.05% compilation time)

# Maybe the anonymous function or some type instability with File is the problem?
thunks_file = [delayed(f -> 2+2)(File(nothing, "root/$i")) for i in 1:3600];

# Does not seem like it
@time dtest(thunks_file);
#12.157954 seconds (2.57 M allocations: 182.432 MiB, 0.94% gc time, 0.34% compilation time)
#0.170773 seconds (293.11 k allocations: 12.300 MiB)
#12.328912 seconds (2.87 M allocations: 194.741 MiB, 0.92% gc time, 0.33% compilation time)

# What if we add the whole tree as the parent?
thunks_filetree = [delayed(f -> 2+2)(File(tree, "root/$i")) for i in 1:3600];

# That makes the difference!
@time dtest(thunks_filetree);
#48.115307 seconds (80.62 M allocations: 2.012 GiB, 2.14% gc time, 0.04% compilation time)
#0.190724 seconds (291.17 k allocations: 12.188 MiB)
#48.306223 seconds (80.92 M allocations: 2.024 GiB, 2.13% gc time, 0.04% compilation time)
```

Could it be so that the time is spent hauling around the whole tree? Here is a workaround:

```julia
tree_names = FileTrees.load(name, tree);
@time exec(mapvalues(filename -> 2+2, tree_names; lazy=true));
#13.425377 seconds (3.93 M allocations: 260.063 MiB, 0.92% gc time, 3.84% compilation time)
```

Fwiw, I first accidentally tried to reproduce with Dagger 0.14.1 and then I got much better times for subsequent runs:

```julia
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#266.588155 seconds (48.29 M allocations: 3.034 GiB, 0.78% gc time, 0.29% compilation time)
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#2.548957 seconds (3.03 M allocations: 184.613 MiB, 2.99% gc time, 8.65% compilation time)
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#2.948732 seconds (3.65 M allocations: 220.772 MiB, 2.44% gc time, 18.68% compilation time)
@time exec(FileTrees.load(f -> 2+2, tree));
#0.776767 seconds (1.06 M allocations: 63.422 MiB, 63.10% compilation time)
```

@jpsamaroo: Any hints here? Is there a performance regression in Dagger? As a side note, is there a way to have …
I didn't realize the whole tree was being passed around between the workers! That would probably explain it. But does this mean it has nothing to do with the NFS?
I didn't realize that either at first :) I think this is a bit of an oopsie. Note that it only seems to happen when the whole tree is attached as the parent of each file.

The expression you timed does not touch the file system. The only interaction was when you created the tree (where it traversed the directory structure without looking at the file contents). After that the whole structure is an in-memory copy, and you would only access the file system if you actually used the name of a `File` to read its contents. With the disclaimer that my familiarity with the design of FileTrees is limited.
Right, that makes sense, but surely it has to query the file system at least to build the `FileTree` in the first place?
Yes, that happens locally in this call:

```julia
tree = FileTree("my/path/here")
```

After that you have an in-memory tree structure which just happens to match the file structure (as it was created from it), but is in no way connected to it. For example, you can remove/add nodes in the in-memory tree without anything happening on disk.

Didn't mean to imply that there can be no issues with NFS, just that the poor performance in the OP does not seem to be related to NFS.
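To illustrate the disconnect from disk, here is a minimal sketch using `maketree` (which builds a `FileTree` purely in memory, just like the trees in the timings above; the names `"a"` and `"b"` are made up for the example):

```julia
using FileTrees

# maketree constructs a FileTree entirely in memory; no directories
# or files are created on the file system.
t = maketree("root" => ["a", "b"])

# The tree can be inspected and transformed without touching disk.
# `files` lists the file nodes and `name` gives each node's name.
name.(files(t))
```

The only point where the file system is involved is the `FileTree("my/path/here")` constructor, which walks the directory once to build the structure.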
There should not be anything unexpected happening here. The function supplied in `FileTrees.load` is applied per file to produce its value.

The "passing the whole tree" concern is just that the whole tree might be big, and transferring it when the thunks are sent between processes could be expensive. When loading there is ofc not a lot of data attached to each file, but there could be something about serializing/deserializing a complex data structure even if it is not very sizey (I have no idea tbh). As seen in the timings above …
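To put a rough number on the "passing the whole tree" concern, a small sketch (the `serialized_size` helper is made up for illustration; `File(tree, path)` and `File(nothing, path)` are used exactly as in the timings above) comparing the serialized footprint of a `File` that carries the whole tree as its parent against one that carries none:

```julia
using Serialization, FileTrees

# Build the same kind of in-memory tree as in the benchmarks.
tree = maketree("root" => string.(1:3600))

# Hypothetical helper: serialize a value and count the bytes produced.
serialized_size(x) = (io = IOBuffer(); serialize(io, x); length(take!(io)))

# A File whose parent is the whole 3600-node tree vs. a parentless one.
println("with whole tree as parent: ", serialized_size(File(tree, "root/1")), " bytes")
println("without parent:            ", serialized_size(File(nothing, "root/1")), " bytes")
```

If the first number dwarfs the second, every thunk that closes over a tree-parented `File` is dragging a copy of the whole tree to the worker.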
Ok, I see. What about the values in the tree, then?

Edit: I guess that actually answers this question 😅
Values might be moved as the result of applying further transformations of the data (e.g. through `mapvalues`).
I'm on vacation right now and don't have time to deeply investigate, but I will make a note to look into this when I get back on Monday!
I can reproduce this, even with JuliaParallel/Dagger.jl#334. Performance is only bad with workers; using threads works fine, although there is no significant speedup. The reasons: compile time and GC time. Seems like we're spending 50%+ of the time compiling code, and 10-30% doing GC. (Note that this isn't easy to see with workers, which is why I often double-check with threads-only.) I've verified with SnoopCompile that most of the inference time is in Dagger, so I'll see how much of it I can fix before looking at GC overhead.
Awesome, thanks a lot! Here are the numbers I get with Dagger 0.14.3 and current master:

```julia
# Dagger 0.14.3
using FileTrees, Distributed
addprocs(exeflags=["--threads=1", "--project"]);
# Note: Different number of files (and also a different machine iirc) compared to above.
tree = maketree("root" => string.(1:2000));
@everywhere using FileTrees, Distributed

@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#34.666125 seconds (47.18 M allocations: 1.800 GiB, 2.27% gc time, 36.52% compilation time)
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#13.053516 seconds (26.61 M allocations: 726.402 MiB, 2.46% gc time, 2.00% compilation time)

# Dagger#master
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#34.271587 seconds (45.26 M allocations: 1.691 GiB, 2.09% gc time, 35.65% compilation time)
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true));
#12.972358 seconds (26.36 M allocations: 705.501 MiB, 2.63% gc time, 3.00% compilation time)
```

Do you have any thoughts on the performance implications of passing the whole tree around?

```julia
julia> using Serialization

julia> iob = PipeBuffer();

julia> @time serialize(iob, tree);
  0.669635 seconds (1.32 M allocations: 76.045 MiB, 5.99% gc time, 92.14% compilation time)

julia> @time deserialize(iob);
  0.375035 seconds (614.09 k allocations: 32.761 MiB, 3.98% gc time, 99.05% compilation time)
```
I noticed that lazy computations on my `FileTree` were really, really slow (and always slower than sequential evaluation), so I did a quick test, where the path is replaced with some file tree (in my case it is a tree consisting of about 3600 files). This shouldn't matter that much for the test, I would think.

I run lazy `FileTrees.load` with a simple throwaway operation to check the overhead from using parallel processing. Now, I would fully expect parallelization to not pay off here given the simplicity of the operation. What is shocking (and quite frankly intolerable), however, is that the overhead adds several orders of magnitude to wall-clock execution time. The first attempt took over 10 minutes compared to 1 second for sequential execution. This looks to be precompilation related given the speed-up on the second attempt. However, even then it is 100x slower than sequential evaluation, which I think far exceeds reasonable overhead even for such a simple operation.

I am running this on a remote machine which is using a networked file system, so I suspect that this may also be part of the problem. Am I missing anything or doing something stupid here? Is this a known issue?

EDIT: I updated the title to better reflect the suspected issue, per the feedback of @DrChainsaw.
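The test referred to above was along these lines (a sketch reconstructed from the calls used elsewhere in the thread; the directory path is a placeholder, not the original one):

```julia
using FileTrees, Distributed
addprocs(exeflags=["--threads=1", "--project"])
@everywhere using FileTrees

# Placeholder path; the real tree held roughly 3600 files.
tree = FileTree("path/to/files")

# Lazy load builds thunks that are executed in parallel by exec.
@time exec(FileTrees.load(f -> 2+2, tree, lazy=true))

# Eager load runs the same throwaway operation sequentially.
@time FileTrees.load(f -> 2+2, tree)
```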