Threading performance suboptimal #1070
Comments
OK, on the big system:
so GC and compilation are not the source of the bad parallel efficiency...
BLAS default threading is sometimes really bad; if you haven't already, disable it. And yes, I agree our threading performance is really poor, which is why we hardly use it.
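For reference, a minimal sketch of what disabling BLAS and FFTW threading looks like from Julia (where exactly these calls go in a DFTK script is up to the user):

```julia
using LinearAlgebra
using FFTW

# Pin the BLAS library (OpenBLAS/MKL) to a single thread so it does not
# oversubscribe the cores used by the outer Julia threads.
BLAS.set_num_threads(1)

# Same for FFTW's internal threading.
FFTW.set_num_threads(1)
```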
This is with all threading (DFTK, BLAS, FFTW) disabled; it's just a threaded loop on top of DFTK. So it's not really a DFTK issue, more of a Julia one, but I'm really not sure what's going on.
People on Slack pointed me to https://discourse.julialang.org/t/inconsistent-cpu-utilisation-in-threads-loops/110512/12. Basically, even if it reports 5% GC, it can be much more in practice. I've tried turning off the GC entirely and that does something weird (CPU usage goes from 10 to 0.5), but it's probably a sign that this is GC-related. So either we reduce allocations (this particular code uses direct minimization, which is possibly not that well optimized) or I switch to not using threads.
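A minimal sketch of how to check GC overhead for a single call with `@timed`; `run_one_case` is a hypothetical stand-in for one parameter point, not a DFTK function:

```julia
# Hypothetical single-parameter computation; replace with the actual DFTK call.
run_one_case(p) = sum(abs2, randn(1000) .* p)

# @timed returns elapsed time, allocated bytes and GC time for the call,
# which gives a rough idea of how allocation-heavy the code is.
stats = @timed run_one_case(0.5)
println("time = ", stats.time, " s, GC = ", stats.gctime, " s, ",
        "allocated = ", stats.bytes, " bytes")

# Turning the GC off entirely (the experiment mentioned above) would be:
# GC.enable(false)
```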
I think that would also be beneficial for GPU use cases. I don't think we are careful in this regard in the DM code at all right now. Maybe this changes when the ManOpt integration is ready.
I've solved this by using Distributed with pmap instead of threads, which is convenient enough.
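A minimal sketch of the `Distributed`/`pmap` pattern described here, assuming one worker per run; `run_one_case` is again a hypothetical placeholder for the actual DFTK computation:

```julia
using Distributed

addprocs(16)  # one worker process per core; adjust to the machine

@everywhere begin
    using DFTK
    # Placeholder for computing one parameter point with DFTK; each worker is
    # a separate process, so there is no shared GC between runs.
    function run_one_case(p)
        # ... set up model and basis, run the calculation for parameter p ...
        return p  # placeholder result
    end
end

params = range(0.1, 1.0; length=32)
results = pmap(run_one_case, params)
```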
OK, good to know!
Not sure what to do about it, but I'm recording it here for future reference. I'm trying to do relatively large-scale computations (~1 h each run), sweeping across a large set of parameters, so I'm just wrapping the DFTK calls in a `@threads for` loop and disabling DFTK threading inside. `top` reports 2-4 CPU usage when actually computing (but 16 CPU for the init stage). On a run (with smaller parameters) I got the following output (times 16), which I'm not sure how to interpret.
Probably the solution for optimal performance in this use case is to use distributed or MPI, but threading is simple to use...
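For context, a minimal sketch of the outer `@threads for` pattern described in this issue, with BLAS/FFTW threading disabled inside; `run_one_case` is a hypothetical placeholder for one DFTK run:

```julia
using Base.Threads
using LinearAlgebra, FFTW

# Disable inner-level threading so the outer loop is the only parallel layer.
BLAS.set_num_threads(1)
FFTW.set_num_threads(1)

# Hypothetical single-parameter computation standing in for one DFTK run.
run_one_case(p) = sum(abs2, randn(1000) .* p)

params = range(0.1, 1.0; length=32)
results = Vector{Float64}(undef, length(params))

# Each iteration is independent; with allocation-heavy code the shared GC can
# still serialize the threads, which is the behaviour reported in this issue.
@threads for i in eachindex(params)
    results[i] = run_one_case(params[i])
end
```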