Threading performance suboptimal #1070
Comments
OK, on the big system:
so GC and compilation are not the source of the bad parallel efficiency...
BLAS default threading is sometimes really bad; if you haven't already, disable it. And yes, I agree our threading performance is really poor, which is why we hardly use it.
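For reference, a minimal sketch of what disabling BLAS and FFTW threading looks like from Julia (where exactly these calls go in a DFTK script is up to the user):

```julia
using LinearAlgebra
using FFTW

# Pin the BLAS library (OpenBLAS/MKL) to a single thread so it does not
# oversubscribe the cores used by the outer Julia threads.
BLAS.set_num_threads(1)

# Same for FFTW's internal threading.
FFTW.set_num_threads(1)
```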
This is with all threading (DFTK, BLAS, FFTW) disabled; it's just a threaded loop on top of DFTK. So it's not really a DFTK issue, more of a Julia one, but I'm really not sure what's going on.
People on Slack pointed me to https://discourse.julialang.org/t/inconsistent-cpu-utilisation-in-threads-loops/110512/12. Basically, even if it reports 5% GC, it can be much more in practice. I've tried turning off the GC entirely and that does something weird (CPU usage goes from 10 to 0.5), but it's probably a sign that this is GC-related. So either we reduce allocations (this particular code uses direct minimization, which is possibly not that well optimized) or I switch to not using threads.
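A minimal sketch of how to check GC overhead for a single call with `@timed`; `run_one_case` is a hypothetical stand-in for one parameter point, not a DFTK function:

```julia
# Hypothetical single-parameter computation; replace with the actual DFTK call.
run_one_case(p) = sum(abs2, randn(1000) .* p)

# @timed returns elapsed time, allocated bytes and GC time for the call,
# which gives a rough idea of how allocation-heavy the code is.
stats = @timed run_one_case(0.5)
println("time = ", stats.time, " s, GC = ", stats.gctime, " s, ",
        "allocated = ", stats.bytes, " bytes")

# Turning the GC off entirely (the experiment mentioned above) would be:
# GC.enable(false)
```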
I think that would also be beneficial for GPU use cases. I don't think we are careful in this regard in the DM code at all right now. Maybe this changes when the ManOpt integration is ready.
I've solved this by using Distributed with pmap instead of threads, which is convenient enough.
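A minimal sketch of the `Distributed`/`pmap` pattern described here, assuming one worker per run; `run_one_case` is again a hypothetical placeholder for the actual DFTK computation:

```julia
using Distributed

addprocs(16)  # one worker process per core; adjust to the machine

@everywhere begin
    using DFTK
    # Placeholder for computing one parameter point with DFTK; each worker is
    # a separate process, so there is no shared GC between runs.
    function run_one_case(p)
        # ... set up model and basis, run the calculation for parameter p ...
        return p  # placeholder result
    end
end

params = range(0.1, 1.0; length=32)
results = pmap(run_one_case, params)
```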
OK, good to know!
Not sure what to do about it, but I'm recording it here for future reference. I'm trying to do relatively large-scale computations (~1 h each run), sweeping across a large set of parameters, so I'm just wrapping the DFTK calls in a `@threads for` loop and disabling DFTK threading inside. `top` reports 2-4 CPU usage when actually computing (but 16 CPU for the init stage). On a run (with smaller parameters) I got the following output (times 16), which I'm not sure how to interpret.
Probably the solution for optimal performance in this use case is to use distributed or MPI, but threading is simple to use...
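For context, a minimal sketch of the outer `@threads for` pattern described in this issue, with BLAS/FFTW threading disabled inside; `run_one_case` is a hypothetical placeholder for one DFTK run:

```julia
using Base.Threads
using LinearAlgebra, FFTW

# Disable inner-level threading so the outer loop is the only parallel layer.
BLAS.set_num_threads(1)
FFTW.set_num_threads(1)

# Hypothetical single-parameter computation standing in for one DFTK run.
run_one_case(p) = sum(abs2, randn(1000) .* p)

params = range(0.1, 1.0; length=32)
results = Vector{Float64}(undef, length(params))

# Each iteration is independent; with allocation-heavy code the shared GC can
# still serialize the threads, which is the behaviour reported in this issue.
@threads for i in eachindex(params)
    results[i] = run_one_case(params[i])
end
```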