Threading performance suboptimal #1070

Open
antoine-levitt opened this issue Feb 26, 2025 · 7 comments

Comments

@antoine-levitt
Member

Not sure what to do about this, but I'm recording it here for future reference. I'm running relatively large-scale computations (~1h per run) sweeping over a large set of parameters, so I'm simply calling DFTK inside a @threads for loop, with DFTK's own threading disabled.

  • I had weird errors with TimerOutputs (annoyingly, after the SCF), so I disabled it. Should we try to detect that DFTK is being run within threads and warn about TimerOutputs?
  • Parallel efficiency seems very bad. I run 16 threads on a large machine (~100 cores), and top reports the equivalent of only 2-4 CPUs in use during the actual computation (but 16 CPUs during the init stage). On a run with smaller parameters I got the following output (times 16):
 38.674739 seconds (48.16 M allocations: 153.325 GiB, 30.36% gc time, 77 lock conflicts, 898.42% compilation time: <1% of which was recompilation)

which I'm not sure how to interpret.

Probably the solution for optimal performance in this use case is to use distributed or MPI, but threading is simple to use...

@antoine-levitt
Member Author

OK, on the big system:

6436.839131 seconds (391.31 M allocations: 32.890 TiB, 6.65% gc time, 61 lock conflicts, 4.70% compilation time: <1% of which was recompilation)

so gc and compilation are not the source of the bad parallel efficiency...

@mfherbst
Member

BLAS default threading is sometimes really bad. If you haven't already, maybe disable that.

And yes, I agree that our threading performance really sucks. We therefore hardly use it.

@mfherbst changed the title from "Threading woes" to "Threading performance suboptimal" on Feb 26, 2025
@antoine-levitt
Member Author

This is with all internal threading (DFTK, BLAS, FFTW) disabled; it's just a threaded loop on top of DFTK. So it's not really a DFTK issue, more of a Julia one, but I'm really not sure what's going on.
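For reference, the setup described here can be sketched roughly as follows. This is only an illustration of the pattern, not the actual script: `run_one` is a hypothetical stand-in for one DFTK run, and only BLAS threading is switched off because that needs nothing beyond the standard library (with FFTW.jl loaded, `FFTW.set_num_threads(1)` would presumably be set as well).

```julia
using LinearAlgebra   # stdlib; provides BLAS.set_num_threads

# Keep BLAS single-threaded so the outer loop is the only source of parallelism.
BLAS.set_num_threads(1)

# Hypothetical stand-in for one expensive, independent run (not a DFTK call).
function run_one(param::Float64)
    A = fill(param, 50, 50) + I
    return tr(A * A)
end

params = collect(range(0.1, 1.0; length=8))
results = Vector{Float64}(undef, length(params))

# Each iteration writes only to its own slot, so no locking is needed.
Threads.@threads for i in eachindex(params)
    results[i] = run_one(params[i])
end
```

Julia has to be started with several threads (for example `julia -t 16`) for the loop to actually run in parallel.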

@antoine-levitt
Member Author

People on Slack pointed me to https://discourse.julialang.org/t/inconsistent-cpu-utilisation-in-threads-loops/110512/12. Basically, even if the GC is reported as taking 5% of the time, its real cost is possibly much higher in practice. I've tried turning off the GC entirely and that does something weird (CPU usage goes from 10 to 0.5), but it's probably a sign that the problem is GC-related. So either we try to use fewer allocations (this particular code is using direct minimization, which is possibly not that well optimized) or I switch to not using threads.
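A rough way to see how allocation-heavy a threaded loop is would be to wrap the whole loop in `@timed` and look at the bytes allocated and the reported GC time. The sketch below assumes a hypothetical `allocating_work` function in place of the real computation. As far as I understand, these counters are process-wide, so they are only meaningful around the whole loop and not per task, and the reported GC time need not reflect the full cost of threads waiting on each other during collection.

```julia
# Hypothetical allocation-heavy workload standing in for one run.
function allocating_work(n::Int)
    total = 0.0
    for _ in 1:n
        total += sum(rand(1_000))   # allocates a fresh vector every iteration
    end
    return total
end

nruns = 4
results = Vector{Float64}(undef, nruns)

# Time the whole threaded loop; @timed also records bytes allocated and GC time.
stats = @timed begin
    Threads.@threads for i in 1:nruns
        results[i] = allocating_work(2_000)
    end
end

println("wall time:       ", round(stats.time; digits=3), " s")
println("bytes allocated: ", stats.bytes)
println("reported GC:     ", round(stats.gctime; digits=3), " s")
```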

@mfherbst
Member

mfherbst commented Feb 26, 2025

> So either we try to use fewer allocations

I think that would also be beneficial for GPU use cases. I don't think we are at all careful about this in the DM (direct minimization) code right now. Maybe this changes when the ManOpt integration is ready.

@antoine-levitt
Member Author

I've solved this issue by using Distributed with pmap, which is convenient enough, instead of threads.
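A minimal sketch of that pattern, under the assumption that each run is independent: `run_one` is a hypothetical placeholder for the real computation, and the worker count is arbitrary. In the real script one would presumably also need `@everywhere using DFTK` so that every worker has the package loaded. Each worker is a separate process with its own heap and garbage collector, which is plausibly why this avoids the GC contention seen with threads.

```julia
using Distributed

# Start two local worker processes; the count is arbitrary for this sketch.
addprocs(2)

# Define the (hypothetical) per-parameter computation on every process.
@everywhere function run_one(param)
    return sum(abs2, param .* (1:1000))
end

params = 0.1:0.1:0.8

# pmap hands out one parameter at a time to whichever worker is free.
results = pmap(run_one, params)

# Shut the workers down once the sweep is finished.
rmprocs(workers())
```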

@mfherbst
Member

OK, good to know!


2 participants