For this library to be remotely competitive, almost every routine will need to be heavily optimized. We'll start with the single-threaded version.
Right now, the potential evaluators have been reasonably well optimized, though there's more work everywhere else. We should write some basic benchmarking code, similar to what's already being used with `nanobench` inside examples/ewald_total.cpp. We should avoid benchmarking pure library calls like `ducc0`'s FFT, but any routines we write that appear somewhat intensive should be optimized. A good start might be the application of `G(k)` after the first application of the Fourier transform.
Profile the code, find the hotspots, benchmark them for a baseline, and then refactor them or SIMD-optimize them to be as efficient as possible. We're currently using SCTL for SIMD vectorization, so you can look at the code we're using for the potential calculation as a reference for manually vectorizing some loops.
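
As a rough starting point for the `G(k)` baseline, here is a minimal sketch of a scalar kernel application plus a timing helper. It is not the library's actual code: `apply_kernel`, `time_apply_kernel`, the flat grid layout, and the assumption that `G(k)` is a precomputed real array are all hypothetical, and it uses `std::chrono` only to stay self-contained (in the repository the timing would more naturally go through `nanobench`, as in examples/ewald_total.cpp).

```cpp
#include <cassert>
#include <chrono>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Fourier-space scaling step: multiply each
// coefficient of the transformed grid by a precomputed real value G(k).
// Assumes grid and G are flat arrays with the same ordering.
void apply_kernel(std::vector<std::complex<double>> &grid, const std::vector<double> &G) {
    assert(grid.size() == G.size());
    const std::size_t n = grid.size();
    for (std::size_t i = 0; i < n; ++i)
        grid[i] *= G[i];
}

// Hypothetical timing helper: average wall-clock seconds per call over `reps`
// repetitions. The grid is taken by value so the caller's data is untouched.
// Note that repeated scaling changes the magnitudes, so use G values near 1
// if denormals or overflow would distort the timing.
double time_apply_kernel(std::vector<std::complex<double>> grid, const std::vector<double> &G,
                         int reps) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        apply_kernel(grid, G);
    const auto t1 = std::chrono::steady_clock::now();

    // Read one result through a volatile so the loop is not optimized away.
    volatile double sink = grid.empty() ? 0.0 : grid[0].real();
    (void)sink;

    return std::chrono::duration<double>(t1 - t0).count() / reps;
}
```

A scalar loop like this gives the baseline number; a manually vectorized variant (for example with SCTL, following the potential evaluators) can then be timed against it on the same grid size.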