I was doing some reading, and I came across some interesting information that constrains the design of Grackle's API. I figured it would be good if we started to record this information.
## Irrelevance for the initial port
A lot of this discussion isn't very relevant for our initial porting efforts. It's going to take some time to get the full implementation of `solve_rate_cool` converted so that it can run as a single kernel. While we work towards that goal, the API won't change much, if at all (we assume that external data is provided on the CPU, and we internally transfer it to and from the GPU).
## Useful Basic Concepts
For developers that are less experienced with GPU programming, it's instructive to briefly review two (related) concepts:
- **Asynchronous Kernel Execution:** All work done by GPUs is performed in compute kernels that run asynchronously.
  - After we launch a kernel, we are informed whether the launch was successful. The simplest thing to do is to then wait until the kernel completes before launching the next kernel.
  - Alternatively, you can launch a sequence of kernels in a row and then wait for all of those kernels to finish. This pipelining minimizes the amount of time the program spends waiting for communication between the CPU and GPU. (Most GPU-based simulation codes end up doing this.)
  - ASIDE: while our initial port of `solve_rate_cool` will probably consist of multiple kernel launches, it will be significantly faster if all the logic is completed in a single kernel.[^1]
- **Streams:** CUDA/HIP introduce the concept of streams to organize work. Strictly speaking, launching a kernel allows you to specify a stream.
  - If no stream is specified, kernels are launched on the default stream.
  - Kernels in a given stream are executed in the order that they were launched.
  - While simpler GPU programs generally just use the default stream (indeed, that's what Cholla does -- there hasn't been a compelling reason to explore multiple streams since it runs unigrid simulations), I think I've heard that some more sophisticated simulation codes use multiple streams.
  - IMPORTANT: If a single CPU thread is driving multiple GPUs, then each GPU will be accessed with a separate stream.
  - ASIDE: the terminology is slightly different in SYCL (relevant for Intel GPUs). In that case, kernels are submitted to a SYCL command queue. (A SYCL command queue essentially plays the same role as a CUDA/HIP stream.)
## API Constraints
### `chemistry_data_storage` constraints
For Grackle to effectively support GPUs, all of the rates tracked by `chemistry_data_storage` should generally be stored in GPU memory. This has important downstream consequences.
### Supporting codes where a single thread drives multiple GPUs
While this isn't something we immediately need to worry about supporting, I think we definitely want to design our API so that it's easy to eventually support. It's totally plausible to me that such codes might become quite important.
To support this, we at the very least need to pass the GPU device ids to the function that initializes `chemistry_data_storage` (see below for related points).
### Retain the ability to use Kokkos within Grackle
I'm not saying that we need to implement GPU support with Kokkos, but I think it's VERY important for us to retain the capability to add Kokkos support for the foreseeable future.[^2] The proper way to do this requires us to have access to the stream objects at the time that we initialize `chemistry_data_storage`.
### Reusing data in `chemistry_data_storage` for multiple streams?
So there are multiple different kinds of device memory. The relevant memory spaces are global memory, texture memory, and constant memory. The goal is to try to store all rate data of `chemistry_data_storage` in global memory, since that's the easiest thing to do. In that scenario, I think the idiomatic thing to do is to allocate separate copies of the data for each stream. It's probably worth reading more about texture memory and constant memory. As I understand it, there isn't much benefit to using texture memory anymore, and constant memory is generally very small.
### What does this all mean?
The simplest way to do all this is probably to make it possible to register a `cudaStream_t`, `hipStream_t`, or `sycl::queue` (from `<cuda_runtime.h>`, `<hip/hip_runtime.h>`, or `<sycl/sycl.hpp>`) within `chemistry_data`:
- frankly, I don't love that at all. First, it's a little inelegant that these are only supported on a conditional basis (but that's unavoidable). Moreover, I don't love the idea of tracking backend-specific runtime parameters alongside the chemistry-solver configuration (which should generally be portable across parallelism backends) in the same object (plus all the other parameters can be encoded as strings). But, if we are implementing a C API, and we want to avoid making it overly large/complex, this is probably the simplest path...
- to be clear, Grackle should have sensible behavior if these aren't provided. Specifically, we should use the default CUDA/HIP stream. For Intel GPUs, I think the default context is the equivalent.
- I think we probably want to store this information in an opaque struct. For the 4.0 release, I was already thinking that `chemistry_data` (or equivalent -- if we rename) should become opaque anyway.
- it's noteworthy that the SYCL runtime library requires the use of C++. While the CUDA/HIP runtimes allow the use of regular C, I'm EXTREMELY skeptical that anybody who wants this level of control would be compiling a project without any C++ (if you want this level of control, you're almost certainly writing your own kernels). Thus, it's probably fine to require C++ to customize this particular behavior...
- the fact that these types are externally defined adds another wrinkle. Historically, we have gone to great lengths to avoid adding include-directives to `"grackle.h"`.
  - If we had include-directives in that file for `<cuda_runtime.h>`, `<hip/hip_runtime.h>`, or `<sycl/sycl.hpp>`, we would need to make sure that the appropriate include-directives were passed to the compiler. This is straightforward if the downstream code is built with CMake or finds Grackle's config info with pkg-config, but gets very messy if people are including Grackle the classic way.
  - Maybe we can define a separate header that unlocks this functionality (and only officially support it if people link Grackle via CMake/pkg-config). Perhaps we could conditionally ship `grackle/cuda.h`, `grackle/hip.h`, and `grackle/sycl.hpp`. Or maybe it would be better to just provide `grackle-gpu.hpp` and adjust its contents based on the way Grackle was configured?
  - Alternatively, maybe we can get away with forward declaring the types aliased by `cudaStream_t`/`hipStream_t` without actually including the headers of the runtime framework? I'm a little skeptical.
## Footnotes

[^1]: If the logic of `solve_rate_cool` is split across multiple kernel launches, we will essentially need to launch one or more kernels per subcycle (there really isn't any other way to do it). This is primarily an issue because we don't know how many subcycles we need ahead of time. Thus, after each subcycle, we need to transfer data from the GPU and check whether we need to launch kernels for another subcycle -- this will be quite slow.

[^2]: I think this is important because the Kokkos project has a lot more manpower than us. This might be the easiest way for us to support more exotic architectures/frameworks (e.g. Intel GPUs or OpenMP-Offload). Plus it's plausible that they may be a lot more on top of taking advantage of new features.