I was doing some reading, and I came across some interesting information that constrains the design of Grackle's API. I figured it would be good if we started to record this information.
## Irrelevance for the initial port
A lot of this discussion isn't very relevant for our initial porting efforts. It's going to take some time to get the full implementation of `solve_rate_cool` converted so that it can run as a single kernel. While we work towards that goal, the API won't change much, if at all (we assume that external data is provided on the CPU, and we internally transfer it to and from the GPU).
## Useful Basic Concepts
For developers that are less experienced with GPU programming, it's instructive to briefly review two (related) concepts:
- **Asynchronous Kernel Execution:** All work done by GPUs is performed in compute kernels that run asynchronously.
  - After we launch a kernel, we are informed whether the launch was successful. The simplest thing to do is to then wait until the kernel completes before launching the next kernel.
  - Alternatively, you can launch a sequence of kernels in a row and then wait for all of those kernels to finish. This pipelining minimizes the amount of time the program spends waiting for communication between the CPU and GPU. (Most GPU-based simulation codes end up doing this.)
  - ASIDE: while our initial port of `solve_rate_cool` will probably consist of multiple kernel launches, it will be significantly faster if all the logic is completed in a single kernel.[^1]
- **Streams:** CUDA/HIP introduce the concept of streams to organize work. Strictly speaking, launching a kernel allows you to specify a stream.
  - If no stream is specified, kernels are launched on the default stream.
  - Kernels in a given stream are executed in the order that they were launched.
  - While simpler GPU programs generally just use the default stream (indeed, that's what Cholla does -- there hasn't been a compelling reason to explore multiple streams since it runs unigrid simulations), I think I've heard that some more sophisticated simulation codes use multiple streams.
  - IMPORTANT: If a single CPU thread is driving multiple GPUs, then each GPU will be accessed with a separate stream.
  - ASIDE: the terminology is slightly different in SYCL (relevant for Intel GPUs). In that case, kernels are submitted to a SYCL command queue. (A SYCL command queue essentially plays the same role as a CUDA/HIP stream.)
## API Constraints
### `chemistry_data_storage` constraints
For Grackle to effectively support GPUs, all of the rates tracked by `chemistry_data_storage` should generally be stored in GPU memory. This has important downstream consequences.
### Supporting codes where a single thread drives multiple GPUs
While this isn't something we immediately need to worry about supporting, I think we definitely want to design our API so that it's easy to eventually support. It's totally plausible to me that such codes might become quite important.
To support this, we at the very least need to pass the GPU device ids to the function that initializes `chemistry_data_storage` (see below for related points).
### Retain the ability to use Kokkos within Grackle
I'm not saying that we need to implement GPU support with Kokkos, but I think it's VERY important for us to retain the capability to add Kokkos support for the foreseeable future.[^2] The proper way to do this requires us to have access to the stream objects at the time that we initialize `chemistry_data_storage`.
### Reusing data in `chemistry_data_storage` for multiple streams?
So there are multiple different kinds of device memory. The relevant memory spaces are global memory, texture memory, and constant memory. The goal is to try to store all rate data of `chemistry_data_storage` in global memory, since that's the easiest thing to do. In that scenario, I think the idiomatic thing to do is to allocate separate copies of the data for each stream. It's probably worth reading more about texture memory and constant memory. As I understand it, there isn't much benefit to using texture memory anymore, and constant memory is generally very small.
### What does this all mean?
The simplest way to do all this is probably to make it possible to register a `cudaStream_t`, `hipStream_t`, or `sycl::queue` (from `<cuda_runtime.h>`, `<hip/hip_runtime.h>`, or `<sycl/sycl.hpp>`) within `chemistry_data`:
- frankly, I don't love that at all. First, it's a little inelegant that these are only supported on a conditional basis (but that's unavoidable). Moreover, I don't love the idea of tracking backend-specific runtime parameters alongside the chemistry-solver configuration (which should generally be portable across parallelism backends) in the same object (plus all the other parameters can be encoded as strings). But, if we are implementing a C API, and we want to avoid making it overly large/complex, this is probably the simplest path...
- to be clear, Grackle should have sensible behavior if these aren't provided. Specifically, we should use the default CUDA/HIP stream. For Intel GPUs, I think the default context is the equivalent.
- I think we probably want to store this information in an opaque struct. For the 4.0 release, I was already thinking that `chemistry_data` (or equivalent -- if we rename) should become opaque anyway.
- it's noteworthy that the SYCL runtime library requires the use of C++. While the CUDA/HIP runtimes allow the use of regular C, I'm EXTREMELY skeptical that anybody who wants this level of control would be compiling a project without any C++ (if you want this level of control, you're almost certainly writing your own kernels). Thus, it's probably fine to require C++ to customize this particular behavior...
- the fact that these types are externally defined adds another wrinkle. Historically, we have gone to great lengths to avoid adding include-directives to `"grackle.h"`.
  - If we had include-directives in that file for `<cuda_runtime.h>`, `<hip/hip_runtime.h>`, or `<sycl/sycl.hpp>`, we would need to make sure that the appropriate include-directives were passed to the compiler. This is straightforward if the downstream code is built with CMake or finds Grackle's config info with pkg-config, but gets very messy if people are including Grackle the classic way.
  - Maybe we can define a separate header that unlocks this functionality (and only officially support it if people link Grackle via CMake/pkg-config). Perhaps we could conditionally ship `grackle/cuda.h`, `grackle/hip.h`, and `grackle/sycl.hpp`. Or maybe it would be better to just provide `grackle-gpu.hpp` and adjust its contents based on the way Grackle was configured?
  - Alternatively, maybe we can get away with forward declaring the types aliased by `cudaStream_t`/`hipStream_t` without actually including the headers of the runtime framework? I'm a little skeptical.
## Footnotes

[^1]: If the logic of `solve_rate_cool` is split across multiple kernel launches, we will essentially need to launch one or more kernels per subcycle (there really isn't any other way to do it). This is primarily an issue because we don't know how many subcycles we need ahead of time. Thus, after each subcycle, we need to transfer data from the GPU and check whether we need to launch kernels for another subcycle -- this will be quite slow.

[^2]: I think this is important because the Kokkos project has a lot more manpower than us. This might be the easiest way for us to support more exotic architectures/frameworks (e.g. Intel GPUs or OpenMP-Offload). Plus it's plausible that they may be a lot more on top of taking advantage of new features.