RFC Move histogram CPU backend implementation to Supported #2015

they can select from when using oneDPL. It is important that all combinations of
`histogram` APIs.

### Performance
Histogram algorithms typically involve minimal computation and are likely to be memory-bound. So, the implementation
prioritizes reducing memory accesses and minimizing temporary memory traffic.

For CPU backends, we will focus on input sizes ranging from 32K to 4M elements and 32 to 4K histogram bins. Smaller
input sizes may be best suited for a serial histogram implementation, and very large sizes may be better suited for GPU
percentage of the output histogram may be occupied, even when considering dividing
This could be explored if we find temporary storage is too large for some cases and the atomic approach does not
provide a good fallback.

## Supported Implementation
After exploring the implementation options described above for `histogram`, the following implementation better
represents the important use cases and provides reasonable performance for most cases.

### Version Information
Histogram APIs are supported for host execution policies as of oneDPL version 2022.8.0, which implements
specification version 1.4.
```
ONEDPL_VERSION_MAJOR=2022
ONEDPL_VERSION_MINOR=8
ONEDPL_VERSION_PATCH=0

ONEDPL_SPEC_VERSION=104
```

Histogram APIs are supported for device execution policies as of oneDPL version 2022.5.0. They were first specified in
oneDPL specification version 1.4; however, that release predates the definition of `ONEDPL_SPEC_VERSION` within our
implementation of oneDPL.
```
ONEDPL_VERSION_MAJOR=2022
ONEDPL_VERSION_MINOR=5
ONEDPL_VERSION_PATCH=0
```
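
As a usage sketch (not part of the RFC itself), an application can key off these macros to guard calls to the
host-policy `histogram` APIs on older oneDPL releases. The `MY_APP_HAS_HOST_HISTOGRAM` guard name below is
hypothetical; the `ONEDPL_VERSION_*` macros are the ones shown above.
```
// Hypothetical feature guard for host execution policy histogram support (>= 2022.8.0).
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>

#if ONEDPL_VERSION_MAJOR > 2022 || (ONEDPL_VERSION_MAJOR == 2022 && ONEDPL_VERSION_MINOR >= 8)
#    define MY_APP_HAS_HOST_HISTOGRAM 1
#else
#    define MY_APP_HAS_HOST_HISTOGRAM 0
#endif
```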

### Embarrassingly Parallel Via Temporary Histograms
This method uses temporary storage and a pair of calls to backend-specific `parallel_for` functions to accomplish the
`histogram`. These calls use the existing infrastructure to provide properly composable parallelism, without extra
histogram-specific patterns in the implementation of a backend.

#### Enumerable Thread Local Storage feature
This algorithm requires that each parallel backend add a "make" function with the following signature:
```
template <typename _ValueType, typename... _Args>
/* unspecified enumerable thread local storage type */
oneapi::dpl::__par_backend::__make_enumerable_tls(_Args&&... __args)
```
returning an object which represents an enumerable thread local storage. The enumerable thread local storage object
must provide the following interfaces:

* `__ValueType& get_for_current_thread()`

Returns a reference to the current thread's stored object. This function is
meant to be used within a parallel context, where each thread can access storage exclusive to its thread index. The
thread local storage should be created on the first call of this function per thread, which avoids unnecessary
overhead when fewer threads than the maximum are used.

* `__ValueType& get_with_id(std::size_t __i)`

Returns a reference to the stored object at an index in the list of
already created elements, skipping any empty elements. This function must not be called concurrently with
`get_for_current_thread()`, which may create new thread local storage elements.

* `std::size_t size() const`

Returns the number of already created elements. Like `get_with_id()`, this must not be
called concurrently with `get_for_current_thread()`, which may change the number of created elements.

A unified implementation of `__enumerable_thread_local_storage` is provided with these features in
`include/oneapi/dpl/pstl/parallel_backend_utils.h` for use as a base implementation in a Curiously Recurring Template
Pattern (CRTP) for individual parallel backends to derive from. When using this base struct, individual parallel
backends must provide the following methods:

* `static std::size_t get_num_threads()`

Returns the number of threads available in the current parallel backend and context.

* `static std::size_t get_thread_num()`

Returns the index of the current thread from within a parallel section.

When these two functions are provided by the derived structure, the base implementation provides the functionality
specified above. An example of this is found in both `include/oneapi/dpl/pstl/omp/util.h` and
`include/oneapi/dpl/pstl/parallel_backend_tbb.h`. The serial backend instead defines its own specific
`__enumerable_thread_local_storage` type in `include/oneapi/dpl/pstl/parallel_backend_serial.h`.
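
As an illustration only (not the actual oneDPL source), the sketch below shows how such a CRTP base could provide the
three interfaces on top of the two static hooks, with a hypothetical OpenMP-flavored derived type. All names and the
lazy `std::optional`-based storage are assumptions made for this example.
```
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <vector>

#include <omp.h>

// Sketch of a CRTP base: it implements the enumerable thread local storage
// interface purely in terms of the derived backend's two static hooks.
template <typename _Derived, typename _ValueType>
struct __example_enumerable_tls_base
{
    __example_enumerable_tls_base() : __storage(_Derived::get_num_threads()) {}

    // Lazily construct this thread's element on first use (parallel context).
    _ValueType&
    get_for_current_thread()
    {
        std::optional<_ValueType>& __slot = __storage[_Derived::get_thread_num()];
        if (!__slot)
            __slot.emplace();
        return *__slot;
    }

    // Index into the already created elements, skipping empty slots.
    // Must not be called concurrently with get_for_current_thread().
    _ValueType&
    get_with_id(std::size_t __i)
    {
        for (std::optional<_ValueType>& __slot : __storage)
            if (__slot && __i-- == 0)
                return *__slot;
        std::abort(); // precondition violated: __i >= size()
    }

    // Number of already created elements; same concurrency caveat as above.
    std::size_t
    size() const
    {
        std::size_t __count = 0;
        for (const std::optional<_ValueType>& __slot : __storage)
            if (__slot)
                ++__count;
        return __count;
    }

    std::vector<std::optional<_ValueType>> __storage;
};

// Hypothetical OpenMP-style backend supplying only the two required hooks.
template <typename _ValueType>
struct __example_omp_enumerable_tls
    : __example_enumerable_tls_base<__example_omp_enumerable_tls<_ValueType>, _ValueType>
{
    static std::size_t get_num_threads() { return static_cast<std::size_t>(omp_get_max_threads()); }
    static std::size_t get_thread_num() { return static_cast<std::size_t>(omp_get_thread_num()); }
};
```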

We decided to avoid using the class provided by TBB, `enumerable_thread_specific`, because on Windows it indirectly
includes `windows.h`. The baggage associated with `windows.h` is not worth the benefit of using the existing type.

#### Algorithm
1) Call the existing `parallel_for` pattern, which performs a `histogram` on the input sequence with each thread
accumulating into its own temporary histogram provided by `__enumerable_thread_local_storage`. We rely upon the
`parallel_for` pattern to implement chunk size and thread composability.
2) Call a second `parallel_for` pattern over the `histogram` output sequence which accumulates all temporary copies of
the histogram created within `__enumerable_thread_local_storage` into the output histogram sequence. Threads divide the
histogram bin axis, and each chunk loops through all temporary histograms to accumulate into its section of the output
histogram. In this parallel section, the inner loop for each thread iterates over individual histogram bins within the
same temporary histogram copy, accumulating into the corresponding global output bins. This operates on chunks of
contiguous elements in memory, which is a better access pattern that takes advantage of cache lines, vectorization, and
memory prefetching. A simplified sketch of these two passes follows this list.
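
The following is a simplified, self-contained sketch of these two passes. It is not the oneDPL implementation: plain
OpenMP loops stand in for the backend `parallel_for` patterns, per-thread `std::vector`s stand in for
`__enumerable_thread_local_storage`, and the bin mapping is the trivial "even bins of width 1 starting at 0" case.
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include <omp.h>

// output_hist.size() determines the number of bins; values >= num_bins are ignored.
void
even_bin_histogram_sketch(const std::vector<std::uint32_t>& input, std::vector<std::uint32_t>& output_hist)
{
    const std::size_t num_bins = output_hist.size();
    std::vector<std::vector<std::uint32_t>> thread_hists(omp_get_max_threads());

    // Pass 1: parallelize over input elements; each thread accumulates into its
    // own temporary histogram, created lazily on first use.
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(input.size()); ++i)
    {
        std::vector<std::uint32_t>& local = thread_hists[omp_get_thread_num()];
        if (local.empty())
            local.resize(num_bins, 0);
        if (input[i] < num_bins)
            ++local[input[i]];
    }

    // Pass 2: parallelize over chunks of histogram bins; each chunk loops over
    // all temporary histograms, accumulating contiguous runs of bins into the
    // output for a cache- and vectorization-friendly access pattern.
    std::fill(output_hist.begin(), output_hist.end(), 0);
#pragma omp parallel
    {
        const std::size_t nthreads = omp_get_num_threads();
        const std::size_t tid = omp_get_thread_num();
        const std::size_t chunk = (num_bins + nthreads - 1) / nthreads;
        const std::size_t begin = std::min(num_bins, tid * chunk);
        const std::size_t end = std::min(num_bins, begin + chunk);
        for (const std::vector<std::uint32_t>& local : thread_hists)
            if (!local.empty())
                for (std::size_t bin = begin; bin < end; ++bin)
                    output_hist[bin] += local[bin];
    }
}
```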

With the overhead associated with this algorithm, the implementation of each `parallel_for` may fall back to a serial
implementation. It makes sense to include this as part of a future improvement of `parallel_for`, where a user could
The proposed algorithm should have `O(N) + O(num_bins)` operations where `N` is the number of input elements, and
`num_bins` is the number of histogram bins.
#### Custom Range Bin API
The proposed algorithm should have `O(N * log(num_bins)) + O(num_bins)` operations where `N` is the number of input
elements, and `num_bins` is the number of histogram bins.

### Future work
* Improve the `parallel_for` pattern to allow the generic implementation level to specify chunk size and serial cutoff
thresholds.
* Explore options for some partial vectorization of the first `parallel_for` loop.
* Explore improvements to the `__enumerable_thread_local_storage` API for enumerating through created elements.
Consider switching to a visitor pattern and better protecting against improper usage.