@@ -364,6 +364,110 @@ Unpoisoning may not be an option, if (for example) you are not maintaining the a
* You are using an allocator which does not call the destructor during deallocation.
* You are aware that memory allocated with an allocator may be accessed even when unused by the container.

+ Offloading C++ Parallel Algorithms to GPUs
+ ------------------------------------------
+
+ Experimental support for GPU offloading has been added to ``libc++``. The
+ implementation uses OpenMP target offloading to leverage GPU compute resources.
+ The OpenMP PSTL backend can target both NVIDIA and AMD GPUs.
+ However, the implementation only supports contiguous iterators, such as
+ iterators for ``std::vector`` or ``std::array``.
+ To enable the OpenMP offloading backend, it must be selected with
+ ``LIBCXX_PSTL_BACKEND=openmp`` when installing ``libc++``. Further, when
+ compiling a program, the user must specify the command line options
+ ``-fopenmp -fexperimental-library``. To install LLVM with OpenMP offloading
+ enabled, please read
+ `the LLVM OpenMP FAQ <https://openmp.llvm.org/SupportAndFAQ.html>`_.
+ You may also want to visit
+ `the OpenMP offloading command-line argument reference <https://openmp.llvm.org/CommandLineArgumentReference.html#offload-command-line-arguments>`_.
+
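+ The snippet below is a minimal, hypothetical sketch of a program that can be
+ offloaded once ``libc++`` has been installed with the OpenMP backend and the
+ program is compiled with the flags listed above; the file name and loop body
+ are illustrative only.
+
+ .. code-block:: cpp
+
+   // Compile with: clang++ -fopenmp -fexperimental-library main.cpp
+   // (assumes libc++ was installed with LIBCXX_PSTL_BACKEND=openmp)
+   #include <algorithm>
+   #include <execution>
+   #include <vector>
+
+   int main() {
+     // std::vector provides contiguous iterators, so this algorithm call is
+     // eligible for offloading by the OpenMP PSTL backend.
+     std::vector<int> v(1024, 1);
+     std::for_each(std::execution::par_unseq, v.begin(), v.end(),
+                   [](int &x) { x += 1; });
+   }
+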
+ Example
+ ~~~~~~~
+
+ The following is an example of offloading vector addition to a GPU using our
+ standard library extension. It implements the classical AXPY vector addition
+ from BLAS, which overwrites the vector ``y`` with ``y = a*x + y``. Thus
+ ``y.begin()`` is used both as an input and as an output iterator in this
+ example.
+
+ .. code-block:: cpp
+
+   #include <algorithm>
+   #include <execution>
+   #include <vector>
+
+   template <typename T1, typename T2, typename T3>
+   void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+     std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                    y.begin(), [=](T2 xi, T3 yi) { return a * xi + yi; });
+   }
+
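+ A hypothetical caller of ``axpy`` might look like this; the vector sizes and
+ values are arbitrary, and the headers are those of the previous snippet.
+
+ .. code-block:: cpp
+
+   int main() {
+     std::vector<float> x(1000, 1.0f);
+     std::vector<float> y(1000, 2.0f);
+     // Overwrites y with 3*x + y, computed on the GPU when offloading is enabled.
+     axpy(3.0f, x, y);
+   }
+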
+ The execution policy ``std::execution::par_unseq`` states that the algorithm's
+ execution may be parallelized, vectorized, and migrated across threads. This is
+ the only execution mode that is safe to offload to GPUs, and for all other
+ execution modes the algorithms will execute on the CPU.
+ Special attention must be paid to the lambda captures when enabling GPU
+ offloading. If the lambda captures by reference, the user must manually map the
+ captured variables to the device. With a by-reference capture, the above
+ example could be implemented in the following way.
+
+ .. code-block:: cpp
+
+   template <typename T1, typename T2, typename T3>
+   void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+   #pragma omp target data map(to : a)
+     std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                    y.begin(), [&](T2 xi, T3 yi) { return a * xi + yi; });
+   }
+
+ However, if unified shared memory (USM) is enabled, no additional data mapping
+ is necessary when capturing by reference.
+
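+ For example, on a system with unified shared memory, a sketch of the
+ by-reference variant could simply drop the mapping directive, assuming the
+ program requests USM with the OpenMP ``requires`` directive.
+
+ .. code-block:: cpp
+
+   #pragma omp requires unified_shared_memory
+
+   template <typename T1, typename T2, typename T3>
+   void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+     // With USM, the captured variables are accessible from the device
+     // without an explicit map clause.
+     std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                    y.begin(), [&](T2 xi, T3 yi) { return a * xi + yi; });
+   }
+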
+ Compiling functions for GPUs with OpenMP
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ The C++ standard defines that all accesses to memory are inside a single address
+ space. However, discrete GPU systems have distinct address spaces. A single
+ address space can be emulated if your system supports unified shared memory.
+ However, many discrete GPU systems do not, and in those cases it is important to
+ pass device function pointers to the parallel algorithms. Below is an example of
+ how the OpenMP ``declare target`` directive with the ``indirect`` clause can be
+ used to mark that a function should be compiled for both host and device.
+
+ .. code-block:: cpp
+
+   #include <cassert>
+   #include <cmath>
+   #include <execution>
+   #include <functional>
+   #include <numeric>
+   #include <vector>
+
+   // This function computes the squared difference of two floating-point numbers
+   float squared(float a, float b) { return a * a - 2.0f * a * b + b * b; }
+
+   // Declare that the function must be compiled for both host and device
+   #pragma omp declare target indirect to(squared)
+
+   int main() {
+     std::vector<float> a(100, 1.0);
+     std::vector<float> b(100, 1.25);
+
+     // Pass the host function pointer to the parallel algorithm and let OpenMP
+     // translate it to the device function pointer internally
+     float sum =
+         std::transform_reduce(std::execution::par_unseq, a.begin(), a.end(),
+                               b.begin(), 0.0f, std::plus{}, squared);
+
+     // Validate that the result is approximately 6.25
+     assert(std::abs(sum - 6.25f) < 1e-10);
+     return 0;
+   }
+
+ Without unified shared memory, the above example would not work if the host
+ function pointer ``squared`` were passed to the parallel algorithm without the
+ ``declare target indirect`` declaration.
+
+ Important notes about exception handling
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ GPU architectures do not support exception handling. With current versions of
+ Clang, a program containing parallel algorithms with exceptions in offloaded
+ code regions will still compile, but the program will terminate if an exception
+ is thrown on the device. This does not conform to the C++ standard, and
+ exception handling on GPUs will hopefully be better supported in future
+ releases of LLVM.
+
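+ As an illustration of the behavior described above, the following sketch
+ compiles, but if the lambda were offloaded and the exception thrown on the
+ device, the program would terminate instead of propagating the exception.
+
+ .. code-block:: cpp
+
+   #include <algorithm>
+   #include <execution>
+   #include <stdexcept>
+   #include <vector>
+
+   int main() {
+     std::vector<int> v(100, 1);
+     std::for_each(std::execution::par_unseq, v.begin(), v.end(), [](int &x) {
+       if (x < 0)
+         throw std::runtime_error("negative element"); // terminates on the GPU
+       x *= 2;
+     });
+   }
+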
Platform specific behavior
==========================