
Commit c025dbd

caugonnet, andralex, and miscco authored
[STF] reduce access mode (#2830)
* Experiment to start introducing a reduction access mode used in kernels (e.g. parallel_for).
* Add a trait to count the number of reductions required in a tuple of deps
* WIP: create a new scalar<T> interface which can be used in a reduction access mode, and start to implement all the mechanisms for reductions in parallel_for
* WIP: Introduce owning_container_of trait class
* WIP: save progress here, lots of hardcoded things and we need to move to cuda::std::tuple
* WIP: first prototype working...
* Proper initialization of shared memory buffers, and add another example
* Some cleanups and renaming of classes for better clarity
* clang-format
* Workaround some false unused captured variable warning
* Fix various C++ errors, and do not use the I variable
* Rework the CFD example to use reductions, and generalize the transfer_host method
* clang-format
* Implement transfer_host (name subject to change!) directly in the context
* clang-format
* Make it possible to either accumulate a reduction result with an existing value, or initialize a new one
* Implement a set of predefined reducers
* clang-format
* Move the definition of do_init and no_init
* Update word count example
* Code simplification to facilitate the transition to ::cuda::std::tuple
* Use ::cuda::std::tuple for reduction variables
* Use proper type for the size of buffers
* clang-format
* Remove unused variables
* Fix buffer size
* Add missing typename
* Add missing typename
* Add maybe_unused for variables currently unused in a WIP code
* clang-format
* Add a doxygen comment
* Add missing constructors
* Code cleanup
* Remove dead code
* task_dep_op_none should just be a tag type, there is no need to implement a fake apply_op operation
* Remove dead code
* Remove unused template parameter
* Slightly simpler count_type trait
* clang-format
* Add a small unit test to test count_type_v
* Do not define both no_init and do_init types anymore, just expose no_init to the user, then use true_type and false_type internally. Also rename reduce_do_init to reduce for clarity, as this is the most common case.
* Sort examples in cmake
* clang-format
* Simplify redux_vars
* Use ::std::monostate instead of EmptyType
* Simplify redux_vars
* clang-format
* Add a missing doxygen comment
* Replace 01-axpy-reduce.cu with 09-dot-reduce.cu, which is a more meaningful example
* clang-format
* Fix word count example
* Minimize copying of dependencies
* Fix how we load data in shared memory during the finalization kernel of the reduction. Fix errors where block size and grid size were inverted.
* clang-format
* Example to compute pi using the Monte Carlo method
* Add a unit test to ensure the reduce access mode works
* clang-format
* Not all ASCII chars between A and z are alphanumerical chars
* Remove dead code
* Minor cleanups
* Not all ASCII chars between A and z are alphanumerical chars
* No need for a type alias when we use it once only
* Fix pi test
* Move reduction operator and init flag to task_dep, step 1
* Add a new test to check that the scalar interface works as expected (it is currently broken on graphs)
* Fully implement the scalar interface
* Fix potentially uninitialized variable warnings
* Fix unused variable warning
* Add a test to ensure we properly deal with empty shapes in parallel_for: it indeed requires initializing the variable. This is failing currently because we did not implement this initialization.
* clang-format
* Implement the CUDA kernel for reduction with empty shapes
* Move reduction operator and init flag to task_dep, step 2: parallel_for_scope is templated on deps_ops, not their arguments
* Move reduction operator and init flag to task_dep, step 3: make parallel_for overloads more generic
* Fix the finalize kernel if there are more threads than items
* clang-format
* Implementation of the reduce access mode for CUDA graphs
* Test empty shapes with reductions on both streams and graphs
* Move reduction operator and init flag to task_dep, step 4: eliminate task_dep_op entirely
* clang-format
* Fix parallel_for on host
* Disable nvrtc workaround (#1116): nvbug3961621 has been fixed in 12.2. Addresses nvbug4263883.
* Tighten overloading of context::parallel_for
* clang-format
* Optimize loop function by hoisting the lambda definition out of the loop and by using universal references for intermediate lambdas
* No need for SelectType
* A few more improvements
* Fix build
* Documentation for reduce()
* Improve doc for reduce()
* Rename transfer_host to wait
* Doxygen blocks for reducer operators
* Add missing doxygen blocks or make them more accurate
* Remove commented code
* Remove printf
* Add sanity checks to detect unimplemented uses of reduce()
* Fix a logic error
* Remove maybe_unused that is not needed anymore
* Properly handle reduce on a CUDA graph that is not executed by device 0
* Reimplement pagerank using a reduce access mode
* No need for atomicMaxFloat when using reduce(reducer::maxval<float>{})
* Use references in calculating_pagerank
* Add a missing doxygen block for scalar<T>
* Remove count_type_v and count_type, which are not used anymore
* Replace an atomic add by a reduction
* Simpler scalar implementation with a struct
* Comment to clarify get_owning_container_of
* Remove useless ctor
* Fix spelling issue
* clang-format
* Explain how we statically dispatch between the different task_dep(_untyped) constructors depending on whether they are read only or not. This is something that could be further improved!
* Do provide constructors for scalar<T>

---------

Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
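In short, the new reduce() access mode lets a parallel_for accumulate into a scalar<T> logical data and read the result back on the host with ctx.wait(). The following is a condensed sketch of the pattern, assembled from the examples added in this commit; the names lA, lB, lsum and the array sizes are illustrative only and not part of the changeset.

#include <cuda/experimental/stf.cuh>

using namespace cuda::experimental::stf;

int main()
{
  context ctx;

  double A[128], B[128];
  for (size_t i = 0; i < 128; i++)
  {
    A[i] = 1.0;
    B[i] = 2.0;
  }

  auto lA = ctx.logical_data(A);
  auto lB = ctx.logical_data(B);

  // Scalar logical data that will hold the reduction result
  auto lsum = ctx.logical_data(shape_of<scalar<double>>());

  // reduce(reducer::sum<double>{}) initializes the accumulator before the kernel runs
  ctx.parallel_for(lA.shape(), lA.read(), lsum.reduce(reducer::sum<double>{}))
      ->*[] __device__(size_t i, auto a, double& sum) {
            sum += a(i);
          };

  // Passing no_init{} accumulates on top of the value produced above instead of resetting it
  ctx.parallel_for(lB.shape(), lB.read(), lsum.reduce(reducer::sum<double>{}, no_init{}))
      ->*[] __device__(size_t i, auto b, double& sum) {
            sum += b(i);
          };

  // wait() blocks and returns the reduced value to the host (128 * 1.0 + 128 * 2.0)
  double total = ctx.wait(lsum);

  ctx.finalize();
  return (fabs(total - 384.0) < 1e-9) ? 0 : 1;
}

Per the commit message, wait() is the final name of the transfer_host helper from earlier iterations of this PR, and the reduce access mode is implemented for both the stream and CUDA graph backends.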
1 parent a67360d commit c025dbd

File tree

28 files changed: +1924 −413 lines changed
cudax/examples/stf/09-dot-reduce.cu (new file)

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of CUDASTF in CUDA C++ Core Libraries,
+// under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+// SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES.
+//
+//===----------------------------------------------------------------------===//
+
+/**
+ * @file
+ *
+ * @brief Implementation of the DOT kernel using a reduce access mode
+ *
+ */
+
+#include <cuda/experimental/stf.cuh>
+
+using namespace cuda::experimental::stf;
+
+int main()
+{
+  const size_t N = 16;
+  double X[N], Y[N];
+
+  double ref_res = 0.0;
+
+  for (size_t i = 0; i < N; i++)
+  {
+    X[i] = cos(double(i));
+    Y[i] = sin(double(i));
+
+    // Compute the reference result of the DOT product of X and Y
+    ref_res += X[i] * Y[i];
+  }
+
+  context ctx;
+  auto lX = ctx.logical_data(X);
+  auto lY = ctx.logical_data(Y);
+
+  auto lsum = ctx.logical_data(shape_of<scalar<double>>());
+
+  /* Compute sum(x_i * y_i) */
+  ctx.parallel_for(lY.shape(), lX.read(), lY.read(), lsum.reduce(reducer::sum<double>{}))
+      ->*[] __device__(size_t i, auto dX, auto dY, double& sum) {
+            sum += dX(i) * dY(i);
+          };
+
+  double res = ctx.wait(lsum);
+
+  ctx.finalize();
+
+  _CCCL_ASSERT(fabs(res - ref_res) < 0.0001, "Invalid result");
+}

cudax/examples/stf/CMakeLists.txt

Lines changed: 7 additions & 4 deletions
@@ -16,17 +16,18 @@ set(stf_example_sources

 # Examples which rely on code generation (parallel_for or launch)
 set(stf_example_codegen_sources
-  01-axpy-parallel_for.cu
   01-axpy-launch.cu
+  01-axpy-parallel_for.cu
   binary_fhe.cu
+  09-dot-reduce.cu
   cfd.cu
   custom_data_interface.cu
+  fdtd_mgpu.cu
   frozen_data_init.cu
   graph_algorithms/degree_centrality.cu
+  graph_algorithms/jaccard.cu
   graph_algorithms/pagerank.cu
   graph_algorithms/tricount.cu
-  graph_algorithms/jaccard.cu
-  fdtd_mgpu.cu
   heat.cu
   heat_mgpu.cu
   jacobi.cu
@@ -35,11 +36,13 @@ set(stf_example_codegen_sources
   launch_sum.cu
   launch_sum_cub.cu
   logical_gates_composition.cu
+  mandelbrot.cu
   parallel_for_2D.cu
+  pi.cu
   scan.cu
-  mandelbrot.cu
   standalone-launches.cu
   word_count.cu
+  word_count_reduce.cu
 )

 # Examples using CUBLAS, CUSOLVER...

cudax/examples/stf/cfd.cu

Lines changed: 20 additions & 142 deletions
@@ -188,86 +188,18 @@ void jacobistepvort(
   };
 }

-template <typename T>
-T transfer_host(context& ctx, logical_data<slice<T>>& ldata)
+double deltasq(context& ctx, logical_data<slice<double, 2>> lnewarr, logical_data<slice<double, 2>> loldarr)
 {
-  T out;
+  auto ldsq = ctx.logical_data(shape_of<scalar<double>>()).set_symbol("tmp_accumulator");

-  bool is_graph = ctx.is_graph_ctx();
-
-  if (is_graph)
-  {
-    ctx.host_launch(ldata.read()).set_symbol("transfer_host")->*[&](auto data) {
-      out = data(0);
-    };
-
-    /* This forces the completion of the host callback, so that the host
-     * thread can use the content for dynamic control flow */
-    cudaStreamSynchronize(ctx.task_fence());
-  }
-  else
-  {
-    ctx.task(exec_place::host, ldata.read()).set_symbol("transfer_host")->*[&](cudaStream_t stream, auto data) {
-      cuda_safe_call(cudaStreamSynchronize(stream));
-      out = data(0);
-    };
-  }
-
-  return out;
-}
-
-double
-deltasq(context& ctx, logical_data<slice<double, 2>> lnewarr, logical_data<slice<double, 2>> loldarr, int m, int n)
-{
-  auto ldsq = ctx.logical_data(shape_of<slice<double>>({1})).set_symbol("tmp_accumulator");
-
-  //
-  // for (i = 1; i <= m; i++) {
-  //   for (j = 1; j <= n; j++) {
-  //     double tmp = newarr[i * (m + 2) + j] - oldarr[i * (m + 2) + j];
-  //     dsq += tmp * tmp;
-  //   }
-  // }
-
-  auto spec = con(con<128>(hw_scope::thread));
-  ctx.launch(spec, ldsq.write(), lnewarr.read(), loldarr.read()).set_symbol("deltasq")->*
-    [m, n] __device__(auto th, auto dsq, auto newarr, auto oldarr) {
-      if (th.rank() == 0)
-      {
-        dsq(0) = 0.0;
-      }
-      th.sync();
-
-      // Each thread computes the sum of elements assigned to it
-      double local_sum = 0.0;
-      for (auto [i, j] :
-           th.apply_partition(box<2>({1, m + 1}, {1, n + 1}), std::tuple<blocked_partition, cyclic_partition>()))
-      {
-        double tmp = newarr(i, j) - oldarr(i, j);
-        local_sum += tmp * tmp;
-      }
-
-      auto ti = th.inner();
-
-      __shared__ double block_sum[th.static_width(1)];
-      block_sum[ti.rank()] = local_sum;
-
-      for (size_t s = ti.size() / 2; s > 0; s /= 2)
-      {
-        if (ti.rank() < s)
-        {
-          block_sum[ti.rank()] += block_sum[ti.rank() + s];
-        }
-        ti.sync();
-      }
-
-      if (ti.rank() == 0)
-      {
-        atomicAdd(&dsq(0), block_sum[0]);
-      }
-    };
+  ctx.parallel_for(lnewarr.shape(), ldsq.reduce(reducer::sum<double>{}), lnewarr.read(), loldarr.read())
+      .set_symbol("deltasq")
+      ->*[] __device__(size_t i, size_t j, auto& dsq, auto newarr, auto oldarr) {
+            double tmp = newarr(i, j) - oldarr(i, j);
+            dsq += tmp * tmp;
+          };

-  return transfer_host(ctx, ldsq);
+  return ctx.wait(ldsq);
 }

 void boundarypsi(context& ctx, logical_data<slice<double, 2>> lpsi, int m, int /*n*/, int b, int h, int w)
@@ -422,44 +354,14 @@ int main(int argc, char** argv)
   boundarypsi(ctx, lpsi, m, n, b, h, w);

   // compute normalisation factor for error
-  auto lbnorm = ctx.logical_data(shape_of<slice<double>>({1})).set_symbol("bnorm");
+  auto lbnorm = ctx.logical_data(shape_of<scalar<double>>()).set_symbol("bnorm");

   nvtxRangePush("Compute_Normalization");

-  // bnorm += psi * psi
-  auto spec = con(con<32>());
-  ctx.launch(spec, lbnorm.write(), lpsi.read()).set_symbol("Compute_Normalization")
-      ->*[] __device__(auto th, auto bnorm, auto psi) {
-            if (th.rank() == 0)
-            {
-              bnorm(0) = 0.0;
-            }
-            th.sync();
-            // Each thread computes the sum of elements assigned to it
-            double local_sum = 0.0;
-            for (auto [i, j] : th.apply_partition(shape(psi)))
-            {
-              local_sum += psi(i, j) * psi(i, j);
-            }
-
-            auto ti = th.inner();
-
-            __shared__ double block_sum[th.static_width(1)];
-            block_sum[ti.rank()] = local_sum;
-
-            for (size_t s = ti.size() / 2; s > 0; s /= 2)
-            {
-              if (ti.rank() < s)
-              {
-                block_sum[ti.rank()] += block_sum[ti.rank() + s];
-              }
-              ti.sync();
-            }
-
-            if (ti.rank() == 0)
-            {
-              atomicAdd(&bnorm(0), block_sum[0]);
-            }
+  // bnorm = psi * psi
+  ctx.parallel_for(lpsi.shape(), lpsi.read(), lbnorm.reduce(reducer::sum<double>{}))
+      ->*[] __device__(size_t i, size_t j, auto psi, auto& bnorm) {
+            bnorm += psi(i, j) * psi(i, j);
           };

   if (!irrotational)
@@ -468,37 +370,13 @@ int main(int argc, char** argv)
     boundaryzet(ctx, lzet, lpsi, m, n);

     // update normalisation
-    ctx.launch(spec, lbnorm.rw(), lzet.read()).set_symbol("Compute_Normalization")
-        ->*[] __device__(auto th, auto bnorm, auto zet) {
-              // Each thread computes the sum of elements assigned to it
-              double local_sum = 0.0;
-              for (auto [i, j] : th.apply_partition(shape(zet)))
-              {
-                local_sum += zet(i, j) * zet(i, j);
-              }
-
-              auto ti = th.inner();
-
-              __shared__ double block_sum[th.static_width(1)];
-              block_sum[ti.rank()] = local_sum;
-
-              for (size_t s = ti.size() / 2; s > 0; s /= 2)
-              {
-                if (ti.rank() < s)
-                {
-                  block_sum[ti.rank()] += block_sum[ti.rank() + s];
-                }
-                ti.sync();
-              }
-
-              if (ti.rank() == 0)
-              {
-                atomicAdd(&bnorm(0), block_sum[0]);
-              }
+    ctx.parallel_for(lzet.shape(), lzet.read(), lbnorm.reduce(reducer::sum<double>{}, no_init{}))
+        ->*[] __device__(size_t i, size_t j, auto zet, auto& bnorm_zet) {
+              bnorm_zet += zet(i, j) * zet(i, j);
             };
   }

-  double bnorm = transfer_host(ctx, lbnorm);
+  double bnorm = ctx.wait(lbnorm);
   bnorm = sqrt(bnorm);

   // begin iterative Jacobi loop
@@ -525,11 +403,11 @@ int main(int argc, char** argv)
     bool compute_error = (iter == numiter) || (checkerr && (iter % printfreq == 0));
     if (compute_error)
     {
-      error = deltasq(ctx, lpsitmp, lpsi, m, n);
+      error = deltasq(ctx, lpsitmp, lpsi);

       if (!irrotational)
       {
-        error += deltasq(ctx, lzettmp, lzet, m, n);
+        error += deltasq(ctx, lzettmp, lzet);
       }

       error = sqrt(error);

cudax/examples/stf/graph_algorithms/pagerank.cu

Lines changed: 16 additions & 41 deletions
@@ -21,22 +21,6 @@

 using namespace cuda::experimental::stf;

-/**
- * @brief Performs an atomic maximum operation on floating-point numbers by reinterpreting them as integers.
- *
- * @param address Pointer to the float value that will be updated.
- * @param val The float value to compare and possibly set at the address.
- * @return The old value at the address (reinterpreted as a float).
- */
-__device__ float atomicMaxFloat(float* address, float val)
-{
-  int* address_as_int = (int*) address;
-  int old = *address_as_int;
-  int new_val = __float_as_int(val);
-  atomicMax(address_as_int, new_val);
-  return __int_as_float(old);
-}
-
 /**
  * @brief Calculates the PageRank for a given vertex.
  *
@@ -49,10 +33,10 @@ __device__ float atomicMaxFloat(float* address, float val)
  */
 __device__ void calculating_pagerank(
   int idx,
-  slice<const int> loffsets,
-  slice<const int> lnonzeros,
-  slice<const float> lpage_rank,
-  slice<float> lnew_page_rank,
+  const slice<const int>& loffsets,
+  const slice<const int>& lnonzeros,
+  const slice<const float>& lpage_rank,
+  slice<float>& lnew_page_rank,
   float init_rank)
 {
   float rank_sum = 0.0;
@@ -77,7 +61,6 @@ int main()
   int num_vertices = offsets.size() - 1;
   float init_rank = 1.0f / num_vertices;
   float tolerance = 1e-6f;
-  float max_diff = 0.0f;
   int NITER = 100;

   // output pageranks for each vertex
@@ -88,34 +71,26 @@
   auto lnonzeros = ctx.logical_data(&nonzeros[0], nonzeros.size());
   auto lpage_rank = ctx.logical_data(&page_rank[0], page_rank.size());
   auto lnew_page_rank = ctx.logical_data(&new_page_rank[0], new_page_rank.size());
-  auto lmax_diff = ctx.logical_data(&max_diff, {1});
+  auto lmax_diff = ctx.logical_data(shape_of<scalar<float>>());

   for (int iter = 0; iter < NITER; ++iter)
   {
     // Calculate Current Iteration PageRank
-    ctx.parallel_for(box(num_vertices), loffsets.read(), lnonzeros.read(), lpage_rank.rw(), lnew_page_rank.rw())
-        ->*[init_rank] __device__(size_t idx, auto loffsets, auto lnonzeros, auto lpage_rank, auto lnew_page_rank) {
+    ctx.parallel_for(
+      box(num_vertices),
+      loffsets.read(),
+      lnonzeros.read(),
+      lpage_rank.rw(),
+      lnew_page_rank.rw(),
+      lmax_diff.reduce(reducer::maxval<float>{}))
+        ->*[init_rank] __device__(
+             size_t idx, auto loffsets, auto lnonzeros, auto lpage_rank, auto lnew_page_rank, auto& max_diff) {
              calculating_pagerank(idx, loffsets, lnonzeros, lpage_rank, lnew_page_rank, init_rank);
-        };
-
-    // Calculate Current Iteration Error
-    ctx.parallel_for(box(1), lmax_diff.write())->*[] __device__(size_t, auto lmax_diff) {
-      lmax_diff(0) = 0.0f;
-    };
-
-    // Calculate Current Iteration Error
-    ctx.parallel_for(box(num_vertices), lpage_rank.read(), lnew_page_rank.read(), lmax_diff.rw())
-        ->*[] __device__(size_t idx, auto lpage_rank, auto lnew_page_rank, auto lmax_diff) {
-          atomicMaxFloat(lmax_diff.data_handle(), fabs(lnew_page_rank[idx] - lpage_rank[idx]));
+             max_diff = ::std::max(max_diff, lnew_page_rank[idx] - lpage_rank[idx]);
         };

     // Reduce Error and Check for Convergence
-    bool converged;
-    ctx.task(exec_place::host, lmax_diff.read())->*[tolerance, &converged](cudaStream_t s, auto max_diff) {
-      cuda_safe_call(cudaStreamSynchronize(s));
-      converged = (max_diff(0) < tolerance);
-    };
-
+    bool converged = (ctx.wait(lmax_diff) < tolerance);
     if (converged)
     {
       break;
