Commit c025dbd
[STF] reduce access mode (#2830)
* Experiment to start introducing a reduction access mode used in kernels (e.g. parallel_for).
* Add a trait to count the number of reductions required in a tuple of deps
* WIP: create a new scalar<T> interface which can be used in a reduction access mode, and start to implement all the mechanisms for reductions in parallel_for
* WIP! Introduce owning_container_of trait class
* WIP: save progress here, lots of hardcoded things and we need to move to cuda::std::tuple
* WIP: first prototype working...
* Proper initialization of shared memory buffers, and add another example
* Some cleanups and renaming of classes for better clarity
* clang-format
* work around a spurious "unused captured variable" warning
* Fix various C++ errors, and do not use the I variable
* Rework the CFD example to use reductions, and generalize the transfer_host method
* clang-format
* Implement transfer_host (name subject to change!) directly in the context
* clang-format
* Make it possible to either accumulate a reduction result with an existing value, or initialize a new one
* Implement a set of predefined reducers
* clang-format
* move the definition of do_init and no_init
* update word count example
* Code simplification to facilitate the transition to ::cuda::std::tuple
* Use ::cuda::std::tuple for reduction variables
* use proper type for the size of buffers
* clang-format
* remove unused variables
* fix buffer size
* add missing typename
* Add missing typename
* Add maybe_unused for variables currently unused in WIP code
* clang-format
* add a doxygen comment
* Add missing constructors
* Code cleanup
* remove dead code
* task_dep_op_none should just be a tag type, there is no need to implement a fake apply_op operation
* Remove dead code
* Remove unused template parameter
* Slightly simpler count_type trait
* clang-format
* Add a small unit test to test count_type_v
* Do not define both no_init and do_init types anymore, just expose no_init to
user, then use true_type and false_type internally. Also rename reduce_do_init
to reduce for clarity, as this is the most common case.
* sort examples in cmake
* clang-format
* Simplify redux_vars
* Use ::std::monostate instead of EmptyType
* Simplify redux_vars
* clang-format
* Add a missing doxygen comment
* Replace 01-axpy-reduce.cu with 09-dot-reduce.cu which is a more meaningful example
* clang-format
* fix word count example
* Minimize copying of dependencies
* - Fix how we load data into shared memory during the finalization kernel of the reduction.
- Fix errors where block size and grid size were inverted
* clang-format
* Example to compute pi using Monte Carlo method
* Add a unit test to ensure the reduce access mode works
* clang-format
* Not all ASCII chars between 'A' and 'z' are alphanumeric chars
* remove dead code
* minor cleanups
* Not all ASCII chars between 'A' and 'z' are alphanumeric chars
* no need for type alias when we use it once only
* Fix pi test
* Move reduction operator and init flag to task_dep, step 1
* Add a new test to check that the scalar interface works as expected (it is currently broken on graphs)
* Fully implement the scalar interface
* fix potentially uninitialized variable warnings
* fix unused variable warning
* Add a test to ensure we properly deal with empty shapes in parallel_for: this
requires initializing the variable. It is failing currently because we had not
implemented this initialization.
* clang-format
* Implement the CUDA kernel for reduction with empty shapes
* Move reduction operator and init flag to task_dep, step 2: parallel_for_scope is templated on deps_ops not their arguments
* Move reduction operator and init flag to task_dep, step 3: make parallel_for overloads more generic
* Fix the finalize kernel if there are more threads than items
* clang-format
* Implementation of the reduce access mode for CUDA graphs
* Test empty shapes with reductions on both stream and graphs
* Move reduction operator and init flag to task_dep, step 4: eliminate task_dep_op entirely
* clang-format
* fix parallel_for on host
* Disable nvrtc workaround (#1116)
nvbug3961621 has been fixed in 12.2
Addresses nvbug4263883
* Tighten overloading of context::parallel_for
* clang-format
* Optimize loop function by hoisting lambda definition out of the loop and by using universal references for intermediate lambdas
* No need for SelectType
* A few more improvements
* Fix build
* Documentation for reduce()
* Improve doc for reduce()
* Rename transfer_host to wait
* doxygen blocks for reducer operators
* Add missing doxygen blocks or make them more accurate
* Remove commented code
* remove printf
* Add sanity checks to detect unimplemented uses of reduce()
* Fix a logic error
* remove maybe_unused that is not needed anymore
* Properly handle reduce on a CUDA graph that is not executed by device 0
* Reimplement pagerank using a reduce access mode
* No need for atomicMaxFloat when using reduce(reducer::maxval<float>{})
* use references in calculating_pagerank
* Add a missing doxygen block for scalar<T>
* Remove count_type_v and count_type which are not used anymore
* replace an atomic add by a reduction
* Simpler scalar implementation with a struct
* Comment to clarify get_owning_container_of
* Remove useless ctor
* fix spelling issue
* clang-format
* Explain how we statically dispatch between the different task_dep(_untyped)
constructors depending on whether they are read-only. This is something that
could be further improved!
* Do provide constructors for scalar<T>
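
Taken together, the commits above suggest a user-facing flow roughly like the following sketch. This is assembled from the commit subjects only (scalar<T>, the reduce() access mode, the predefined reducer::sum operator, and the transfer_host-renamed-to-wait method); the exact CUDASTF type names and signatures in this PR may differ.

```cuda
// Hypothetical usage sketch based on the commit messages; not verified
// against the actual headers introduced by this PR.
#include <cuda/experimental/stf.cuh>

using namespace cuda::experimental::stf;

int main()
{
  context ctx;

  const size_t N = 1024;
  auto lx   = ctx.logical_data(shape_of<slice<double>>(N));
  auto ly   = ctx.logical_data(shape_of<slice<double>>(N));
  auto lsum = ctx.logical_data(shape_of<scalar<double>>());

  ctx.parallel_for(lx.shape(), lx.write(), ly.write())
      ->*[] __device__(size_t i, auto x, auto y) {
            x(i) = 1.0;
            y(i) = 2.0;
          };

  // The new reduce() access mode: every parallel_for instance contributes to
  // the scalar, and contributions are combined with a predefined reducer.
  ctx.parallel_for(lx.shape(), lx.read(), ly.read(),
                   lsum.reduce(reducer::sum<double>{}))
      ->*[] __device__(size_t i, auto x, auto y, double& sum) {
            sum += x(i) * y(i);
          };

  // wait() (renamed from transfer_host) brings the result back to the host.
  double dot = ctx.wait(lsum);
  (void) dot;

  ctx.finalize();
}
```

This mirrors the 09-dot-reduce.cu example mentioned above; the pi Monte Carlo, word count, and pagerank examples reportedly follow the same pattern with other predefined reducers such as reducer::maxval.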
---------
Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

1 parent a67360d commit c025dbd
File tree

28 files changed: +1924 −413 lines changed

- cudax
  - examples/stf
    - graph_algorithms
  - include/cuda/experimental
    - __stf
      - graph
      - internal
      - stream
  - test/stf
    - interface
    - parallel_for
    - reductions
- docs/cudax