Open
Description
In tests using stmatrix, e.g., MmaTest/HopperRSStmatrix.SingleTileWithTMALoadStoreStMatrix/4, the generated code contains a pattern like:
#pragma unroll
for(nvfuser_index_t i39 = 0; i39 < 3; ++i39) {
  arraySet<__half, 4>(&T5[(i8 + (8 * i39))], (__half)0);
}
#pragma unroll
for(nvfuser_index_t i39 = 0; i39 < 3; ++i39) {
  if ((b13 && (i14 < (-(8 * i39))))) {
    stmatrix2(...);
  }
}
Here, T5 is zero-initialized for predicate elimination and then written by the stmatrix op that follows. For this to be correct, a __syncthreads() must be placed between the arraySet and the stmatrix2; otherwise, the two writes to T5 can race (a WAW hazard).
The real fix, however, would likely be to remove the zero initialization. It should not be necessary, since the entire buffer is overwritten anyway.
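The hazard and the barrier workaround described above can be sketched with a small standalone kernel. This is an illustration only, not nvFuser-generated code: the kernel name, the buffer `buf`, and the cross-thread store are hypothetical stand-ins, and the sketch assumes the race arises because the thread that zeroes an element is not necessarily the thread whose store later writes it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 64;  // two warps, so the stores cross warp boundaries

// Hypothetical stand-in for the generated pattern: every thread zero-fills
// its own slot, then overwrites a slot that a different thread zeroed.
__global__ void zeroThenCrossThreadStore(float* out) {
  __shared__ float buf[kThreads];
  const int tid = threadIdx.x;

  // Phase 1: zero initialization (plays the role of arraySet).
  buf[tid] = 0.0f;

  // Without this barrier, the zero written by one thread could land after
  // the real value written by another thread in phase 2 (WAW race).
  __syncthreads();

  // Phase 2: store into a slot owned by a thread in the other warp
  // (plays the role of the stmatrix store).
  const int dst = (tid + 32) % kThreads;
  buf[dst] = static_cast<float>(tid + 1);

  // Make all phase-2 stores visible before reading the buffer back.
  __syncthreads();
  out[tid] = buf[tid];
}

int main() {
  float* out = nullptr;
  cudaError_t err = cudaMallocManaged(&out, kThreads * sizeof(float));
  if (err != cudaSuccess) {
    std::printf("allocation failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  zeroThenCrossThreadStore<<<1, kThreads>>>(out);
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::printf("kernel failed: %s\n", cudaGetErrorString(err));
    cudaFree(out);
    return 1;
  }

  // Slot i is written by thread (i + 32) % kThreads with value (thread + 1).
  int mismatches = 0;
  for (int i = 0; i < kThreads; ++i) {
    const float expected = static_cast<float>((i + 32) % kThreads + 1);
    if (out[i] != expected) {
      ++mismatches;
    }
  }
  std::printf("mismatches: %d\n", mismatches);

  cudaFree(out);
  return mismatches == 0 ? 0 : 1;
}
```

Deleting the phase 1 store, rather than adding the first barrier, corresponds to the preferred fix: since phase 2 overwrites every slot, the zero fill contributes nothing and the first `__syncthreads()` becomes unnecessary.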