sergisiso opened this issue on Feb 12, 2025 · 2 comments
Labels:
LFRic: Issue relates to the LFRic domain
NEMO: Issue relates to the NEMO domain
NG-ARCH: Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH
This is similar to #2671. In some cases, parallelising an outer loop requires that a temporary array used by every iteration of the loop be private to each iteration, so that no dependencies are reported between iterations. #2671 focused on achieving this with the "private" clause of the directive; an alternative is to increase the rank of the work array so that each iteration uses a different location.
This is useful in both NEMO and LFRic.
In NEMO the work arrays and loops are in the same file (e.g. src/OCE/ZDF/zdftke.F90)
real(kind=wp), dimension(nlay_i) :: ztmp

! We want to parallelise this loop
do ji = 1, npti, 1
  do jk = 1, nlay_i, 1
    ztmp(jk) = 3
  enddo
  ! In the code these loops cannot be fused, otherwise we could just do it and scalarise ztmp
  do jk = 1, nlay_i, 1
    field(jk, ji) = ztmp(jk)
  enddo
end do
The proposed transformation takes a symbol and a loop. It increases the rank of the symbol declaration using the bounds of the loop, and adds the loop iteration variable as an extra index to each reference to the symbol inside the loop. The symbol must not be referenced outside the loop. The resulting code will be:
real(kind=wp), dimension(nlay_i, 1:npti) :: ztmp

! We want to parallelise this loop
do ji = 1, npti, 1
  do jk = 1, nlay_i, 1
    ztmp(jk, ji) = 3
  enddo
  ! In the code these loops cannot be fused, otherwise we could just do it and scalarise ztmp
  do jk = 1, nlay_i, 1
    field(jk, ji) = ztmp(jk, ji)
  enddo
end do
In LFRic it is a bit more complicated because there is a call between the loop and the loop body, and the temporaries are local arrays inside the kernel:
do cell = 1, num_cells
  call kernel(..., map_field1(:,cell), map_field2(:,cell))
enddo

subroutine kernel(...)
  real, dimension(ndf1) :: x_e
  do k = 0, nlayers-1
    do df = 1, ndf2
      x_e(df) = x(map2(df)+k)
    end do
    ... uses x_e and cannot be fused&scalarized ...
  enddo
end subroutine kernel
So the array (x_e) and the loop (over cells) are in different locations. We have 2 alternatives:
1. Move the local work array to the module scope, pass the cell as an argument, and apply the transformation with the loop and the references to update in different scopes (seems dangerous).
2. Convert the local array to an argument and apply the rank-increasing transformation just at the call site.
The second alternative seems easier/safer; the resulting code should look like:
! The new cell dimension is appended last, to match the x_e(:, cell) slice below
real, dimension(ndf1, 1:num_cells) :: x_e
do cell = 1, num_cells
  call kernel(..., map_field1(:,cell), map_field2(:,cell), x_e(:, cell))
enddo

subroutine kernel(...)
  ! x_e is written and then read inside the kernel, so it cannot be intent(in)
  real, dimension(ndf1), intent(inout) :: x_e
  do k = 0, nlayers-1
    do df = 1, ndf2
      x_e(df) = x(map2(df)+k)
    end do
    ... uses x_e and cannot be fused&scalarized ...
  enddo
end subroutine kernel
What I still don't understand is why the performance differs (2x in some tests) from the private approach (clauses or local scope). I would assume that private heap-allocated arrays end up in some kind of pre-allocated arena (these are all implementation details; OpenMP and Fortran don't prescribe, or even mention, anything about this). We have seen the performance improve by setting export CRAY_ACC_MALLOC_HEAPSIZE=512MB and export NV_ACC_POOL_THRESHOLD=75; is it that we request too much of their internal pre-allocated space?
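To make the two variants being compared concrete, here is a minimal sketch based on the NEMO-style loop from the issue description. The program name, the array sizes, the `field_a`/`field_b`/`ztmp2` names and the use of OpenMP `parallel do` are assumptions chosen for illustration; the real code may use different directives.

```fortran
program private_vs_rank
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: nlay_i = 4, npti = 8        ! illustrative sizes
  real(kind=wp) :: ztmp(nlay_i)                     ! work array at its original rank
  real(kind=wp) :: ztmp2(nlay_i, npti)              ! rank-increased work array
  real(kind=wp) :: field_a(nlay_i, npti), field_b(nlay_i, npti)
  integer :: ji, jk

  ! Variant 1: every thread gets its own copy of ztmp through the private clause
  !$omp parallel do private(ztmp, jk)
  do ji = 1, npti
    do jk = 1, nlay_i
      ztmp(jk) = real(jk * ji, wp)
    end do
    do jk = 1, nlay_i
      field_a(jk, ji) = ztmp(jk)
    end do
  end do
  !$omp end parallel do

  ! Variant 2: ztmp2 is shared, but each iteration only touches column ji
  !$omp parallel do private(jk)
  do ji = 1, npti
    do jk = 1, nlay_i
      ztmp2(jk, ji) = real(jk * ji, wp)
    end do
    do jk = 1, nlay_i
      field_b(jk, ji) = ztmp2(jk, ji)
    end do
  end do
  !$omp end parallel do

  ! Both variants must produce the same field
  if (any(field_a /= field_b)) error stop "variants disagree"
  print *, "both variants agree"
end program private_vs_rank
```

Without OpenMP enabled the directives are treated as comments, so the program also runs serially.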
Pre-allocated work arrays are something I thought of implementing if the transformation above uses too much space (in LFRic I see 550 of these work arrays), but this would just cycle back to something similar to what the compilers may already implement internally.
There is also the idea of replacing the existing dimension with a constant (equal to or larger than the runtime value) instead of increasing the rank, in order to favour stack or register temporaries; e.g. in LFRic some are as small as real, dimension(10) :: tmp, while in NEMO a jk dimension is O(100). Would that impact performance?
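A minimal sketch of the constant-extent idea, under stated assumptions: `max_ndf`, `ndf_runtime` and this toy `kernel` are hypothetical names, and the bound of 10 is taken from the LFRic example above. Whether a compiler actually places such a temporary on the stack or in registers is implementation dependent.

```fortran
program fixed_size_tmp
  implicit none
  integer, parameter :: max_ndf = 10   ! hypothetical compile-time upper bound
  integer :: ndf_runtime
  real :: total

  ndf_runtime = 6                      ! value only known at run time in practice
  call kernel(ndf_runtime, total)

  ! 1 + 2 + ... + 6 = 21
  if (abs(total - 21.0) > 1.0e-6) error stop "unexpected sum"
  print *, "total =", total

contains

  subroutine kernel(ndf, total)
    integer, intent(in) :: ndf
    real, intent(out) :: total
    ! Constant extent instead of dimension(ndf): the size is known at compile time
    real, dimension(max_ndf) :: tmp
    integer :: df

    ! The constant must be equal to or larger than the runtime value
    if (ndf > max_ndf) error stop "ndf exceeds max_ndf"

    do df = 1, ndf
      tmp(df) = real(df)
    end do
    ! Only the first ndf entries are ever used
    total = sum(tmp(1:ndf))
  end subroutine kernel

end program fixed_size_tmp
```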
Regardless, I will implement the transformation as stated above. PSyclone's goal is to provide the different options; the scripts for each platform/compiler then choose what works for them.
After a chat with @addy419, the aspect that I was missing for NEMO is that "increase-rank" is just one step in the planned optimisation pipeline. The outer loops that we are targeting will also need loop splitting once outer loop parallelisation is possible. This means that the "temporary arrays" will need to survive between GPU kernel launches (in GPU memory), and therefore keeping them in arrays is better than using the private clause mechanism.
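A minimal sketch of why the rank-increased array matters once the outer loop is split, reusing the NEMO-style example from the issue description. The program name and sizes are assumptions, and the parallel/offload directives are omitted on purpose, since the choice of directive is not fixed here. Each `ji` loop stands for what could become a separate GPU kernel launch.

```fortran
program split_after_rank_increase
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: nlay_i = 4, npti = 8        ! illustrative sizes
  real(kind=wp) :: ztmp(nlay_i, npti)               ! rank-increased work array
  real(kind=wp) :: field(nlay_i, npti)
  integer :: ji, jk

  ! First loop after splitting (candidate for one kernel launch):
  ! every iteration writes its own column of ztmp
  do ji = 1, npti
    do jk = 1, nlay_i
      ztmp(jk, ji) = 3.0_wp
    end do
  end do

  ! Second loop after splitting (candidate for a second kernel launch):
  ! it reads the values produced by the first loop, so ztmp must still exist.
  ! A private copy would not survive the end of the first parallel region.
  do ji = 1, npti
    do jk = 1, nlay_i
      field(jk, ji) = ztmp(jk, ji)
    end do
  end do

  if (any(field /= 3.0_wp)) error stop "unexpected field values"
  print *, "field filled from ztmp across two loops"
end program split_after_rank_increase
```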
zdftke.f90 and dynzdf.f90 are good examples for this
sergisiso added the NG-ARCH, NEMO and LFRic labels on Feb 20, 2025