Increase array rank to eliminate outer loop dependencies for temporary arrays #2896

Open
sergisiso opened this issue Feb 12, 2025 · 2 comments
Labels: LFRic (Issue relates to the LFRic domain), NEMO (Issue relates to the NEMO domain), NG-ARCH (Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH)


sergisiso commented Feb 12, 2025

This is similar to #2671: there are cases where, in order to parallelise an outer loop, we need a temporary array that each iteration of the loop uses privately, so that no dependencies are reported between its uses in different iterations. #2671 focused on doing this with the "private" clause of the directive, but another alternative is to increase the rank of the work array so that each iteration uses a different location.

This is useful in both NEMO and LFRic.

In NEMO the work arrays and loops are in the same file (e.g. src/OCE/ZDF/zdftke.F90)

real(kind=wp), dimension(nlay_i) :: ztmp
! We want to parallelise this loop
do ji = 1, npti, 1
  do jk = 1, nlay_i, 1
     ztmp(jk) = 3
  enddo
  ! In the code these loops cannot be fused, otherwise we could just do it and scalarise ztmp
  do jk = 1, nlay_i, 1
     field(jk, ji) = ztmp(jk)
  enddo
end do

The proposed transformation will take a symbol and a loop: it increases the rank of the symbol's declaration by the bounds of the loop, and indexes each reference inside the loop with the loop iteration variable. It requires that no references to the symbol exist outside the loop. The resulting code will be:

real(kind=wp), dimension(nlay_i, 1:npti) :: ztmp
! We want to parallelise this loop
do ji = 1, npti, 1
  do jk = 1, nlay_i, 1
     ztmp(jk, ji) = 3
  enddo
  ! In the code these loops cannot be fused, otherwise we could just do it and scalarise ztmp
  do jk = 1, nlay_i, 1
     field(jk, ji) = ztmp(jk, ji)
  enddo
end do
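The effect of the transformation can be sketched in Python (NumPy stands in for the Fortran arrays; the names npti, nlay_i, ztmp and field come from the snippet above, while the sizes are illustrative). After the rank increase, each outer iteration writes a disjoint column of ztmp, so the iterations are independent and the ji loop can be parallelised:

```python
import numpy as np

# Illustrative sizes (arbitrary; the Fortran snippet leaves them runtime-defined).
npti, nlay_i = 4, 3
field = np.zeros((nlay_i, npti))

# Rank-increased temporary: one column per outer iteration, so the
# iterations of `ji` touch disjoint memory and carry no dependency.
ztmp = np.empty((nlay_i, npti))
for ji in range(npti):
    for jk in range(nlay_i):
        ztmp[jk, ji] = 3.0
    # ... other work that prevents fusing the two jk loops ...
    for jk in range(nlay_i):
        field[jk, ji] = ztmp[jk, ji]

assert np.all(field == 3.0)
```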

In LFRic it is a bit more complicated, because there is a call between the loop and the loop body, and the temporaries are local arrays inside the kernel:

do cell = 1, num_cells
  call kernel(..., map_field1(:,cell), map_field2(:,cell))
enddo

subroutine kernel(...)
  real, dimension(ndf1) :: x_e
	
   do k = 0, nlayers-1
     do df = 1, ndf2
       x_e(df) = x(map2(df)+k)
     end do
     ... uses x_e and cannot be fused and scalarised ...
   enddo
   
end subroutine kernel

So the array (x_e) and the loop (over cells) are in different scopes. We have two alternatives:

  • Move the local work array to the module scope, pass the cell index as an argument, and apply the transformation with the loop and the references to update in different scopes (seems dangerous).
  • Convert the local array to an argument and apply the rank-increasing transformation just at the call site.

The second alternative seems easier/safer; the resulting code should look like:

real, dimension(ndf1, 1:num_cells) :: x_e

do cell = 1, num_cells
    call kernel(..., map_field1(:,cell), map_field2(:,cell), x_e(:, cell) )
enddo

subroutine kernel(...)
   real, dimension(ndf1), intent(inout) :: x_e

   do k = 0, nlayers-1
     do df = 1, ndf2
       x_e(df) = x(map2(df)+k)
     end do
     ... uses x_e and cannot be fused and scalarised ...
   enddo
   
end subroutine kernel
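The call-site variant can also be sketched in Python (a hypothetical analogue, not the PSyclone implementation): the caller owns the rank-increased work array and hands each call a distinct slice, matching the x_e(:, cell) actual argument above.

```python
import numpy as np

# Illustrative sizes; num_cells and ndf1 are taken from the snippet above.
num_cells, ndf1 = 5, 4

def kernel(x_e_cell, value):
    # Writes only through its per-cell slice, mimicking the
    # intent(inout) dummy argument of the Fortran kernel.
    x_e_cell[:] = value

# Rank-increased work array owned by the caller.
x_e = np.empty((ndf1, num_cells))
for cell in range(num_cells):
    kernel(x_e[:, cell], float(cell))   # each call sees a disjoint column

assert [x_e[0, c] for c in range(num_cells)] == [0.0, 1.0, 2.0, 3.0, 4.0]
```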
@sergisiso commented:
What I still don't understand is why the performance differs (2x in some tests) from the "private" approach (clauses or local scope). I would assume that private heap-allocated arrays end up in some kind of pre-allocated arena (this is all implementation detail; OpenMP and Fortran don't prescribe, or even mention, anything about this). We have seen the performance improve by setting export CRAY_ACC_MALLOC_HEAPSIZE=512MB and export NV_ACC_POOL_THRESHOLD=75; is it that we request too much of their internal pre-allocated space?

Pre-allocated work arrays are something I thought about implementing if the transformation above uses too much space (in LFRic I see 550 of these work arrays), but this would just cycle back to something similar to what the compilers may already implement internally.

There is also the idea of replacing the existing dimension with a constant number (equal to or larger than the runtime value), instead of increasing the rank, in order to favour stack or register temporaries. E.g. in LFRic some are as small as real, dimension(10) :: tmp; in NEMO a jk dimension is O(100). Would that impact performance?
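The constant-dimension alternative can be sketched as follows (a Python illustration; MAX_NLAY is an assumed compile-time upper bound, not a name from the codebase). The temporary is padded to a fixed size known at compile time, which is what lets a Fortran compiler place it on the stack or in registers; only the first nlay_i entries are ever used:

```python
# Assumed compile-time upper bound, >= any runtime size we expect.
MAX_NLAY = 128
# Actual runtime size (O(100) for a jk dimension in NEMO, per the text).
nlay_i = 100

# Fixed-size temporary; only tmp[:nlay_i] is meaningful.
tmp = [0.0] * MAX_NLAY
for jk in range(nlay_i):
    tmp[jk] = 3.0

# The padding beyond nlay_i is never touched.
assert tmp[nlay_i - 1] == 3.0 and tmp[nlay_i] == 0.0
```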

Regardless, I will implement the transformation as stated above; PSyclone's goal is to provide the different options, then scripts for each platform/compiler choose what works for them.

@sergisiso commented:

After a chat with @addy419, the aspect that I was missing for NEMO is that the "increase rank" transformation is just one step in the planned optimisation pipeline. The outer loops that we are targeting will also need loop splitting once outer-loop parallelisation is possible. This means that the temporary arrays will need to survive between GPU kernel launches (in GPU memory), and therefore keeping them in arrays is better than using the private-clause mechanisms.

zdftke.f90 and dynzdf.f90 are good examples of this.
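The loop-splitting argument can be made concrete with a small Python sketch (names mirror the NEMO fragment earlier in the issue; sizes are illustrative). After fission, ztmp must carry values from the first outer loop (one GPU kernel launch) to the second, which a private clause cannot provide since private storage does not outlive its parallel region:

```python
import numpy as np

npti, nlay_i = 4, 3
field = np.zeros((nlay_i, npti))
ztmp = np.empty((nlay_i, npti))

# First split loop: fill the temporary (would be kernel launch 1).
for ji in range(npti):
    for jk in range(nlay_i):
        ztmp[jk, ji] = 3.0

# Second split loop: consume it (would be kernel launch 2). This only
# works because the rank-increased ztmp persists between the loops.
for ji in range(npti):
    for jk in range(nlay_i):
        field[jk, ji] = ztmp[jk, ji]

assert np.all(field == 3.0)
```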

@sergisiso sergisiso self-assigned this Feb 20, 2025
@sergisiso sergisiso added NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH NEMO Issue relates to the NEMO domain LFRic Issue relates to the LFRic domain labels Feb 20, 2025
sergisiso added a commit that referenced this issue Feb 28, 2025