Increase array rank to eliminate outer loop dependencies for temporary arrays #2896

Open
sergisiso opened this issue Feb 12, 2025 · 2 comments
Labels: LFRic (Issue relates to the LFRic domain), NEMO (Issue relates to the NEMO domain), NG-ARCH (Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH)


sergisiso commented Feb 12, 2025

This is similar to #2671: there are cases where, in order to parallelise an outer loop, we need a temporary array that each iteration of the loop uses privately, so that no dependencies are reported between its uses in different iterations. #2671 focused on doing this with the "private" clause of the directive, but another alternative is to increase the rank of the work array so that each iteration uses a different location.

This is useful in both NEMO and LFRic.

In NEMO the work arrays and loops are in the same file (e.g. src/OCE/ZDF/zdftke.F90)

real(kind=wp), dimension(nlay_i) :: ztmp
! We want to parallelise this loop
do ji = 1, npti, 1
  do jk = 1, nlay_i, 1
     ztmp(jk) = 3
  enddo
  ! In the code these loops cannot be fused, otherwise we could just do it and scalarise ztmp
  do jk = 1, nlay_i, 1
     field(jk, ji) = ztmp(jk)
  enddo
end do

The proposed transformation will take a symbol and a loop: it increases the rank of the symbol's declaration by the bounds of the loop, and indexes each reference inside the loop with the loop iteration variable. It requires that no references to the symbol exist outside the loop. The resulting code will be:

real(kind=wp), dimension(nlay_i, 1:npti) :: ztmp
! We want to parallelise this loop
do ji = 1, npti, 1
  do jk = 1, nlay_i, 1
     ztmp(jk, ji) = 3
  enddo
  ! In the code these loops cannot be fused, otherwise we could just do it and scalarise ztmp
  do jk = 1, nlay_i, 1
     field(jk, ji) = ztmp(jk, ji)
  enddo
end do
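The effect of the transformation can be sketched in Python (NumPy stands in for the Fortran arrays; the names npti, nlay_i, ztmp and field come from the snippet above, while the sizes are illustrative). After the rank increase, each outer iteration writes a disjoint column of ztmp, so the iterations are independent and the ji loop can be parallelised:

```python
import numpy as np

# Illustrative sizes (arbitrary; the Fortran snippet leaves them runtime-defined).
npti, nlay_i = 4, 3
field = np.zeros((nlay_i, npti))

# Rank-increased temporary: one column per outer iteration, so the
# iterations of `ji` touch disjoint memory and carry no dependency.
ztmp = np.empty((nlay_i, npti))
for ji in range(npti):
    for jk in range(nlay_i):
        ztmp[jk, ji] = 3.0
    # ... other work that prevents fusing the two jk loops ...
    for jk in range(nlay_i):
        field[jk, ji] = ztmp[jk, ji]

assert np.all(field == 3.0)
```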

In LFRic it is a bit more complicated, because there is a call between the loop and the loop body, and the temporaries are local arrays inside the kernel:

do cell = 1, num_cells
  call kernel(..., map_field1(:,cell), map_field2(:,cell))
enddo

subroutine kernel(...)
  real, dimension(ndf1) :: x_e
	
   do k = 0, nlayers-1
     do df = 1, ndf2
       x_e(df) = x(map2(df)+k)
     end do
     ... uses x_e and cannot be fused and scalarised ...
   enddo
   
end subroutine kernel

So the array (x_e) and the loop (over cells) are in different scopes. We have two alternatives:

  • Move the local work array to the module scope, pass the cell index as an argument, and apply the transformation with the loop and the references to update in different scopes (seems dangerous).
  • Convert the local array to an argument and apply the rank-increasing transformation just at the call site.

The second alternative seems easier/safer; the resulting code should look like:

real, dimension(ndf1, 1:num_cells) :: x_e

do cell = 1, num_cells
    call kernel(..., map_field1(:,cell), map_field2(:,cell), x_e(:, cell) )
enddo

subroutine kernel(...)
   real, dimension(ndf1), intent(inout) :: x_e

   do k = 0, nlayers-1
     do df = 1, ndf2
       x_e(df) = x(map2(df)+k)
     end do
     ... uses x_e and cannot be fused and scalarised ...
   enddo
   
end subroutine kernel
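The call-site variant can also be sketched in Python (a hypothetical analogue, not the PSyclone implementation): the caller owns the rank-increased work array and hands each call a distinct slice, matching the x_e(:, cell) actual argument above.

```python
import numpy as np

# Illustrative sizes; num_cells and ndf1 are taken from the snippet above.
num_cells, ndf1 = 5, 4

def kernel(x_e_cell, value):
    # Writes only through its per-cell slice, mimicking the
    # intent(inout) dummy argument of the Fortran kernel.
    x_e_cell[:] = value

# Rank-increased work array owned by the caller.
x_e = np.empty((ndf1, num_cells))
for cell in range(num_cells):
    kernel(x_e[:, cell], float(cell))   # each call sees a disjoint column

assert [x_e[0, c] for c in range(num_cells)] == [0.0, 1.0, 2.0, 3.0, 4.0]
```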
@sergisiso commented:
What I still don't understand is why the performance differs (2x in some tests) from the "private" approach (clauses or local scope). I would assume that private heap-allocated arrays end up in some kind of pre-allocated arena (this is all implementation detail; OpenMP and Fortran don't prescribe, or even mention, anything about this). We have seen the performance improve by setting export CRAY_ACC_MALLOC_HEAPSIZE=512MB and export NV_ACC_POOL_THRESHOLD=75; is it that we request too much of their internal pre-allocated space?

Pre-allocated work arrays are something I thought about implementing if the transformation above uses too much space (in LFRic I see 550 of these work arrays), but this would just cycle back to something similar to what the compilers may already implement internally.

There is also the idea of replacing the existing dimension with a constant number (equal to or larger than the runtime value), instead of increasing the rank, in order to favour stack or register temporaries. E.g. in LFRic some are as small as real, dimension(10) :: tmp; in NEMO a jk dimension is O(100). Would that impact performance?
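The constant-dimension alternative can be sketched as follows (a Python illustration; MAX_NLAY is an assumed compile-time upper bound, not a name from the codebase). The temporary is padded to a fixed size known at compile time, which is what lets a Fortran compiler place it on the stack or in registers; only the first nlay_i entries are ever used:

```python
# Assumed compile-time upper bound, >= any runtime size we expect.
MAX_NLAY = 128
# Actual runtime size (O(100) for a jk dimension in NEMO, per the text).
nlay_i = 100

# Fixed-size temporary; only tmp[:nlay_i] is meaningful.
tmp = [0.0] * MAX_NLAY
for jk in range(nlay_i):
    tmp[jk] = 3.0

# The padding beyond nlay_i is never touched.
assert tmp[nlay_i - 1] == 3.0 and tmp[nlay_i] == 0.0
```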

Regardless, I will implement the transformation as stated above; PSyclone's goal is to provide the different options, then scripts for each platform/compiler choose what works for them.

@sergisiso commented:

After a chat with @addy419, the aspect that I was missing for NEMO is that the "increase rank" transformation is just one step in the planned optimisation pipeline. The outer loops that we are targeting will also need loop splitting once outer-loop parallelisation is possible. This means that the temporary arrays will need to survive between GPU kernel launches (in GPU memory), and therefore keeping them in arrays is better than using the private-clause mechanisms.

zdftke.f90 and dynzdf.f90 are good examples of this.
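The loop-splitting argument can be made concrete with a small Python sketch (names mirror the NEMO fragment earlier in the issue; sizes are illustrative). After fission, ztmp must carry values from the first outer loop (one GPU kernel launch) to the second, which a private clause cannot provide since private storage does not outlive its parallel region:

```python
import numpy as np

npti, nlay_i = 4, 3
field = np.zeros((nlay_i, npti))
ztmp = np.empty((nlay_i, npti))

# First split loop: fill the temporary (would be kernel launch 1).
for ji in range(npti):
    for jk in range(nlay_i):
        ztmp[jk, ji] = 3.0

# Second split loop: consume it (would be kernel launch 2). This only
# works because the rank-increased ztmp persists between the loops.
for ji in range(npti):
    for jk in range(nlay_i):
        field[jk, ji] = ztmp[jk, ji]

assert np.all(field == 3.0)
```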

@sergisiso sergisiso self-assigned this Feb 20, 2025
@sergisiso sergisiso added NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH NEMO Issue relates to the NEMO domain LFRic Issue relates to the LFRic domain labels Feb 20, 2025
sergisiso added a commit that referenced this issue Feb 28, 2025