SCC-STACK for ecphys #135

MichaelSt98 · 2023-08-30T13:24:04Z

Note: this PR stack on top of #125 and should only be reviewed once this is merged.

Implements/Complements SCC Stack transformation for the ecphys.

allows for stack size definition in the config file (per routine), which is currently the solution for multi-library targets
allows for pragma !$loki stack-insert to define stack size assignment location in the code, necessary to allow variables used in the stack size expression to be initialised

github-actions · 2023-08-30T13:27:11Z

Documentation for this branch can be viewed at https://sites.ecmwf.int/docs/loki/135/index.html

codecov · 2023-08-30T14:00:22Z

Codecov Report

Merging #135 (8641294) into main (afe0c0e) will decrease coverage by 0.14%.
Report is 13 commits behind head on main.
The diff coverage is 73.97%.

@@            Coverage Diff             @@
##             main     #135      +/-   ##
==========================================
- Coverage   92.14%   92.01%   -0.14%     
==========================================
  Files          89       89              
  Lines       16563    16599      +36     
==========================================
+ Hits        15262    15273      +11     
- Misses       1301     1326      +25

Flag	Coverage Δ
lint_rules	`96.00% <ø> (+0.04%)`	⬆️
loki	`92.01% <44.44%> (-0.09%)`	⬇️
transformations	`91.09% <78.12%> (-0.57%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
transformations/transformations/argument_shape.py	`92.15% <ø> (ø)`
loki/bulk/scheduler.py	`80.15% <28.57%> (-0.94%)`	⬇️
transformations/transformations/pool_allocator.py	`92.03% <75.00%> (-3.86%)`	⬇️
loki/expression/mappers.py	`92.97% <100.00%> (-0.84%)`	⬇️
loki/transform/transform_array_indexing.py	`93.44% <100.00%> (ø)`
...mations/transformations/single_column_coalesced.py	`97.97% <100.00%> (+0.04%)`	⬆️

... and 7 files with indirect coverage changes

reuterbal

Many thanks, this is really a huge step forward and it's fantastic that this works.

However, I think before we bring this in the actual implementation could be improved in a number of ways. I've flagged them in individual comments but will repeat here the most important points in my eyes:

This has a hard requirement on providing per-transformation config values in the toml configuration file. I think now is the right moment to take care of Generic approach for transformation specific settings in SchedulerConfig #28 instead of using the implemented workaround.
Some of the implementation details are very opaque. It is entirely unclear to me why there is a need for the loki pragmas to specify certain injection locations. The entire transformation has become a little hard to read due to various additions, which may indicate that some refactoring is overdue.
The injection location for the stack argument seems to require some intricate index fuffing. I can't remember the exact details of the offline discussion but I think the problem is generic enough to justify making this available as a utility transformation.
The testing of the pool allocator transformation has become quite messy. I think it would be good to not convolute too many special cases and variations into the same test and instead replicate the tests and check the various cases separately. We can use fixtures to share commonalitites between the tests to avoid redundant code.

reuterbal · 2023-09-01T20:30:31Z

loki/bulk/scheduler.py

@@ -55,6 +55,13 @@ def __init__(self, default, routines, disable=None, dimensions=None, dic2p=None,
            self.routines = CaseInsensitiveDict(routines)
        else:
            self.routines = CaseInsensitiveDict((r.name, r) for r in as_tuple(routines))
+        for routine in self.routines:


Is this a subtle way of providing transformation config via the config file by storing it as values to a routine with name trafo? I would argue that we should implement #28 and thus have a proper way of providing such configs rather than using a hacky kind of implementation like this.

reuterbal · 2023-09-01T20:31:13Z

transformations/tests/test_pool_allocator.py

-                          definitions=definitions)
+    stack_size = "400*nz*nlon"
+
+    if stack_size_config == "dict":


I think for readability it would be better to separate this into different tests.

reuterbal · 2023-09-01T20:34:00Z

transformations/transformations/pool_allocator.py

@@ -121,6 +122,7 @@ def __init__(self, block_dim,
        self.local_ptr_var_name_pattern = local_ptr_var_name_pattern
        self.directive = directive
        self.check_bounds = check_bounds
+        self.trafo_dict = {}


Does this really need to be a property on the Transformation class object? It gets overwritten for every item/subroutine that is being processed, so shouldn't really be stored as a state on the object.

reuterbal · 2023-09-01T20:36:12Z

transformations/transformations/pool_allocator.py

+                                                                            parameters=(stack_size, stack_type_bytes)),
+                                                                 '==', Literal(0))), inline=True, body=(padding,),


Something odd is going on with formatting here

reuterbal · 2023-09-01T20:37:34Z

transformations/transformations/pool_allocator.py

+                                          parameters=(stack_type_bytes, Literal(8)), kw_parameters=())
+            if "stack-size" in self.trafo_dict:
+                scope = Scope()
+                #TODO: parse_fparser_expression only for fparser ...


I'd argue it's acceptable to have a hard dependency on this here

reuterbal · 2023-09-01T20:38:39Z

transformations/transformations/pool_allocator.py

@@ -345,12 +358,24 @@ def _get_stack_storage_and_size_var(self, routine, stack_size):
        if variables_append:
            routine.variables += as_tuple(variables_append)
        if body_prepend:
-            routine.body.prepend(body_prepend)
+            if not self._insert_stack_at_loki_pragma(routine, body_prepend):


Why is there a requirement for this pragma? Can the injection point lookup be improved instead?

reuterbal · 2023-09-01T20:41:01Z

transformations/transformations/pool_allocator.py

@@ -13,7 +13,8 @@
    SymbolAttributes, BasicType, DerivedType, Quotient, IntLiteral, LogicLiteral,
    Variable, Array, Sum, Literal, Product, InlineCall, Comparison, RangeIndex, Cast,
    Intrinsic, Assignment, Conditional, CallStatement, Import, Allocation, Deallocation, is_dimension_constant,
-    Loop, Pragma, SubroutineItem, FindInlineCalls, Interface, ProcedureSymbol, LogicalNot, dataflow_analysis_attached
+    Loop, Pragma, SubroutineItem, FindInlineCalls, Interface, ProcedureSymbol, LogicalNot, dataflow_analysis_attached,
+    expression, parse_fparser_expression, Scope


I'd prefer to either import the relevant expressions directly (e.g. RangeIndex) or to use the elsewhere commonly used pattern import loki.expression.symbols as sym

reuterbal · 2023-09-01T20:41:26Z

transformations/transformations/pool_allocator.py

        """
        Add the pool allocator argument into subroutine calls
        """
        call_map = {}
        stack_var = self._get_local_stack_var(routine)
        for call in FindNodes(CallStatement).visit(routine.body):
-            if call.name in targets:
+            if call.name in targets or call.routine.name.lower() in ignore: #['surfpp', 'surfexcdriver', 'surfseb']:


debug leftover?

reuterbal · 2023-09-01T20:42:06Z

transformations/transformations/pool_allocator.py

@@ -129,7 +131,11 @@ def transform_subroutine(self, routine, **kwargs):

        role = kwargs['role']
        item = kwargs.get('item', None)
+        ignore = item.ignore if item else ()


Shouldn't ignored routines be excluded from targets?

reuterbal · 2023-09-01T20:44:25Z

transformations/transformations/pool_allocator.py

+                if call.routine != BasicType.DEFERRED and self.stack_local_var_name not in call.arguments:
+                    if self.stack_argument_name in call.routine.arguments:
+                        arg_idx = call.routine.arguments.index(self.stack_argument_name)
+                        arguments = call.arguments
+                        kwarguments = call.kwarguments
+                        if arg_idx <= len(arguments):
+                            arguments = arguments[:arg_idx] + (stack_var,) + arguments[arg_idx:]
+                        else:
+                            arg_idx -= len(arguments)
+                            kwarguments = kwarguments[:arg_idx] + ((self.stack_argument_name, stack_var),) + \
+                                          kwarguments[arg_idx:]
+                        call_map[call] = call.clone(arguments=arguments, kwarguments=kwarguments)
+                    else:
+                        arguments = call.arguments
+                        kwarguments = call.kwarguments
+                        arg_pos = [call.routine.arguments.index(arg) for arg in call.routine.arguments
+                                   if arg.type.optional]
+                        if arg_pos:
+                            if arg_pos[0] < len(arguments):
+                                arguments = arguments[:arg_pos[0]] + (stack_var,) + arguments[arg_pos[0]:]
+                            else:
+                                arg_idx = arg_pos[0] - len(arguments)
+                                kwarguments = kwarguments[:arg_idx] + ((self.stack_argument_name, stack_var),) + \
+                                              kwarguments[arg_idx:]
+                        else:
+                            if kwarguments:
+                                kwarguments += ((self.stack_argument_name, stack_var),)
+                            else:
+                                arguments += (stack_var,)
+                        call_map[call] = call.clone(arguments=as_tuple(arguments), kwarguments=as_tuple(kwarguments))


What's going on here? I suspect this has to do with picking the right injection point when there are keyword arguments present?
That index-fuffing is a bit hard to follow, could this be made more elegant or maybe even outlined into a utility routine (because I think we discussed this interface update problem in a wider context before)?

MichaelSt98 · 2023-09-04T07:53:29Z

Many thanks, this is really a huge step forward and it's fantastic that this works.

However, I think before we bring this in the actual implementation could be improved in a number of ways. I've flagged them in individual comments but will repeat here the most important points in my eyes:

Thanks @reuterbal for having a look and thanks for the comments.

This has a hard requirement on providing per-transformation config values in the toml configuration file. I think now is the right moment to take care of Generic approach for transformation specific settings in SchedulerConfig #28 instead of using the implemented workaround.

I agree. However, for this transformation we need transformation and routine specific config entries. Therefore it's (at least in my opinion) up for discussion what the best way is to do this or rather whether it is like described in Generic approach for transformation specific settings in SchedulerConfig #28.

Some of the implementation details are very opaque. It is entirely unclear to me why there is a need for the loki pragmas to specify certain injection locations. The entire transformation has become a little hard to read due to various additions, which may indicate that some refactoring is overdue.

The Loki pragmas to specify certain injection locations is due to the fact that variables that compose the stack size calculation may not be initialised when the stack size is calculated. For example, the original implementation could (and indeed produces something like this for ecphys), causing either compile time or runtime errors ...

subroutine driver(...)
  type(dimension) :: dims 
  ISTZ = ... klon ...
  ALLOCATE(ZSTACK(ISTSZ)) 
  call kernel(dims%klon, ...)
end subroutine driver

subroutine kernel(klon, ...)
  integer, intent(in) ::  klon
  real temp(klon)
end subroutine kernel()

The simplest solution is to just define klon as variable in the driver and assign the value dims%klon to it in the body of the driver routine. However, this wouldn't solve the problem after all, as the calculation of the stack size would be inserted above klon = dims%klon, which can be solved by having the Loki pragma !$loki stack-insert.

Of course, this can be further automated, however multiple cases can occur, amongst others:

variable is a module variable, thus the import would need to be recreated at the driver level
variable is a derived type entry or rather more general, variable is renamed in the kernel level's and this renaming has to be resolved somehow
variable is a parameter definition at the kernel level and would need to be recreated at the driver level
...

The injection location for the stack argument seems to require some intricate index fuffing. I can't remember the exact details of the offline discussion but I think the problem is generic enough to justify making this available as a utility transformation.

True. We would/should have some add_args(), remove_args(), check_args() for call statements. However, in our previous offline discussion we assumed to have the relevant information already in the actual routine information, which is not necessarily the case for multi-library Loki transformation. This makes the actual implementation more difficult or at least something that needs to be discussed.

The testing of the pool allocator transformation has become quite messy. I think it would be good to not convolute too many special cases and variations into the same test and instead replicate the tests and check the various cases separately. We can use fixtures to share commonalitites between the tests to avoid redundant code.

Yes 👍

awnawab · 2023-09-25T09:47:29Z

transformations/transformations/pool_allocator.py

@@ -292,15 +298,22 @@ def _get_stack_storage_and_size_var(self, routine, stack_size):
            stack_type_bytes = Cast(name='REAL', expression=Literal(1), kind=_kind)
            stack_type_bytes = InlineCall(Variable(name='C_SIZEOF'),
                                          parameters=as_tuple(stack_type_bytes))
+            stack_type_bytes = InlineCall(function=Variable(name='MAX'),


I believe this was added to ensure that each entry in the stack is aligned to a 32-byte boundary. I think this only works if the array has at least has 4 entries. This is a much safer assumption than relying on the size being a multiple of NPROMA, but nevertheless something to be aware of.

Also, this should be left on by default but configurable so that for kernels where we know that we don't need to add a floor to the stack entries we can disable it and have a smaller stack size.

reuterbal · 2024-02-27T14:06:24Z

Superseeded by #222

mlange05 added 3 commits August 11, 2023 10:39

SCC: Add logging of private array markers in vector loop annotations

6cd69eb

SCC: Add logging of vector-within-seq markers, which indicate issues

4d3d827

SCC/Transform: Harmonize logging nomenclature slightly...

c75cbea

MichaelSt98 requested review from mlange05 and reuterbal August 30, 2023 13:24

SCC-STACK for ecphys

8641294

MichaelSt98 force-pushed the nams_ecphys_scc_stack branch from ebe7bd1 to 8641294 Compare August 30, 2023 13:31

reuterbal requested changes Sep 1, 2023

View reviewed changes

awnawab reviewed Sep 25, 2023

View reviewed changes

mlange05 mentioned this pull request Jan 29, 2024

Enabling SCC-Stack for EC-physics, part 2 #222

Merged

reuterbal closed this Feb 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SCC-STACK for ecphys #135

SCC-STACK for ecphys #135

MichaelSt98 commented Aug 30, 2023 •

edited

Loading

github-actions bot commented Aug 30, 2023

codecov bot commented Aug 30, 2023 •

edited

Loading

reuterbal left a comment •

edited

Loading

reuterbal Sep 1, 2023 •

edited

Loading

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

reuterbal Sep 1, 2023

MichaelSt98 commented Sep 4, 2023

awnawab Sep 25, 2023

reuterbal commented Feb 27, 2024

		parameters=(stack_size, stack_type_bytes)),
		'==', Literal(0))), inline=True, body=(padding,),

SCC-STACK for ecphys #135

SCC-STACK for ecphys #135

Conversation

MichaelSt98 commented Aug 30, 2023 • edited Loading

github-actions bot commented Aug 30, 2023

codecov bot commented Aug 30, 2023 • edited Loading

Codecov Report

reuterbal left a comment • edited Loading

Choose a reason for hiding this comment

reuterbal Sep 1, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelSt98 commented Sep 4, 2023

Choose a reason for hiding this comment

reuterbal commented Feb 27, 2024

MichaelSt98 commented Aug 30, 2023 •

edited

Loading

codecov bot commented Aug 30, 2023 •

edited

Loading

reuterbal left a comment •

edited

Loading

reuterbal Sep 1, 2023 •

edited

Loading