Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SCC-STACK for ecphys #135

Closed
wants to merge 4 commits into from
Closed

SCC-STACK for ecphys #135

wants to merge 4 commits into from

Conversation

MichaelSt98
Copy link
Collaborator

@MichaelSt98 MichaelSt98 commented Aug 30, 2023

Note: this PR stack on top of #125 and should only be reviewed once this is merged.

Implements/Complements SCC Stack transformation for the ecphys.

  • allows for stack size definition in the config file (per routine), which is currently the solution for multi-library targets
  • allows for pragma !$loki stack-insert to define stack size assignment location in the code, necessary to allow variables used in the stack size expression to be initialised

@github-actions
Copy link

Documentation for this branch can be viewed at https://sites.ecmwf.int/docs/loki/135/index.html

@codecov
Copy link

codecov bot commented Aug 30, 2023

Codecov Report

Merging #135 (8641294) into main (afe0c0e) will decrease coverage by 0.14%.
Report is 13 commits behind head on main.
The diff coverage is 73.97%.

@@            Coverage Diff             @@
##             main     #135      +/-   ##
==========================================
- Coverage   92.14%   92.01%   -0.14%     
==========================================
  Files          89       89              
  Lines       16563    16599      +36     
==========================================
+ Hits        15262    15273      +11     
- Misses       1301     1326      +25     
Flag Coverage Δ
lint_rules 96.00% <ø> (+0.04%) ⬆️
loki 92.01% <44.44%> (-0.09%) ⬇️
transformations 91.09% <78.12%> (-0.57%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
transformations/transformations/argument_shape.py 92.15% <ø> (ø)
loki/bulk/scheduler.py 80.15% <28.57%> (-0.94%) ⬇️
transformations/transformations/pool_allocator.py 92.03% <75.00%> (-3.86%) ⬇️
loki/expression/mappers.py 92.97% <100.00%> (-0.84%) ⬇️
loki/transform/transform_array_indexing.py 93.44% <100.00%> (ø)
...mations/transformations/single_column_coalesced.py 97.97% <100.00%> (+0.04%) ⬆️

... and 7 files with indirect coverage changes

Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks, this is really a huge step forward and it's fantastic that this works.

However, I think before we bring this in the actual implementation could be improved in a number of ways. I've flagged them in individual comments but will repeat here the most important points in my eyes:

  1. This has a hard requirement on providing per-transformation config values in the toml configuration file. I think now is the right moment to take care of Generic approach for transformation specific settings in SchedulerConfig #28 instead of using the implemented workaround.
  2. Some of the implementation details are very opaque. It is entirely unclear to me why there is a need for the loki pragmas to specify certain injection locations. The entire transformation has become a little hard to read due to various additions, which may indicate that some refactoring is overdue.
  3. The injection location for the stack argument seems to require some intricate index fuffing. I can't remember the exact details of the offline discussion but I think the problem is generic enough to justify making this available as a utility transformation.
  4. The testing of the pool allocator transformation has become quite messy. I think it would be good to not convolute too many special cases and variations into the same test and instead replicate the tests and check the various cases separately. We can use fixtures to share commonalitites between the tests to avoid redundant code.

@@ -55,6 +55,13 @@ def __init__(self, default, routines, disable=None, dimensions=None, dic2p=None,
self.routines = CaseInsensitiveDict(routines)
else:
self.routines = CaseInsensitiveDict((r.name, r) for r in as_tuple(routines))
for routine in self.routines:
Copy link
Collaborator

@reuterbal reuterbal Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a subtle way of providing transformation config via the config file by storing it as values to a routine with name trafo? I would argue that we should implement #28 and thus have a proper way of providing such configs rather than using a hacky kind of implementation like this.

definitions=definitions)
stack_size = "400*nz*nlon"

if stack_size_config == "dict":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for readability it would be better to separate this into different tests.

@@ -121,6 +122,7 @@ def __init__(self, block_dim,
self.local_ptr_var_name_pattern = local_ptr_var_name_pattern
self.directive = directive
self.check_bounds = check_bounds
self.trafo_dict = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to be a property on the Transformation class object? It gets overwritten for every item/subroutine that is being processed, so shouldn't really be stored as a state on the object.

Comment on lines +315 to +316
parameters=(stack_size, stack_type_bytes)),
'==', Literal(0))), inline=True, body=(padding,),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something odd is going on with formatting here

parameters=(stack_type_bytes, Literal(8)), kw_parameters=())
if "stack-size" in self.trafo_dict:
scope = Scope()
#TODO: parse_fparser_expression only for fparser ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue it's acceptable to have a hard dependency on this here

@@ -345,12 +358,24 @@ def _get_stack_storage_and_size_var(self, routine, stack_size):
if variables_append:
routine.variables += as_tuple(variables_append)
if body_prepend:
routine.body.prepend(body_prepend)
if not self._insert_stack_at_loki_pragma(routine, body_prepend):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a requirement for this pragma? Can the injection point lookup be improved instead?

@@ -13,7 +13,8 @@
SymbolAttributes, BasicType, DerivedType, Quotient, IntLiteral, LogicLiteral,
Variable, Array, Sum, Literal, Product, InlineCall, Comparison, RangeIndex, Cast,
Intrinsic, Assignment, Conditional, CallStatement, Import, Allocation, Deallocation, is_dimension_constant,
Loop, Pragma, SubroutineItem, FindInlineCalls, Interface, ProcedureSymbol, LogicalNot, dataflow_analysis_attached
Loop, Pragma, SubroutineItem, FindInlineCalls, Interface, ProcedureSymbol, LogicalNot, dataflow_analysis_attached,
expression, parse_fparser_expression, Scope
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to either import the relevant expressions directly (e.g. RangeIndex) or to use the elsewhere commonly used pattern import loki.expression.symbols as sym

"""
Add the pool allocator argument into subroutine calls
"""
call_map = {}
stack_var = self._get_local_stack_var(routine)
for call in FindNodes(CallStatement).visit(routine.body):
if call.name in targets:
if call.name in targets or call.routine.name.lower() in ignore: #['surfpp', 'surfexcdriver', 'surfseb']:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug leftover?

@@ -129,7 +131,11 @@ def transform_subroutine(self, routine, **kwargs):

role = kwargs['role']
item = kwargs.get('item', None)
ignore = item.ignore if item else ()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't ignored routines be excluded from targets?

Comment on lines +711 to +740
if call.routine != BasicType.DEFERRED and self.stack_local_var_name not in call.arguments:
if self.stack_argument_name in call.routine.arguments:
arg_idx = call.routine.arguments.index(self.stack_argument_name)
arguments = call.arguments
kwarguments = call.kwarguments
if arg_idx <= len(arguments):
arguments = arguments[:arg_idx] + (stack_var,) + arguments[arg_idx:]
else:
arg_idx -= len(arguments)
kwarguments = kwarguments[:arg_idx] + ((self.stack_argument_name, stack_var),) + \
kwarguments[arg_idx:]
call_map[call] = call.clone(arguments=arguments, kwarguments=kwarguments)
else:
arguments = call.arguments
kwarguments = call.kwarguments
arg_pos = [call.routine.arguments.index(arg) for arg in call.routine.arguments
if arg.type.optional]
if arg_pos:
if arg_pos[0] < len(arguments):
arguments = arguments[:arg_pos[0]] + (stack_var,) + arguments[arg_pos[0]:]
else:
arg_idx = arg_pos[0] - len(arguments)
kwarguments = kwarguments[:arg_idx] + ((self.stack_argument_name, stack_var),) + \
kwarguments[arg_idx:]
else:
if kwarguments:
kwarguments += ((self.stack_argument_name, stack_var),)
else:
arguments += (stack_var,)
call_map[call] = call.clone(arguments=as_tuple(arguments), kwarguments=as_tuple(kwarguments))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's going on here? I suspect this has to do with picking the right injection point when there are keyword arguments present?
That index-fuffing is a bit hard to follow, could this be made more elegant or maybe even outlined into a utility routine (because I think we discussed this interface update problem in a wider context before)?

@MichaelSt98
Copy link
Collaborator Author

Many thanks, this is really a huge step forward and it's fantastic that this works.

However, I think before we bring this in the actual implementation could be improved in a number of ways. I've flagged them in individual comments but will repeat here the most important points in my eyes:

Thanks @reuterbal for having a look and thanks for the comments.

  1. This has a hard requirement on providing per-transformation config values in the toml configuration file. I think now is the right moment to take care of Generic approach for transformation specific settings in SchedulerConfig #28 instead of using the implemented workaround.

I agree. However, for this transformation we need transformation and routine specific config entries. Therefore it's (at least in my opinion) up for discussion what the best way is to do this or rather whether it is like described in Generic approach for transformation specific settings in SchedulerConfig #28.

  1. Some of the implementation details are very opaque. It is entirely unclear to me why there is a need for the loki pragmas to specify certain injection locations. The entire transformation has become a little hard to read due to various additions, which may indicate that some refactoring is overdue.

The Loki pragmas to specify certain injection locations is due to the fact that variables that compose the stack size calculation may not be initialised when the stack size is calculated. For example, the original implementation could (and indeed produces something like this for ecphys), causing either compile time or runtime errors ...

subroutine driver(...)
  type(dimension) :: dims 
  ISTZ = ... klon ...
  ALLOCATE(ZSTACK(ISTSZ)) 
  call kernel(dims%klon, ...)
end subroutine driver

subroutine kernel(klon, ...)
  integer, intent(in) ::  klon
  real temp(klon)
end subroutine kernel()

The simplest solution is to just define klon as variable in the driver and assign the value dims%klon to it in the body of the driver routine. However, this wouldn't solve the problem after all, as the calculation of the stack size would be inserted above klon = dims%klon, which can be solved by having the Loki pragma !$loki stack-insert.

Of course, this can be further automated, however multiple cases can occur, amongst others:

  • variable is a module variable, thus the import would need to be recreated at the driver level
  • variable is a derived type entry or rather more general, variable is renamed in the kernel level's and this renaming has to be resolved somehow
  • variable is a parameter definition at the kernel level and would need to be recreated at the driver level
  • ... 
  1. The injection location for the stack argument seems to require some intricate index fuffing. I can't remember the exact details of the offline discussion but I think the problem is generic enough to justify making this available as a utility transformation.

True. We would/should have some add_args(), remove_args(), check_args() for call statements. However, in our previous offline discussion we assumed to have the relevant information already in the actual routine information, which is not necessarily the case for multi-library Loki transformation. This makes the actual implementation more difficult or at least something that needs to be discussed.

  1. The testing of the pool allocator transformation has become quite messy. I think it would be good to not convolute too many special cases and variations into the same test and instead replicate the tests and check the various cases separately. We can use fixtures to share commonalitites between the tests to avoid redundant code.

Yes 👍

@@ -292,15 +298,22 @@ def _get_stack_storage_and_size_var(self, routine, stack_size):
stack_type_bytes = Cast(name='REAL', expression=Literal(1), kind=_kind)
stack_type_bytes = InlineCall(Variable(name='C_SIZEOF'),
parameters=as_tuple(stack_type_bytes))
stack_type_bytes = InlineCall(function=Variable(name='MAX'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was added to ensure that each entry in the stack is aligned to a 32-byte boundary. I think this only works if the array has at least has 4 entries. This is a much safer assumption than relying on the size being a multiple of NPROMA, but nevertheless something to be aware of.

Also, this should be left on by default but configurable so that for kernels where we know that we don't need to add a floor to the stack entries we can disable it and have a smaller stack size.

@reuterbal
Copy link
Collaborator

Superseeded by #222

@reuterbal reuterbal closed this Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants