-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SCC-STACK for ecphys #135
SCC-STACK for ecphys #135
Conversation
Documentation for this branch can be viewed at https://sites.ecmwf.int/docs/loki/135/index.html |
ebe7bd1
to
8641294
Compare
Codecov Report
@@ Coverage Diff @@
## main #135 +/- ##
==========================================
- Coverage 92.14% 92.01% -0.14%
==========================================
Files 89 89
Lines 16563 16599 +36
==========================================
+ Hits 15262 15273 +11
- Misses 1301 1326 +25
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks, this is really a huge step forward and it's fantastic that this works.
However, I think before we bring this in the actual implementation could be improved in a number of ways. I've flagged them in individual comments but will repeat here the most important points in my eyes:
- This has a hard requirement on providing per-transformation config values in the toml configuration file. I think now is the right moment to take care of Generic approach for transformation specific settings in SchedulerConfig #28 instead of using the implemented workaround.
- Some of the implementation details are very opaque. It is entirely unclear to me why there is a need for the loki pragmas to specify certain injection locations. The entire transformation has become a little hard to read due to various additions, which may indicate that some refactoring is overdue.
- The injection location for the stack argument seems to require some intricate index fuffing. I can't remember the exact details of the offline discussion but I think the problem is generic enough to justify making this available as a utility transformation.
- The testing of the pool allocator transformation has become quite messy. I think it would be good to not convolute too many special cases and variations into the same test and instead replicate the tests and check the various cases separately. We can use fixtures to share commonalitites between the tests to avoid redundant code.
@@ -55,6 +55,13 @@ def __init__(self, default, routines, disable=None, dimensions=None, dic2p=None, | |||
self.routines = CaseInsensitiveDict(routines) | |||
else: | |||
self.routines = CaseInsensitiveDict((r.name, r) for r in as_tuple(routines)) | |||
for routine in self.routines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a subtle way of providing transformation config via the config file by storing it as values to a routine with name trafo
? I would argue that we should implement #28 and thus have a proper way of providing such configs rather than using a hacky kind of implementation like this.
definitions=definitions) | ||
stack_size = "400*nz*nlon" | ||
|
||
if stack_size_config == "dict": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for readability it would be better to separate this into different tests.
@@ -121,6 +122,7 @@ def __init__(self, block_dim, | |||
self.local_ptr_var_name_pattern = local_ptr_var_name_pattern | |||
self.directive = directive | |||
self.check_bounds = check_bounds | |||
self.trafo_dict = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really need to be a property on the Transformation class object? It gets overwritten for every item/subroutine that is being processed, so shouldn't really be stored as a state on the object.
parameters=(stack_size, stack_type_bytes)), | ||
'==', Literal(0))), inline=True, body=(padding,), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something odd is going on with formatting here
parameters=(stack_type_bytes, Literal(8)), kw_parameters=()) | ||
if "stack-size" in self.trafo_dict: | ||
scope = Scope() | ||
#TODO: parse_fparser_expression only for fparser ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue it's acceptable to have a hard dependency on this here
@@ -345,12 +358,24 @@ def _get_stack_storage_and_size_var(self, routine, stack_size): | |||
if variables_append: | |||
routine.variables += as_tuple(variables_append) | |||
if body_prepend: | |||
routine.body.prepend(body_prepend) | |||
if not self._insert_stack_at_loki_pragma(routine, body_prepend): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a requirement for this pragma? Can the injection point lookup be improved instead?
@@ -13,7 +13,8 @@ | |||
SymbolAttributes, BasicType, DerivedType, Quotient, IntLiteral, LogicLiteral, | |||
Variable, Array, Sum, Literal, Product, InlineCall, Comparison, RangeIndex, Cast, | |||
Intrinsic, Assignment, Conditional, CallStatement, Import, Allocation, Deallocation, is_dimension_constant, | |||
Loop, Pragma, SubroutineItem, FindInlineCalls, Interface, ProcedureSymbol, LogicalNot, dataflow_analysis_attached | |||
Loop, Pragma, SubroutineItem, FindInlineCalls, Interface, ProcedureSymbol, LogicalNot, dataflow_analysis_attached, | |||
expression, parse_fparser_expression, Scope |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to either import the relevant expressions directly (e.g. RangeIndex
) or to use the elsewhere commonly used pattern import loki.expression.symbols as sym
""" | ||
Add the pool allocator argument into subroutine calls | ||
""" | ||
call_map = {} | ||
stack_var = self._get_local_stack_var(routine) | ||
for call in FindNodes(CallStatement).visit(routine.body): | ||
if call.name in targets: | ||
if call.name in targets or call.routine.name.lower() in ignore: #['surfpp', 'surfexcdriver', 'surfseb']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug leftover?
@@ -129,7 +131,11 @@ def transform_subroutine(self, routine, **kwargs): | |||
|
|||
role = kwargs['role'] | |||
item = kwargs.get('item', None) | |||
ignore = item.ignore if item else () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't ignored routines be excluded from targets?
if call.routine != BasicType.DEFERRED and self.stack_local_var_name not in call.arguments: | ||
if self.stack_argument_name in call.routine.arguments: | ||
arg_idx = call.routine.arguments.index(self.stack_argument_name) | ||
arguments = call.arguments | ||
kwarguments = call.kwarguments | ||
if arg_idx <= len(arguments): | ||
arguments = arguments[:arg_idx] + (stack_var,) + arguments[arg_idx:] | ||
else: | ||
arg_idx -= len(arguments) | ||
kwarguments = kwarguments[:arg_idx] + ((self.stack_argument_name, stack_var),) + \ | ||
kwarguments[arg_idx:] | ||
call_map[call] = call.clone(arguments=arguments, kwarguments=kwarguments) | ||
else: | ||
arguments = call.arguments | ||
kwarguments = call.kwarguments | ||
arg_pos = [call.routine.arguments.index(arg) for arg in call.routine.arguments | ||
if arg.type.optional] | ||
if arg_pos: | ||
if arg_pos[0] < len(arguments): | ||
arguments = arguments[:arg_pos[0]] + (stack_var,) + arguments[arg_pos[0]:] | ||
else: | ||
arg_idx = arg_pos[0] - len(arguments) | ||
kwarguments = kwarguments[:arg_idx] + ((self.stack_argument_name, stack_var),) + \ | ||
kwarguments[arg_idx:] | ||
else: | ||
if kwarguments: | ||
kwarguments += ((self.stack_argument_name, stack_var),) | ||
else: | ||
arguments += (stack_var,) | ||
call_map[call] = call.clone(arguments=as_tuple(arguments), kwarguments=as_tuple(kwarguments)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's going on here? I suspect this has to do with picking the right injection point when there are keyword arguments present?
That index-fuffing is a bit hard to follow, could this be made more elegant or maybe even outlined into a utility routine (because I think we discussed this interface update problem in a wider context before)?
Thanks @reuterbal for having a look and thanks for the comments.
I agree. However, for this transformation we need transformation and routine specific config entries. Therefore it's (at least in my opinion) up for discussion what the best way is to do this or rather whether it is like described in Generic approach for transformation specific settings in SchedulerConfig #28.
The Loki pragmas to specify certain injection locations is due to the fact that variables that compose the stack size calculation may not be initialised when the stack size is calculated. For example, the original implementation could (and indeed produces something like this for ecphys), causing either compile time or runtime errors ... subroutine driver(...)
type(dimension) :: dims
ISTZ = ... klon ...
ALLOCATE(ZSTACK(ISTSZ))
call kernel(dims%klon, ...)
end subroutine driver
subroutine kernel(klon, ...)
integer, intent(in) :: klon
real temp(klon)
end subroutine kernel() The simplest solution is to just define Of course, this can be further automated, however multiple cases can occur, amongst others:
True. We would/should have some
Yes 👍 |
@@ -292,15 +298,22 @@ def _get_stack_storage_and_size_var(self, routine, stack_size): | |||
stack_type_bytes = Cast(name='REAL', expression=Literal(1), kind=_kind) | |||
stack_type_bytes = InlineCall(Variable(name='C_SIZEOF'), | |||
parameters=as_tuple(stack_type_bytes)) | |||
stack_type_bytes = InlineCall(function=Variable(name='MAX'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this was added to ensure that each entry in the stack is aligned to a 32-byte boundary. I think this only works if the array has at least has 4 entries. This is a much safer assumption than relying on the size being a multiple of NPROMA, but nevertheless something to be aware of.
Also, this should be left on by default but configurable so that for kernels where we know that we don't need to add a floor to the stack entries we can disable it and have a smaller stack size.
Superseeded by #222 |
Implements/Complements SCC Stack transformation for the ecphys.
!$loki stack-insert
to define stack size assignment location in the code, necessary to allow variables used in the stack size expression to be initialised