\documentclass[11pt]{article}
\usepackage{graphicx}
\usepackage{cite}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{url}
\def\UrlBreaks{\do\/\do-}
\lstset{language=Fortran,
basicstyle=\small\ttfamily,
%identifierstyle=\color{green},
numbers=left,
commentstyle=\color{red},
keywordstyle=\color{blue},
showstringspaces=false
}
\author{S.~V.~Adams, S.~J.~Cusworth, C.~M.~Maynard, S.~Mullerworth}
\title{Report on LFRic computational performance 2019}
\date{\today}
\begin{document}
\maketitle
\medskip
\section{Introduction\label{sec:intro}}
The LFRic infrastructure is designed to host a dynamical core (Gung Ho)
that is highly scalable across a distributed-memory
computer. This parallelism is typically expressed using MPI. LFRic is also designed
to accommodate different programming models to target different
processor architectures. Currently, OpenMP 2.0 is supported for shared-memory
parallelism on CPUs. This is being extended to support OpenACC
in order to target both shared-memory parallelism and Instruction Level
Parallelism (ILP) on GPUs. Moreover, the infrastructure has been
extended to support parallel, asynchronous I/O using the XIOS
client/server framework.

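As an illustration of the directive-based programming models mentioned
above, the sketch below shows the same column loop annotated first for
OpenMP threading on a CPU and then for OpenACC offload to a GPU. This is
a minimal sketch only; the routine name, the variable names and the loop
body are illustrative assumptions and are not taken from the LFRic source.
\begin{lstlisting}
subroutine step_columns(ncolumns, nlevels, theta)
  ! Illustrative only: names and loop body are hypothetical.
  implicit none
  integer,      intent(in)    :: ncolumns, nlevels
  real(kind=8), intent(inout) :: theta(nlevels, ncolumns)
  integer :: k

  ! OpenMP: shared-memory threading over the horizontal mesh (CPU)
  !$omp parallel do default(shared) private(k) schedule(static)
  do k = 1, ncolumns
    theta(:, k) = theta(:, k) + 1.0d0
  end do
  !$omp end parallel do

  ! OpenACC: the same loop offloaded to a GPU
  !$acc parallel loop
  do k = 1, ncolumns
    theta(:, k) = theta(:, k) + 1.0d0
  end do
  !$acc end parallel loop
end subroutine step_columns
\end{lstlisting}
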
Other developments include the abstract solver API, which allows great
flexibility in constructing different solvers; redundant computation
into halo regions, which boosts the efficiency of shared-memory
parallelism and reduces the amount of communication; and a multigrid
solver, which enables a large reduction in the cost of global communication.
Much of the model infrastructure is described in~\cite{LFRic}.

The LFRic model uses PSyclone to generate the Parallel System, or PSy,
layer. Here, data parallelism across the horizontal mesh is used,
exploiting both distributed- and shared-memory parallelism. This
exploits domain-specific information, known to the science developer
and encoded as metadata in the kernel layer. However, to
fully exploit the ILP of more parallel architectures such as GPUs, it
is necessary to extend PSyclone to fully parse individual Fortran
statements in the kernel layer and then, for example, annotate the
source code with directives such as OpenACC or OpenMP. This work is just beginning;
in the meantime, the micro-benchmark suite~\cite{lfric-microbenchmarks} has been
developed so that compute-intensive kernels can be used to explore which kernel
optimisations are worthwhile and could subsequently be generated by PSyclone in the full
model. The PSyclone Kernel Extractor (PSKE) is being developed to
allow the automated extraction of kernels, or more specifically a
driver together with a dump file containing the data necessary to loop
over the horizontal mesh and degrees of freedom, without any of the
LFRic infrastructure.

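To make the layered structure concrete, the sketch below shows the shape
of a PSy-layer loop: the code loops over cells of the horizontal mesh,
looks up each cell's degrees of freedom through a dofmap, and calls a
user-written kernel, with an OpenMP directive applied to the cell loop.
This is a hand-written sketch under those assumptions; the argument names,
module name and kernel interface are illustrative and do not reproduce the
code that PSyclone actually generates.
\begin{lstlisting}
subroutine invoke_example(ncells, ndf, undf, dofmap, field)
  ! Hand-written sketch of a PSy-layer style loop; names are
  ! illustrative and not generated by PSyclone.
  use example_kernel_mod, only : example_kernel_code
  implicit none
  integer,      intent(in)    :: ncells, ndf, undf
  integer,      intent(in)    :: dofmap(ndf, ncells)
  real(kind=8), intent(inout) :: field(undf)
  integer :: cell

  ! Shared-memory data parallelism over the horizontal mesh
  !$omp parallel do default(shared) private(cell)
  do cell = 1, ncells
    call example_kernel_code(ndf, undf, dofmap(:, cell), field)
  end do
  !$omp end parallel do
end subroutine invoke_example
\end{lstlisting}
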
LFRic uses Fortran 2003 object-oriented programming to support many of the
features described above. This is a very powerful programming
style which aids software development by allowing a clear separation
of concerns between different areas of the code, promoting code re-use
and simplifying development by disentangling dependencies. However, the lack
of compiler support, or more precisely the proliferation of compiler
bugs which prevent the use of particular compilers, either without specific
code work-arounds or at all, is a major problem. Development has
been severely delayed whilst compiler bugs have been isolated,
reported and work-arounds sought.

The baroclinic wave test is used to measure the performance of the
model. The solver dominates the run-time, and its main computational
costs are local communication in the form of halo exchanges, global
communication in the form of global sums, and the user code in the
form of kernels for the various matrix-vector operations. Any science
computation will also require the output of various fields, so I/O
forms a significant cost as well.

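As an indication of the kind of user code involved, the sketch below
shows a cell-local matrix-vector product of the form that dominates the
solver's kernel cost: the cell's local matrix is applied to the input
field through indirection maps and accumulated into the output field.
It is a simplified, hand-written sketch; the argument names and interface
are assumptions for illustration and are not the LFRic kernel itself.
\begin{lstlisting}
subroutine matrix_vector_cell(ndf1, ndf2, undf1, undf2, map1, map2, &
                              local_matrix, x, lhs)
  ! Simplified sketch of a cell-local matrix-vector product;
  ! names and interface are illustrative, not the LFRic kernel.
  implicit none
  integer,      intent(in)    :: ndf1, ndf2, undf1, undf2
  integer,      intent(in)    :: map1(ndf1), map2(ndf2)
  real(kind=8), intent(in)    :: local_matrix(ndf1, ndf2), x(undf2)
  real(kind=8), intent(inout) :: lhs(undf1)
  integer :: df1, df2

  ! Gather x through map2, apply the local matrix, scatter into lhs
  do df2 = 1, ndf2
    do df1 = 1, ndf1
      lhs(map1(df1)) = lhs(map1(df1)) &
                     + local_matrix(df1, df2) * x(map2(df2))
    end do
  end do
end subroutine matrix_vector_cell
\end{lstlisting}
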
The strategy for obtaining computational performance in LFRic has four main
strands, one addressing each of these costs. I/O performance is achieved by
using the XIOS capability for asynchronous parallel I/O in order to
hide the cost of writing data. Compute performance is achieved by
enabling scalability, allowing many agents to do computation; this
comes at the cost of communication. Local communication is reduced by
employing a communication-avoiding algorithm, redundant
computation into the halos, and by exploiting shared-memory threaded
parallelism. The geometric multigrid pre-conditioner to the Helmholtz
solver avoids many global sums by reducing the need for an iterative
solver. The {\em on-node} compute performance of the LFRic code itself
is examined by considering the performance of computationally
intensive kernels on different processor architectures; the LFRic
micro-benchmark suite can be used to develop architecture-specific
optimisations. The report is organised into sections on each of these
four strands, {\em viz.} I/O, local communication, global
communication and architecture-specific optimisations.
\include{scaling}
\include{multigrid}
\include{annexed_dofs}
\include{io_section}
\include{PA}
\include{PA-cma_microbenchmark}
\section{Conclusions}
The I/O performance of XIOS has been examined and, whilst the results are
promising, network variability makes definitive statements difficult. Several
avenues of further work have been suggested.

The cost of halo exchange for the OpenMP version appears to have
increased dramatically since the Fallow Deer release. Network
variability may play a role in this change in behaviour, and much
more investigation is required to find the cause of this issue. However, the
annexed-dofs transformation does reduce the number of halo exchanges,
which reduces this cost.

The multigrid pre-conditioner shows a large reduction in the total
amount of work; moreover, the reduction in the cost of the global sums will
boost the scaling performance as well as reducing the total cost.

The micro-benchmark suite has been used to explore the performance of
individual kernels on different processor architectures. The
performance of external libraries and of code optimisations has been
explored, and these results can act as a guide for the type of automatic
optimisations that will need to be developed in PSyclone.

For all of these performance issues, further work and analysis are
required.
\bibliographystyle{unsrt}
\bibliography{refs}
\end{document}