\subsection{\label{sec:multigrid}Multigrid}
The multigrid algorithm implemented here is a four-level V-cycle over
the horizontal mesh, used to pre-condition the Helmholtz (pressure)
solver. In fact, the pre-conditioned pressure field is taken as the
``solved'' outcome, so no Krylov solver is applied to the pressure
system; the outer solver for the mixed system uses the GCR
algorithm. The algorithmic details are not discussed here, but a
Krylov sub-space solver performs several global sums per
iteration. Since multiple pressure solves are required per outer
iteration, each itself comprising many iterations, employing multigrid
in this manner removes several thousand global sums per time-step.
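
For illustration only, the sketch below shows the structure of a
multigrid V-cycle applied directly as a solver, here for a
one-dimensional Poisson problem with weighted-Jacobi smoothing,
full-weighting restriction and linear prolongation. These choices of
problem, smoother and transfer operators are generic textbook ones and
are not those of the model code; the sketch is intended only to convey
why a fixed number of levels and a single cycle per application involve
no inner products, and hence none of the global sums a Krylov iteration
requires.
\begin{verbatim}
# Illustrative four-level V-cycle for a 1D Poisson problem (-u'' = f,
# homogeneous Dirichlet boundaries).  Generic textbook components; not
# the smoothers or transfer operators of the model discussed here.
import numpy as np

def residual(u, f, h):
    # r = f - A u for the standard second-order finite-difference A
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / h**2
    return r

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    # weighted-Jacobi smoothing sweeps
    for _ in range(sweeps):
        v = u.copy()
        v[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (
            u[:-2] + u[2:] + h**2 * f[1:-1])
        u = v
    return u

def restrict(r):
    # full-weighting restriction to the next coarser grid
    return np.concatenate(([0.0],
        0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2], [0.0]))

def prolong(uc):
    # linear interpolation back to the next finer grid
    uf = np.zeros(2 * (len(uc) - 1) + 1)
    uf[::2] = uc
    uf[1::2] = 0.5 * (uc[:-1] + uc[1:])
    return uf

def v_cycle(u, f, h, level=0, n_levels=4):
    u = smooth(u, f, h)                      # pre-smoothing
    if level < n_levels - 1:
        rc = restrict(residual(u, f, h))     # restrict the residual
        ec = v_cycle(np.zeros_like(rc), rc, 2.0 * h, level + 1, n_levels)
        u = u + prolong(ec)                  # coarse-grid correction
    else:
        # direct solve of the small tridiagonal system on the coarsest level
        m = len(u) - 2
        A = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2
        u[1:-1] = np.linalg.solve(A, f[1:-1])
    return smooth(u, f, h)                   # post-smoothing

if __name__ == "__main__":
    n = 64                                   # four levels: 64 -> 32 -> 16 -> 8 cells
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    f = np.pi**2 * np.sin(np.pi * x)         # exact solution is sin(pi*x)
    u = np.zeros(n + 1)
    for _ in range(10):                      # a handful of V-cycles suffice
        u = v_cycle(u, f, h)
    print("max error:", np.abs(u - np.sin(np.pi * x)).max())
\end{verbatim}
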
Neither the change to the vertical pre-conditioner nor the multigrid
solver is yet on the trunk. There is nothing to suggest that they
require significant alteration, but the process of getting them onto
the trunk has yet to be completed. The multigrid performance results
presented here should therefore be regarded as preliminary.
\begin{table}
\centering
\caption{\label{tab:MG_data}Wall-clock time in seconds for execution
of the whole code (WC), user code (U), halo exchange (HE) and global
sums (GS). T17483 denotes the trunk at revision 17483 and MG-B the
CSAR branch; ``Krylov'' uses a Krylov solver and ``MG'' the multigrid
pre-conditioner as the solver.
}
\begin{tabular}{r|rrrr}
version & WC & U & HE & GS \\\hline
T17483 & $5398$ & $1851$ & $1333$ & $1657$ \\
MG-B Krylov & $3387$ & $1087$ & $847$ & $1080$ \\
MG-B MG & $1337$ & $504$ & $388$ & $199$ \\\hline
\end{tabular}
\end{table}
\begin{figure}[ht!]
\centering\includegraphics[width=1.0\linewidth]{figs/mg-improvement.eps}
\caption{\label{fig:mg}Wall-clock time for model components of the
Baroclinic Wave test. ``T'' denotes the trunk at revision 17483,
``Kr'' the CSAR branch with the Krylov sub-space solver and ``Mg''
the CSAR branch with the multigrid pre-conditioner as the solver.}
\end{figure}
Shown in Figure~\ref{fig:mg}
are the profiling data from CrayPAT for three different model versions
running the Baroclinic Wave test on a C1152 mesh, which corresponds to
a resolution of roughly $9$\,km, with a $120$\,s time-step for $100$
time-steps. The models were run on $384$ nodes with $6$ MPI ranks per
node and $6$ OpenMP threads per MPI rank, a total of $2304$ MPI
ranks. The CSAR branch is \verb+r17264_CSAR19_gungho@17480+. The data
are also listed in Table~\ref{tab:MG_data} for detailed analysis and
reference.
Comparing the trunk with the CSAR branch, both using the Krylov
solver, shows how efficient the modified vertical pre-conditioner is:
it leaves far less work for the solver to do, in both computation and
communication. Using the multigrid pre-conditioner as the solver has a
further dramatic effect. Combined, the two changes result in a
speed-up of roughly $4\times$ in wall-clock time. Whilst this is a
very welcome performance boost, it is the effect on the communication
costs which is of greater benefit. The cost of the local halo-exchange
communication is reduced by roughly $3.4\times$, and the reduction in
the cost of the global sums is more dramatic still, at more than
$8\times$.
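These factors follow directly from Table~\ref{tab:MG_data}, comparing
the trunk (T17483) with the CSAR branch using multigrid as the solver
(MG-B MG):
\begin{equation*}
\frac{5398}{1337}\approx 4.0 \;\text{(WC)}, \qquad
\frac{1333}{388}\approx 3.4 \;\text{(HE)}, \qquad
\frac{1657}{199}\approx 8.3 \;\text{(GS)}.
\end{equation*}
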
The global sum is latency bound, so its cost increases with the number
of MPI ranks taking part. Performing far fewer global sums reduces
this cost directly and, because each individual sum becomes more
expensive as the rank count grows, the benefit becomes larger at
scale. This reduction in cost should allow the model to scale
efficiently to a much larger number of MPI ranks. Network adaptivity
may account for some of the measured reduction, or may indeed have
lessened it. However, as the global sum is quite susceptible to
network adaptivity, reducing the number of global sums should also
reduce the variation in run-times.
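As a rough illustration only, and not a measured model of the machine,
a tree-based reduction gives each global sum a latency cost that grows
logarithmically with the number of participating ranks $P$, so the
per-time-step cost may be sketched as
\begin{equation*}
T_{\mathrm{GS}} \;\approx\; N_{\mathrm{GS}}\,\alpha\,\log_2 P ,
\end{equation*}
where $N_{\mathrm{GS}}$ denotes the number of global sums per
time-step and $\alpha$ an effective per-stage latency; both symbols
are introduced here purely for illustration. Under this sketch,
reducing $N_{\mathrm{GS}}$ pays an increasing dividend as $P$ grows.
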
Future work will include a proper scaling analysis to a much higher
node count once multigrid is on the trunk, and further tuning of the
algorithms may yield additional improvements. These preliminary
results are, however, very promising.