
Commit 65c1b40

Some implementation details for Castro retries
1 parent b5e08eb commit 65c1b40


2 files changed (66 additions, 2 deletions)


retries/paper.tex

Lines changed: 30 additions & 2 deletions
@@ -62,7 +62,7 @@ \section{Introduction}\label{Sec:Introduction}
 \item Zones whose local velocity exceeds the CFL hydrodynamic stability criterion given the selected timestep
 \item Negative density in a zone after a hydrodynamic update
 \end{itemize}
-All of these failure modes (and more) are represented in the simulation code Castro \citep{castro_joss}.
+All of these failure modes (and more) are represented in the simulation code \castro\ \citep{castro_joss}.
 In our experience, any useful science simulation will almost inevitably hit one of these failure modes.
 
 There are (at least) three possible responses to such failures. First, the application can abort and tell the user
@@ -78,7 +78,35 @@ \section{Introduction}\label{Sec:Introduction}
 that encounters one of these failure modes. Typically one will use a smaller $\Delta t$ on the retry, although
 other options exist, such as modifying the spatial resolution when using an adaptive mesh. One astrophysics
 application that successfully does this, which we drew inspiration from for this work, is the stellar evolution
-code MESA \citep{MESA}.
+code MESA \citep{MESA}. MESA also has the notion of a ``backup'': a retry can be defined as a second attempt at
+the current timestep, while a backup involves returning to an earlier timestep.
+
+\section{Implementation Details}\label{Sec:Implementation}
+
+A functional retry mechanism requires a rule for deciding when to make a second attempt at a timestep, and the code
+infrastructure to carry out that attempt. A retry scheme effectively turns a timestep into a \texttt{while} loop:
+attempt an advance by $\Delta t$; if it fails, attempt an advance by $\Delta t / 2$ (halving again after any further
+failure), and keep taking substeps until you have reached $t_{i} + \Delta t$, where $t_{i}$ is the starting
+simulation time. In both principle and practice, then, a retry mechanism is quite straightforward to implement, with two complications, described next.
+
+First, depending on context, this scheme may require saving application state and restoring it later. For
+example, \castro\ is a multi-level mesh refinement code in which the fine levels are subcycled with respect to the
+coarse levels \citep{castro,berger_colella}, so fine levels typically need to interpolate coarse data between time
+levels $t_{i}$ and $t_{i} + \Delta t$, where $\Delta t$ is the coarse grid timestep. \castro\ does this by
+maintaining both an ``old'' state and a ``new'' state corresponding to those two time levels. But when retries
+force a level to subcycle within its own step, the ``old'' state data at the end of the coarse step will correspond
+to some later time, such as $t_{i} + 3\Delta t / 4$ in the case of two failures (where the final substep has size
+$\Delta t / 4$). This must be reset to the initial ``old'' state at $t_{i}$ for valid interpolation onto the fine
+grid. In \castro\ we handle this by making a backup copy of the data\footnote{For the GPU build of \castro, we save
+this backup copy in host pinned memory, since GPU memory is typically our limiting resource.} at $t_{i}$ if (and
+only if) we encounter an advance failure on the first try.
+
+Second, this scheme may require refactoring physics modules that make irrevocable changes to application
+state. One way to guarantee that the scheme works is to write the sequence of physics modules that make up an
+advance as an idempotent operator: given an input ``old'' state at $t_{i}$, the advance writes only to the ``new''
+state at $t_{i} + \Delta t$, and does so in such a way that re-running the timestep produces the same ``new'' state.
+This makes a retry scheme trivial: since the ``old'' state is never modified, if you detect a failure you can exit
+from the advance immediately and then retry it with a smaller timestep.
 
 %======================================================================
 % References
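A minimal Python sketch of the while-loop retry described in the new Implementation Details section follows. It illustrates the idea under stated assumptions and is not Castro's implementation (Castro is written in C++). The names advance_with_retries, toy_advance, and StepResult are hypothetical. The sketch also assumes that after a failure the halved substep is kept for the rest of the interval. The toy advance takes one forward-Euler step of du/dt = -k u and treats a negative result as a failure, in the spirit of the negative-density failure mode.

```python
# Hypothetical names throughout; this is an illustration, not Castro's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    new_state: float  # candidate "new" state; meaningful only if ok is True
    ok: bool          # False signals an advance failure


def toy_advance(old_state: float, dt: float, k: float = 3.0) -> StepResult:
    """Hypothetical advance: one forward-Euler step of du/dt = -k u.

    A step with k * dt > 1 drives u negative, which we treat as a failure
    (analogous to a negative density). The old state is only read."""
    new_state = old_state * (1.0 - k * dt)
    return StepResult(new_state, ok=new_state >= 0.0)


def advance_with_retries(advance, state, dt, max_failures=20):
    """Advance state from t_i to t_i + dt, halving the substep after each failure.

    Returns (final state, number of failures, offset from t_i at which the
    last successful substep started)."""
    remaining = dt   # time left until t_i + dt
    sub_dt = dt      # current substep size; stays reduced after a failure
    elapsed = 0.0    # time advanced so far, measured from t_i
    last_start = 0.0
    failures = 0
    while remaining > 0.0:
        sub_dt = min(sub_dt, remaining)  # never overshoot t_i + dt
        result = advance(state, sub_dt)
        if result.ok:
            last_start = elapsed
            state = result.new_state
            elapsed += sub_dt
            remaining -= sub_dt
        else:
            failures += 1
            if failures > max_failures:
                raise RuntimeError("advance still failing after repeated halving")
            sub_dt /= 2.0  # retry the same interval with half the substep
    return state, failures, last_start


if __name__ == "__main__":
    print(advance_with_retries(toy_advance, 1.0, 1.0))  # → (0.00390625, 2, 0.75)
```

With dt = 1 and k = 3, the attempts at dt = 1 and dt = 1/2 both fail, and the interval is then covered in four substeps of 1/4. The last substep starts at t_i + 3 dt/4, which is the two-failure situation described in the text.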

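The first complication, restoring the ``old'' state after retries, can be sketched the same way. The Python below is a hypothetical single-level stand-in. LevelState, step_level, and toy_advance are illustrative names, not Castro or AMReX interfaces, and a plain list copy stands in for the backup that Castro's GPU build keeps in host pinned memory. As in the text, the backup is made only if the first attempt fails. Because a successful first attempt completes the step, the first failure always occurs before any substep has overwritten the old data.

```python
# Hypothetical single-level illustration; not Castro or AMReX code.
from dataclasses import dataclass


@dataclass
class LevelState:
    """Hypothetical mesh level holding data at an "old" and a "new" time."""
    old: list[float]
    new: list[float]
    t_old: float
    t_new: float

    def interpolate(self, t: float) -> list[float]:
        """Linear-in-time interpolation, as a subcycled fine level would need."""
        w = (t - self.t_old) / (self.t_new - self.t_old)
        return [(1.0 - w) * a + w * b for a, b in zip(self.old, self.new)]


def toy_advance(old: list[float], dt: float, k: float = 3.0) -> tuple[list[float], bool]:
    """Hypothetical advance: forward Euler for du/dt = -k u in every zone.
    Returns (new data, ok); a negative value in any zone counts as a failure."""
    new = [u * (1.0 - k * dt) for u in old]
    return new, all(u >= 0.0 for u in new)


def step_level(level: LevelState, dt: float, advance=toy_advance, max_failures: int = 20) -> int:
    """Advance level from t_i = level.t_old to t_i + dt with retries; return the failure count."""
    t_i = level.t_old
    backup = None  # copy of the "old" data at t_i, made only if the first try fails
    remaining, sub_dt, failures = dt, dt, 0
    while remaining > 0.0:
        sub_dt = min(sub_dt, remaining)
        new, ok = advance(level.old, sub_dt)  # reads "old", never modifies it
        if not ok:
            if backup is None:
                # The first failure is always on the first attempt (a successful
                # first attempt ends the step), so "old" still holds data at t_i.
                backup = list(level.old)
            failures += 1
            if failures > max_failures:
                raise RuntimeError("advance still failing after repeated halving")
            sub_dt /= 2.0
            continue
        level.new, level.t_new = new, level.t_old + sub_dt
        remaining -= sub_dt
        if remaining > 0.0:
            # Another substep follows: the current "new" data becomes "old".
            level.old, level.t_old = list(level.new), level.t_new
    if backup is not None:
        # Reset "old" to t_i so interpolation spans [t_i, t_i + dt] again.
        level.old, level.t_old = backup, t_i
    return failures


if __name__ == "__main__":
    level = LevelState(old=[1.0, 2.0], new=[1.0, 2.0], t_old=0.0, t_new=0.0)
    print(step_level(level, 1.0), level.t_old, level.t_new)  # → 2 0.0 1.0
```

Without the final reset, level.t_old would be 0.75 after two failures with dt = 1. In that case interpolate(0.5) would use a weight of -1, which means it would extrapolate instead of interpolate.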
retries/ws.bib

Lines changed: 36 additions & 0 deletions
@@ -12,6 +12,27 @@ @article{castro_joss
 journal = {Journal of Open Source Software}
 }
 
+@article{castro,
+author = {Almgren, A. S. and Beckner, V. E. and Bell, J. B. and Day, M. S. and Howell, L. H. and Joggerst, C. C. and Lijewski, M. J. and Nonaka, A. and Singer, M. and Zingale, M.},
+title = {Castro: A New Compressible Astrophysical Solver. I. Hydrodynamics And Self-Gravity},
+journal = {ApJ},
+archiveprefix = {arXiv},
+eprint = {1005.0114},
+keywords = {equation of state, gravitation, hydrodynamics, methods: numerical, nuclear reactions, nucleosynthesis, abundances},
+year = {2010},
+month = may,
+volume = {715},
+pages = {1221-1238},
+doi = {10.1088/0004-637x/715/2/1221},
+adsurl = {http://adsabs.harvard.edu/abs/2010ApJ...715.1221A},
+adsnote = {Provided by the SAO/NASA Astrophysics Data System},
+number = {2},
+source = {Crossref},
+url = {https://doi.org/10.1088/0004-637x/715/2/1221},
+publisher = {American Astronomical Society},
+issn = {0004-637X, 1538-4357},
+}
+
 @ARTICLE{positivity_preserving,
 author = {{Hu}, Xiangyu Y. and {Adams}, Nikolaus A. and {Shu}, Chi-Wang},
 title = "{Positivity-preserving method for high-order conservative schemes solving compressible Euler equations}",
@@ -47,3 +68,18 @@ @ARTICLE{MESA
 adsurl = {https://ui.adsabs.harvard.edu/abs/2011ApJS..192....3P},
 adsnote = {Provided by the SAO/NASA Astrophysics Data System}
 }
+
+@ARTICLE{berger_colella,
+author = {{Berger}, M.~J. and {Colella}, P.},
+title = "{Local Adaptive Mesh Refinement for Shock Hydrodynamics}",
+journal = {Journal of Computational Physics},
+keywords = {Conservation Laws, Grid Generation (Mathematics), Hydrodynamics, Shock Waves, Algorithms, Boundary Integral Method, Discontinuity, Error Analysis, Euler Equations Of Motion, Numerical Analysis, Fluid Mechanics and Heat Transfer},
+year = 1989,
+month = may,
+volume = {82},
+number = {1},
+pages = {64-84},
+doi = {10.1016/0021-9991(89)90035-1},
+adsurl = {https://ui.adsabs.harvard.edu/abs/1989JCoPh..82...64B},
+adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}

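The second complication in the new Implementation Details section of paper.tex is writing the advance as an idempotent operator. It can be illustrated by contrasting two hypothetical zone-by-zone advances of the same toy decay problem (du/dt = -k u, with a different k in each zone). The first, advance_in_place, overwrites its input and exits at the first zone that goes negative. The second, advance_new_only, reads the old data and writes only to a fresh list. Both names, and the choice to exit at the first bad zone, are assumptions made for illustration. Only the first attempt and its first retry are shown.

```python
# Hypothetical illustration of the idempotent-advance requirement; not Castro code.
from typing import List, Optional


def advance_in_place(u: List[float], rates: List[float], dt: float) -> bool:
    """Non-idempotent advance: overwrites each zone of u as it goes and exits at
    the first zone that would go negative, leaving earlier zones already updated."""
    for i, (ui, ki) in enumerate(zip(u, rates)):
        ui_new = ui * (1.0 - ki * dt)  # forward Euler for du/dt = -k u
        if ui_new < 0.0:
            return False  # zones 0..i-1 were already overwritten
        u[i] = ui_new
    return True


def advance_new_only(old: List[float], rates: List[float], dt: float) -> Optional[List[float]]:
    """Idempotent advance: reads old and writes only a fresh list.
    Returns the new data, or None on failure; old is untouched either way."""
    new = []
    for ui, ki in zip(old, rates):
        ui_new = ui * (1.0 - ki * dt)
        if ui_new < 0.0:
            return None  # exit immediately; there is nothing to undo
        new.append(ui_new)
    return new


if __name__ == "__main__":
    rates = [1.0, 3.0]  # zone 1 fails at dt = 0.5 because 3 * 0.5 > 1

    u = [1.0, 1.0]
    if not advance_in_place(u, rates, 0.5):
        advance_in_place(u, rates, 0.25)  # retry starts from partially updated data
    print(u)  # → [0.375, 0.25]; zone 0 was advanced by 0.5 and then by 0.25

    old = [1.0, 1.0]
    new = advance_new_only(old, rates, 0.5)  # None: failure, old untouched
    if new is None:
        new = advance_new_only(old, rates, 0.25)
    print(old, new)  # → [1.0, 1.0] [0.75, 0.25]
```

The correct result of a single dt = 0.25 step from [1.0, 1.0] is [0.75, 0.25]. Only the new-only version recovers it, and calling that version again with the same inputs returns the same list.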