Reading data, particularly data that will not be utilized in a computation, can
incur susbtantial overhead, particularly if the data is spread over multiple
files on a networked filesystem, where metadata queries can dominate the cost
of IO. yt
takes the approach of building a coarse-grained index based on
the discretization method of the data (particle, grid, octree, unstructured
mesh), combining this with datapoint-level indexing for selection processes.
To supplement this, methods in yt
that process data utilize a system of data
"chunking," whereby segments of data identified during coarse-grained indexing
are subdivided by one of a few different schemes and yielded to the iterating
function; these schemes can include a limited number of tuning parameters or
arguments. These three chunking methods are all
, spatial
and
io
. The all
method simply returns a single, one-dimensional
array, and the number of chunks is always exactly one; this enables both
non-parallel algorithms and simple access to small datasets. spatial
chunking yields three-dimensional arrays. For grid-based datasets,
these are the grids, while for particle and octree datasets they are
leaf-by-leaf collections of particles or mesh values. Optionally, the
spatial
chunking method can return "ghost zones" around regions, for
computation of stencils. The final type of chunking, io
, is designed
to iterate over sets of data in a manner that is most conducive to pipelined
IO. These will not always be load-balanced in size of the returned chunks,
however. In some cases, io
chunking may return one file at a time (in
the case of spreading items across many different files), while in others it
may be returning sub-components of a single file. This chunking type is the
most common strategy for parallel-decomposition.
Necessarily, both indexing and selection methods must be implemented to expose
these different chunking interfaces; yt
utilizes specific methods for each
of the primary data types that it can access. We detail these below,
specifically describing how they are implemented and how they can be improved
in future iterations.
yt
was originally written to support the Enzo code, which is a patch-based
Adaptive Mesh Refinement (AMR) simulation platform. Analysis of grid-based
data is the most frequent application of yt
. While we discuss much of the
techniques implemented for datasets consisting of multiple, potentially
overlapping grids, yt
also supports single-grid datasets (such as FITS
cubes) and is able to decompose them for parallel analysis.
yt
also supports other grid patch codes insert list here
yt
supports several different "features" of patch-based codes. These
include grids that span multiple parent objects, grids that overlap with
coarser data (i.e., AMR), grids that overlap with other grids that provide the
same level of resolution of data (i.e., grids at the same AMR level),
refinement factors that vary based on level, and edge- and vertex-centered
data. For the cases of overlapping grids (either on the same or higher
refinement levels) masks are generated that indicate which data is considered
authoritative.
As noted in Data Objects, the process of selecting points
is multi-step, starting at coarse selection that may be at the file level, and
proceeding to selection of specific data points that are included in a
selector. For grid-based data, the coarse selection stage proceeds in an
extremely simple fashion, by iterating over flat arrays of left and right grid
edges and creating a bitmap of the selected grids. Because this method --
while not taking advantage of any data structures of even mild sophistication
-- is able to take advantage of pipelining and cache-optimization, we have
found that it is sufficiently performant in most geometries up to approximately
Indexing grid data in yt
is optimized for systems of grids that tend to have
larger grid patches, rather than smaller; specifically, in yt
each grid
patch consists of a Python object, which adds a bit of overhead. In the limit
of many more cells than grid objects, this overhead is small, but in cases
where the number of grids is
To address both the memory overhead and the python overhead, as well as more
generally address potential scalability issues with grid selection, we have
begun implementation of a more sophisticated "grid visitors" indexing and
selection method. This draws on the approach used by the oct-visitors
(described below). A spatial tree is
constructed, wherein parent/child relationships are established between grids.
Each process -- selection, copying of data, generation of coordinates -- is
represented by an instance of a GridVisitor
object. The tree is
recursively traversed, and for all selected points the object is called. This
allows grids, their relationships, and the data masks to be stored in
structures and forms that are both optimized and compressed. This method is
essential for scaling to a large number of grid patches; the storage
requirements of a single grid patch Python object are around 1K per object
(about one gigabyte per million grids), whereas the optimized storage reduces
this to approximately 140 bytes (about one gigabyte per eight million grids),
with further reductions possible; for selection operations, we are also able to
reduce the number of temporary arrays and utilize compressed mask
representation, bringing peak memory usage down further. The spatial-tree
optimization substantially increases performance for "wide and shallow" dataset
selection.