
Drill Spill File Format

Paul Rogers edited this page Jun 19, 2017 · 2 revisions

Drill uses spill files in several operators: sort, hash aggregate (added in 1.11), and eventually hash join.

The format of the spill file is internal to Drill, not meant to be understood by other software.

Basic format:

  • Header
  • SV2 (optional)
  • Vectors

The code to read and write the spill file resides in VectorAccessibleSerializable.java.

Header

The header is a Protobuf encoded instance of UserBitShared.RecordBatchDef, defined in UserBitShared.proto as:

message RecordBatchDef {
  optional int32 record_count = 1;
  repeated SerializedField field = 2;
  optional bool carries_two_byte_selection_vector = 3;
}

message SerializedField {
  optional common.MajorType major_type = 1; // the type associated with this field.
  optional NamePart name_part = 2;
  repeated SerializedField child = 3; // only in the cases of type == MAP or REPEAT_MAP or REPEATED_LIST

  optional int32 value_count = 4;
  optional int32 var_byte_length = 5;
  optional int32 buffer_length = 7;
}

Note that the comment about child is wrong. The child list actually contains the "hidden" composite vectors for nullable, variable-width and repeated vectors. The hidden children have names starting with "$".

The above is basically the same information as carried by a VectorContainer, but in Protobuf format. Conversion functions exist to convert between the two formats. The key conversion is to/from a MaterializedField:

    SerializedField metaData = ...;
    MaterializedField field = MaterializedField.create(metaData);

A reader may wish to know the size of the deserialized data. The size can be inferred from the information in the header, adjusted for power-of-two rounding in actual buffer allocations. In practice, the operator that wrote the file is expected to understand the size of the data when read.
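As a rough sketch (not Drill code), the deserialized size can be estimated by summing the buffer lengths recorded in the header and rounding each allocation up to the next power of two; the helper names below are hypothetical:

```java
// Sketch (not Drill code): estimate the in-memory size of deserialized data
// from the buffer lengths recorded in the header, assuming the allocator
// rounds each buffer allocation up to the next power of two.
public class SpillSizeEstimate {

    // Hypothetical helper: round n up to the next power of two.
    static int nextPowerOfTwo(int n) {
        return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
    }

    // Sum of the header's buffer_length values, each rounded as allocated.
    static int estimatedInMemorySize(int[] bufferLengths) {
        int total = 0;
        for (int len : bufferLengths) {
            total += nextPowerOfTwo(len);
        }
        return total;
    }
}
```

The on-disk size, by contrast, is simply the unrounded sum of the buffer lengths.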

Selection Vector 2

The SV2 is present only if the carries_two_byte_selection_vector field is both present and set in the header.

The length of the vector is not serialized. Instead, it is inferred from the record count:

    final int dataLength = recordCount * SelectionVector2.RECORD_SIZE;

The on-disk length of the SV2 is exactly that computed above. But, the in-memory length is subject to power-of-two rounding in the memory allocator.
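A minimal sketch of the two lengths, assuming SelectionVector2.RECORD_SIZE is 2 (a two-byte index per selected record) and power-of-two rounding by the allocator:

```java
// Sketch (not Drill code): on-disk vs. in-memory SV2 length.
public class Sv2Length {

    static final int RECORD_SIZE = 2; // two bytes per selection index

    // Exact length written to the spill file.
    static int onDiskLength(int recordCount) {
        return recordCount * RECORD_SIZE;
    }

    // In-memory length after power-of-two rounding by the allocator.
    static int inMemoryLength(int recordCount) {
        int n = onDiskLength(recordCount);
        return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
    }
}
```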

Vectors

Value vectors are written in the order defined by the header in the field property. Top-level vectors are written and read by the `VectorAccessibleSerializable` class, but vectors inside maps are written by the map vector itself. (This redundancy should be cleaned up.)

For example, for a nullable vector:

[ bits vector | values vector ]

The vectors are written one after the other. Sizes are specified in the header.

In particular, for each buffer:

  • Size is stored in the buffer_length field of the SerializedField object in the header.
  • The size includes the concatenated lengths of all buffers that make up the vector:
      • For Required, fixed-width vectors, this is just the data buffer.
      • For Required, variable-width vectors, it includes the offset vector.
      • For Optional vectors, it includes the "bits" (AKA is-set) vector.
      • And so on.
  • The SerializedField metadata gives both a buffer size (buffer_length) and a value count (value_count). Code in each vector ensures that the two are consistent given the size of each vector data item.
  • Vector code "slices" the deserialized overall vector buffer into chunks for each composite vector.
  • Offsets to the second, third and later vectors into the vector buffer are computed as the sum of prior buffer lengths.
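The slicing arithmetic above can be sketched as follows; the helper is hypothetical, not Drill's actual slicing code:

```java
// Sketch (not Drill code): compute slice offsets for composite buffers within
// the single deserialized vector buffer. Each offset is the running sum of
// the lengths of all prior buffers.
public class BufferSlicer {

    static int[] sliceOffsets(int[] bufferLengths) {
        int[] offsets = new int[bufferLengths.length];
        int offset = 0;
        for (int i = 0; i < bufferLengths.length; i++) {
            offsets[i] = offset;
            offset += bufferLengths[i];
        }
        return offsets;
    }
}
```

For example, composite buffers of lengths 128, 512, and 1024 would be sliced at offsets 0, 128, and 640.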

The ordering of vectors is specified by each vector implementation using the "hidden" fields listed in the child field of SerializedField.

When writing, the buffers are written sequentially. When reading, the entire block of buffers is deserialized into a single, large buffer, then sliced to provide each composite buffer with its data. As a result, the memory size of the deserialized batch will differ from the memory size when written.

Memory Implications

When written, the original record batch has one of two formats:

  • Buffer per vector (for batches created by readers or other operators)
  • Buffer for entire batch (with vectors as slices) (if deserialized from the network)

Thus, the serialization format provides yet another memory format:

  • Buffer per top-level buffer (with composite vectors as slices)

Memory-aware code must therefore understand all three formats. Each stores the same data, but each has internal fragmentation (due to power-of-two rounding) that differs. Thus, the same batch will have three distinct memory sizes depending on the format. This fact can be seen more as a "bug" than a "feature."
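The effect can be illustrated with a hedged sketch: assume hypothetical leaf-buffer lengths for a batch and power-of-two rounding per allocation, then compute the footprint under each of the three layouts:

```java
// Sketch (not Drill code): the same batch data under the three memory layouts,
// assuming the allocator rounds each allocation up to the next power of two.
// Each inner array holds the hypothetical leaf-buffer lengths of one
// top-level vector (e.g. bits, offsets, data).
public class BatchMemoryDemo {

    // Hypothetical helper: round n up to the next power of two.
    static int pow2(int n) {
        return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
    }

    // Layout 1: a buffer per leaf vector; each leaf rounds individually.
    static int perLeafBuffer(int[][] vectors) {
        int total = 0;
        for (int[] v : vectors) {
            for (int len : v) {
                total += pow2(len);
            }
        }
        return total;
    }

    // Layout 3 (post-deserialization): one buffer per top-level vector;
    // the concatenated leaf lengths of each vector round together.
    static int perTopLevelBuffer(int[][] vectors) {
        int total = 0;
        for (int[] v : vectors) {
            int sum = 0;
            for (int len : v) {
                sum += len;
            }
            total += pow2(sum);
        }
        return total;
    }

    // Layout 2 (network deserialization): one buffer for the whole batch;
    // the grand total rounds once.
    static int singleBuffer(int[][] vectors) {
        int sum = 0;
        for (int[] v : vectors) {
            for (int len : v) {
                sum += len;
            }
        }
        return pow2(sum);
    }
}
```

With identical data, the per-leaf layout carries the most rounding waste (one rounding per leaf buffer) and the single-buffer layout the least (one rounding total), so the three reported sizes diverge.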
