
Data Types and Holders for UDFs

Paul Rogers edited this page Jan 6, 2018 · 8 revisions

Drill uses holders to pass information to and from UDFs. This page provides reference material for the most commonly used data types and their holders.

Holder Structure

Holders are simple structures. The following is a simplified form of the Float8Holder class:

public final class Float8Holder implements ValueHolder {

  public static final MajorType TYPE = Types.required(MinorType.FLOAT8);
  public static final int WIDTH = 8;

  public double value;
}

The class gives Drill's internal notation for a required (that is, non-nullable) FLOAT8, tells us that the data values are 8 bytes wide and, most importantly, provides the value as a Java double. (There are no getters or setters for the value; code generation does not use them.)

Holder Source

As per Drill's hard-boiled habit in such matters, the holder classes are documented neither online nor in the code. The source code (and now this page) is the only documentation. Holders are generated, so first do a Maven build of the Drill sources. Then look for the holders in the vector project, in the target/generated-sources folder, in the org.apache.drill.exec.expr.holders package.

Table of Holder Classes

The following is a table of the holder classes for each Drill type.

  • In Drill terminology, "Required" means non-nullable, "Optional" means nullable. (Note that there are no holders for array types. UDFs cannot access array fields.)
  • The type names are those used internally in the code in the MinorType enum which is accessible from the TYPE constant in each holder.
  • The name of required holders is <i>TypeName</i>Holder. Example: BigIntHolder.
  • The name of the optional holders is Nullable<i>TypeName</i>Holder. Example: NullableBigIntHolder.
  • Type names in bold are those most commonly used in UDFs.
  • The Java type listed is that of the value member in the holder (where applicable.)
Type             Holder Name      Java Type  Notes
TINYINT          TinyInt          byte       (2)
SMALLINT         SmallInt         short      (2)
INT              Int              int
BIGINT           BigInt           long
FLOAT4           Float4           float
FLOAT8           Float8           double
DECIMAL9         Decimal9         int        (3, 9)
DECIMAL18        Decimal18        long       (3, 9)
DECIMAL28SPARSE  Decimal28Sparse  ??         (3)
DECIMAL38SPARSE  Decimal38Sparse  ??         (3)
DECIMAL28DENSE   Decimal28Dense   ??         (4)
DECIMAL38DENSE   Decimal38Dense   ??         (4)
MONEY            N/A              N/A        (4)
DATE             Date             long       (5, 9)
TIME             Time             int        (5, 9)
INTERVAL         Interval         (various)  (9)
INTERVALDAY      IntervalDay      (various)  (9)
INTERVALYEAR     IntervalYear     int        (9)
TIMETZ           N/A              N/A        (4)
TIMESTAMPTZ      N/A              N/A        (4)
TIMESTAMP        N/A              N/A        (4)
VARCHAR          VarChar          N/A        (6)
VAR16CHAR        Var16Char        N/A        (6)
VARBINARY        VarBinary        N/A        (6)
BIT              N/A              N/A        (4)
FIXEDCHAR        N/A              N/A        (4)
FIXED16CHAR      N/A              N/A        (4)
FIXEDBINARY      N/A              N/A        (4)
UINT1            UInt1            byte       (7, 9)
UINT2            UInt2            char       (7, 9)
UINT4            UInt4            int        (7, 9)
UINT8            UInt8            long       (4, 9)
NULL             N/A              N/A        (4)
LATE             N/A              N/A        (4)
LIST             N/A              N/A        (4)
UNION            N/A              N/A        (4)
MAP              N/A              N/A        (1)
GENERIC_OBJECT   Object           Object     (8, 9)

Notes:

  1. Maps are not accessible from UDFs. Instead, you can project map members to the top level and apply a UDF to that one member:
SELECT log2(`map`.`member`) FROM ...
  2. These types are defined in Drill, but no reader currently produces columns of these types, so they are very seldom used.
  3. Decimal support in Drill is considered experimental and is disabled by default.
  4. These types are declared within Drill, but neither used nor tested. UDFs should not use them.
  5. Dates and times are stored in "pseudo-UTC": as milliseconds since the epoch (dates) or since midnight (times), but interpreted in the server's own time zone.
  6. Variable-width types require the use of Drill's direct memory buffer mechanism.
  7. The unsigned integers are used for internal purposes. No readers create such types and so the types are not used in queries.
  8. Generic objects are Java objects used only for system table queries.
  9. See below for how to interpret the value.

Java Primitive Types

As shown in the table above, a number of Drill types map directly into the Java primitive types: TINYINT, SMALLINT, INT, BIGINT, FLOAT4 and FLOAT8. In this case, just directly get or set the value member using the proper Java type. These types are, by far, the easiest to use in UDFs.
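As a rough illustration of "directly get or set the value member", the sketch below imitates what a log2 function body might do with a FLOAT8 input and output. It does not use Drill's classes: Float8HolderStandIn is a hypothetical stand-in that keeps only the value field described above, and Log2Sketch and eval are illustrative names, not Drill's UDF interface.

```java
/** Hypothetical sketch; none of these names come from Drill. */
final class Log2Sketch {

  /** Stand-in for the generated Float8Holder, reduced to its value field. */
  static final class Float8HolderStandIn {
    public double value;
  }

  /** Reads the input holder's value member and sets the output holder's. */
  static void eval(Float8HolderStandIn in, Float8HolderStandIn out) {
    out.value = Math.log(in.value) / Math.log(2.0);
  }

  public static void main(String[] args) {
    Float8HolderStandIn in = new Float8HolderStandIn();
    Float8HolderStandIn out = new Float8HolderStandIn();
    in.value = 8.0;
    eval(in, out);
    System.out.println(out.value);
  }
}
```

The point is only that the holder is a plain structure: there is no accessor to call, just a public field of the Java type listed in the table.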

DATE Type

The date holder provides a long value that gives the number of milliseconds since the epoch. Drill dates are not associated with a time zone, so the epoch is relative to whatever time zone the date is defined in. A common mistake is to convert the value to a Java Date object. Doing so is an error, however, because Date objects are defined as absolute time in UTC, but a Drill date has no associated time zone. (In fact, Drill's own code makes this mistake, which produces amusing results when the client and server are in different time zones.)

When sending a date to the client, the time zone is assumed to be the server's time zone, but is sent as UTC, causing the client to receive a date that is incorrect by the skew between the server time zone and UTC. Your UDF can't fix this, but knowing of this flaw can save you hours of debugging.

Although Drill's logic is technically incorrect, you can use the following code to convert a Drill date value from a long to a Joda DateTime object:

    public static DateTime longDateToDateTime(long value) {
      org.joda.time.DateTime date = new org.joda.time.DateTime(value, org.joda.time.DateTimeZone.UTC);
      return date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
    }

(The above is adapted from the getObject() method in the DateVector class.)
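If all you need is the calendar date, one way to sidestep the time zone question is to read the value as a zone-free date. The sketch below uses java.time instead of Joda, which is a substitution made here and not something Drill does. DrillDateSketch and toLocalDate are hypothetical names, and the only assumption carried over from the text is that the holder's value is milliseconds since the epoch with no zone attached.

```java
import java.time.LocalDate;

/** Hypothetical helper; not part of Drill. */
final class DrillDateSketch {

  private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

  /** Interprets a DATE holder value as a calendar date with no time zone. */
  static LocalDate toLocalDate(long value) {
    // floorDiv rounds toward negative infinity, so values before the epoch
    // land on the correct (earlier) day.
    return LocalDate.ofEpochDay(Math.floorDiv(value, MILLIS_PER_DAY));
  }

  public static void main(String[] args) {
    // One day's worth of milliseconds after the epoch.
    System.out.println(toLocalDate(86_400_000L));
  }
}
```

Because LocalDate carries no zone, no skew is introduced by the conversion itself; any skew already present in the value is of course still there.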

TIME Type

The time holder provides the time as an int value field with the time expressed as the number of milliseconds since midnight. Again, Drill times are not associated with a time zone (though Drill code occasionally has a bug that assumes UTC.)

Although Drill's logic is technically incorrect, you can use the following code to convert a Drill time value from an int to a Joda DateTime object:

    public static DateTime intTimeToDateTime(int value) {
      // Interpret the milliseconds as UTC, then relabel the same fields with the local zone.
      org.joda.time.DateTime time = new org.joda.time.DateTime(value, org.joda.time.DateTimeZone.UTC);
      return time.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
    }

(The above is adapted from the getObject() method in the TimeVector class.)
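Since a Drill time is just milliseconds since midnight, a zone-free representation may be a closer fit than a DateTime. The following sketch uses java.time.LocalTime, which is a substitution for the Joda code above; DrillTimeSketch and toLocalTime are hypothetical names.

```java
import java.time.LocalTime;

/** Hypothetical helper; not part of Drill. */
final class DrillTimeSketch {

  private static final long NANOS_PER_MILLI = 1_000_000L;

  /**
   * Interprets a TIME holder value (milliseconds since midnight) as a
   * time of day with no time zone. Values outside one day are rejected
   * by LocalTime with a DateTimeException.
   */
  static LocalTime toLocalTime(int value) {
    return LocalTime.ofNanoOfDay(value * NANOS_PER_MILLI);
  }

  public static void main(String[] args) {
    // 1 hour, 2 minutes, 3 seconds and 4 milliseconds after midnight.
    System.out.println(toLocalTime(3_723_004));
  }
}
```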

Decimal Types

Drill's decimal types are experimental and somewhat awkward and broken. Drill provides an encoding of decimal values, but provides no public API to decode the values.

Decimals are defined by two additional properties: precision and scale. In this case, the TYPE provided by the holder is incorrect because it does not have the precision and scale set. Instead, the values are provided in each holder:

public final class Decimal18Holder implements ValueHolder {

    public int scale;
    public int precision;
    // ... the value field and other members are omitted here
}

The values are essentially fixed for each use of your UDF. Precision is the total number of digits in the decimal number and will be at most the number of digits given in the type name (9 for Decimal9, 18 for Decimal18). The scale indicates how many of those digits are past the decimal point.
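To make precision and scale concrete, here is a small worked example using only java.math. It assumes, as the conversions in the following sections do, that the holder's value is the unscaled integer; DecimalScaleSketch and fromUnscaled are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

/** Hypothetical worked example; not part of Drill. */
final class DecimalScaleSketch {

  /** Combines an unscaled integer value with a scale. */
  static BigDecimal fromUnscaled(long value, int scale) {
    return new BigDecimal(BigInteger.valueOf(value), scale);
  }

  public static void main(String[] args) {
    // Unscaled value 12345 with scale 2: five digits in total,
    // two of them after the decimal point.
    BigDecimal d = fromUnscaled(12345L, 2);
    System.out.println(d);             // prints 123.45
    System.out.println(d.precision()); // prints 5
    System.out.println(d.scale());     // prints 2
  }
}
```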

DECIMAL9

The DECIMAL9 holder provides the value as an int. You can convert the value to a Java BigDecimal as follows:

    public static BigDecimal decimal9ToBigDecimal(int value, int scale) {
      return new BigDecimal(BigInteger.valueOf(value), scale);
    }

(The above is adapted from the getObject() method in the Decimal9Vector class.)

DECIMAL18

The DECIMAL18 holder provides the value as a long. You can convert the value to a Java BigDecimal as follows:

    public static BigDecimal decimal18ToBigDecimal(long value, int scale) {
      return new BigDecimal(BigInteger.valueOf(value), scale);
    }

(The above is adapted from the getObject() method in the Decimal18Vector class.)

Other Decimal Types

The other decimal types have complex encodings that are not documented here because the Drill decimal support is incomplete, buggy and subject to revision. If you must use these types, consult the getObject() methods in the corresponding vector class to learn how to convert the encoded value to a Java BigDecimal object.

Interval Types

The interval types express a duration in time represented as a set of fields. To perform operations on an interval, it is convenient to convert it to a Joda Period object as shown for each type. Unfortunately, the code that does these conversions is deep inside the interval vectors and is not available as an API, so you'll have to roll your own.

INTERVALDAY

The INTERVALDAY type provides a holder with the following fields:

    public int days;
    public int milliseconds;

Convert these to a Period type as follows:

    public static Period intervalDayToPeriod(int days, int millis) {
      final Period p = new Period();
      return p.plusDays(days).plusMillis(millis);
    }

(The above is adapted from the getObject() method in the IntervalDayVector class.)
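If Joda is not available in your UDF, a similar conversion can be sketched with java.time.Duration. This is a substitution, not Drill's approach, and it carries one assumption worth stating loudly: Duration treats every day as exactly 24 hours, whereas a Joda Period keeps days as a separate calendar field. IntervalDaySketch and toDuration are hypothetical names.

```java
import java.time.Duration;

/** Hypothetical helper; not part of Drill. */
final class IntervalDaySketch {

  /**
   * Collapses the days and milliseconds fields of an INTERVALDAY holder
   * into one Duration. Assumes each day is exactly 24 hours long.
   */
  static Duration toDuration(int days, int millis) {
    return Duration.ofDays(days).plusMillis(millis);
  }

  public static void main(String[] args) {
    // One day plus half a second.
    System.out.println(toDuration(1, 500).toMillis());
  }
}
```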

INTERVALYEAR

The INTERVALYEAR type provides a holder with a single field that gives the duration as number of months:

    public int value;

Convert these to a Period type as follows:

    public static Period intervalYearToPeriod(int value) {
      final int years  = (value / org.apache.drill.exec.expr.fn.impl.DateUtility.yearsToMonths);
      final int months = (value % org.apache.drill.exec.expr.fn.impl.DateUtility.yearsToMonths);
      final Period p = new Period();
      return p.plusYears(years).plusMonths(months);
    }

(The above is adapted from the getObject() method in the IntervalYearVector class.)
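The same split into years and months can be sketched with java.time.Period, again as a substitution for Joda. The constant 12 below stands in for DateUtility.yearsToMonths and is an assumption made here (twelve months per year); IntervalYearSketch and toPeriod are hypothetical names.

```java
import java.time.Period;

/** Hypothetical helper; not part of Drill. */
final class IntervalYearSketch {

  // Assumed value of DateUtility.yearsToMonths.
  private static final int MONTHS_PER_YEAR = 12;

  /** Splits an INTERVALYEAR holder value (total months) into years and months. */
  static Period toPeriod(int value) {
    int years = value / MONTHS_PER_YEAR;
    int months = value % MONTHS_PER_YEAR;
    return Period.of(years, months, 0);
  }

  public static void main(String[] args) {
    // 27 months is 2 years and 3 months.
    Period p = toPeriod(27);
    System.out.println(p.getYears() + " years, " + p.getMonths() + " months");
  }
}
```

Java's integer division and remainder both truncate toward zero, so a negative interval such as -27 months comes out as -2 years and -3 months, with both fields carrying the sign.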

INTERVAL

The INTERVAL type provides a holder with the following fields:

    public int months;
    public int days;
    public int milliseconds;

Convert these to a Period type as follows:

    public static Period intervalToPeriod(int months, int days, int millis) {
      final Period p = new Period();
      return p.plusMonths(months).plusDays(days).plusMillis(millis);
    }

(The above is adapted from the getObject() method in the IntervalVector class.)

Unsigned Types

Drill provides four unsigned types: UINT1, UINT2, UINT4 and UINT8. As noted in the table, no readers use these vectors, though they are used internally within Drill. (The UINT1 vector stores the isSet flags for a nullable vector; the UINT4 vector stores the offsets for a repeated or variable-width vector.)

Drill maps the values of the field into a Java primitive type of the same width, as shown in the table. But Java provides only signed integers. So, for example, an unsigned UINT1 value of 255 becomes a byte value of -1. The exception is that Drill somewhat abuses the char type (an unsigned 16-bit value) to store a UINT2.

Fortunately, since no readers use these types, you will probably never have to use them in your UDFs.
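If you do end up with one of these values, the standard library can recover the unsigned interpretation. The sketch below uses only JDK methods (Byte.toUnsignedInt, Integer.toUnsignedLong, Long.toUnsignedString); the class and method names are hypothetical and nothing here is Drill API.

```java
/** Hypothetical helper; not part of Drill. */
final class UnsignedSketch {

  /** UINT1: widen the signed byte to its unsigned value, 0 to 255. */
  static int uint1(byte value) {
    return Byte.toUnsignedInt(value);
  }

  /** UINT2: char is already unsigned, so plain widening is enough. */
  static int uint2(char value) {
    return value;
  }

  /** UINT4: widen the signed int to its unsigned value in a long. */
  static long uint4(int value) {
    return Integer.toUnsignedLong(value);
  }

  /** UINT8: no wider primitive exists, so format the bits as unsigned text. */
  static String uint8(long value) {
    return Long.toUnsignedString(value);
  }

  public static void main(String[] args) {
    System.out.println(uint1((byte) -1)); // prints 255
    System.out.println(uint4(-1));        // prints 4294967295
    System.out.println(uint8(-1L));       // prints 18446744073709551615
  }
}
```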

Generic Object Type

The GENERIC_OBJECT type is used for system tables. The holder for this type is marked as deprecated, so it is not clear whether one can write a UDF that works with objects. (Perhaps someone can experiment and report back so we can add details here.)

Variable Width Types

Drill supports three variable-width types: VARCHAR, VAR16CHAR and VARBINARY. VAR16CHAR is unused, so you really only need to work with the other two. VARCHAR represents text and so is very common. VARBINARY represents raw values from HBase and MapR-DB binary tables, which is handy if you want to write your own function to convert a binary value to some other type.

Because of the complexity of working with these types, they are discussed on a separate page.
