Data Types and Holders for UDFs
Drill uses holders to pass information to and from UDFs. This page provides reference material for the most commonly used data types and their holders.
Holders are simple structures. The following is a simplified form of the `Float8Holder` class:

    public final class Float8Holder implements ValueHolder {
      public static final MajorType TYPE = Types.required(MinorType.FLOAT8);
      public static final int WIDTH = 8;
      public double value;
    }

The class tells us Drill's internal notation for a required (that is, "non-nullable") `FLOAT8`. It also tells us that the data values are 8 bytes wide. And, most importantly, it gives us the value as a Java `double`. (There are no getters or setters for the value; code generation does not use them.)
As per Drill's hard-boiled habit in such matters, the holder classes are not documented in online documentation or the code. The source code (and now this page) is the only documentation. Holders are generated, so do a Maven build of the Drill sources. Then, look for the holders in the `vector` project, in the `target/generated-sources` folder, in the `org.apache.drill.exec.expr.holders` package.
The following is a table of the holder classes for each Drill type.
- In Drill terminology, "Required" means non-nullable, "Optional" means nullable. (Note that there are no holders for array types. UDFs cannot access array fields.)
- The type names are those used internally in the code in the `MinorType` enum, which is accessible from the `TYPE` constant in each holder.
- The name of required holders is <code><i>TypeName</i>Holder</code>. Example: `BigIntHolder`.
- The name of the optional holders is <code>Nullable<i>TypeName</i>Holder</code>. Example: `NullableBigIntHolder`.
- Type names in bold are those most commonly used in UDFs.
- The Java type listed is that of the `value` member in the holder (where applicable).
Type | Holder Name | Java Type | Notes |
---|---|---|---|
TINYINT | TinyInt | byte | (2) |
SMALLINT | SmallInt | short | (2) |
INT | Int | int | |
BIGINT | BigInt | long | |
FLOAT4 | Float4 | float | |
FLOAT8 | Float8 | double | |
DECIMAL9 | Decimal9 | int | (3, 9) |
DECIMAL18 | Decimal18 | long | (3, 9) |
DECIMAL28SPARSE | Decimal28Sparse | ?? | (3) |
DECIMAL38SPARSE | Decimal38Sparse | ?? | (3) |
DECIMAL28DENSE | Decimal28Dense | ?? | (4) |
DECIMAL38DENSE | Decimal38Dense | ?? | (4) |
MONEY | N/A | N/A | (4) |
DATE | Date | long | (5, 9) |
TIME | Time | int | (5, 9) |
INTERVAL | Interval | (various) | (9) |
INTERVALDAY | IntervalDay | (various) | (9) |
INTERVALYEAR | IntervalYear | int | (9) |
TIMETZ | N/A | N/A | (4) |
TIMESTAMPTZ | N/A | N/A | (4) |
TIMESTAMP | N/A | N/A | (4) |
VARCHAR | VarChar | N/A | (6) |
VAR16CHAR | Var16Char | N/A | (6) |
VARBINARY | VarBinary | N/A | (6) |
BIT | N/A | N/A | (4) |
FIXEDCHAR | N/A | N/A | (4) |
FIXED16CHAR | N/A | N/A | (4) |
FIXEDBINARY | N/A | N/A | (4) |
UINT1 | UInt1 | byte | (7, 9) |
UINT2 | UInt2 | char | (7, 9) |
UINT4 | UInt4 | int | (7, 9) |
UINT8 | UInt8 | long | (4, 9) |
NULL | N/A | N/A | (4) |
LATE | N/A | N/A | (4) |
LIST | N/A | N/A | (4) |
UNION | N/A | N/A | (4) |
MAP | N/A | N/A | (1) |
GENERIC_OBJECT | Object | Object | (8, 9) |
Notes:
1. Maps are not accessible from UDFs. Instead, you can project map members to the top level and apply a UDF to that one member:

        SELECT log2(`map`.`member`) FROM ...

2. These types are defined in Drill, but no reader currently produces columns of that type, so they are very seldom used.
3. Decimal support in Drill is considered experimental and is disabled by default.
4. These types are declared within Drill, but neither used nor tested. UDFs should not use them.
5. Dates and times are stored in "pseudo-UTC" as milliseconds (since the epoch for dates, since midnight for times), but in the server's own time zone.
6. Variable-width types require the use of Drill's direct memory buffer mechanism.
7. The unsigned integers are used for internal purposes. No readers create such types and so the types are not used in queries.
8. Generic objects are Java objects used only for system table queries.
9. See below for how to interpret the value.
As shown in the table above, a number of Drill types map directly into the Java primitive types: `TINYINT`, `SMALLINT`, `INT`, `BIGINT`, `FLOAT4` and `FLOAT8`. In this case, just directly get or set the `value` member using the proper Java type. These types are, by far, the easiest to use in UDFs.
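As a minimal sketch of that get-and-set pattern, the following uses hypothetical stand-in classes (`Float8HolderSketch`, `BigIntHolderSketch`) rather than Drill's generated holders, so that it compiles on its own. The method names are also illustrative and are not part of any Drill API.

```java
// Hypothetical stand-ins for Drill's generated holders, so this sketch compiles on its own.
final class Float8HolderSketch {
    public double value;
}

final class BigIntHolderSketch {
    public long value;
}

final class PrimitiveHolderSketch {
    // Reads the input holder's value and writes the base-2 logarithm into the output holder.
    static void log2(Float8HolderSketch in, Float8HolderSketch out) {
        out.value = Math.log(in.value) / Math.log(2.0);
    }

    // Adds two BIGINT values by reading and writing the public value members directly.
    static void add(BigIntHolderSketch left, BigIntHolderSketch right, BigIntHolderSketch out) {
        out.value = left.value + right.value;
    }
}
```

In a real UDF the holders would be the generated classes, but the body would probably look much the same: plain reads and writes of `value`, with no getters or setters.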
The date holder provides a `long` value that gives the number of milliseconds since the epoch. Drill dates are not associated with a time zone, so the epoch is relative to whatever time zone the date is defined in. A common mistake is to convert the value to a Java `Date` object. Doing so is an error, however, because `Date` objects are defined as absolute time in UTC, but a Drill date has no associated time zone. (In fact, Drill's own code makes this mistake, which produces amusing results when the client and server are in different time zones.)
When sending a date to the client, the time zone is assumed to be the server's time zone, but is sent as UTC, causing the client to receive a date that is incorrect by the skew between the server time zone and UTC. Your UDF can't fix this, but knowing of this flaw can save you hours of debugging.
Although Drill's logic is technically incorrect, you can use the following code to convert a Drill date value from a `long` to a Joda `DateTime` object:

    public static DateTime longDateToDateTime(long value) {
      org.joda.time.DateTime date = new org.joda.time.DateTime(value, org.joda.time.DateTimeZone.UTC);
      return date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
    }

(The above is adapted from the `getObject()` method in the `DateVector` class.)
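If Joda is not a requirement, a zone-free alternative may be easier to reason about. The sketch below is an assumption on my part rather than anything Drill provides: it uses the standard `java.time` API (Java 8 or later) and a hypothetical class name, and it reads the millisecond count as calendar fields only, so no time zone is ever attached.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

final class DrillDateSketch {
    // Interprets the holder's millisecond count as calendar fields with no time zone attached.
    // UTC is used only as a neutral offset for the arithmetic; the result carries no zone.
    static LocalDate toLocalDate(long value) {
        return Instant.ofEpochMilli(value).atZone(ZoneOffset.UTC).toLocalDate();
    }
}
```

Because a `LocalDate` has no zone, the result is the same on every machine, which matches the idea that a Drill date is not tied to a time zone.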
The time holder provides the time as an `int` `value` field with the time expressed as the number of milliseconds since midnight. Again, Drill times are not associated with a time zone (though Drill code occasionally has a bug that assumes UTC).

Although Drill's logic is technically incorrect, you can use the following code to convert a Drill time value from an `int` to a Joda `DateTime` object:

    public static DateTime intTimeToDateTime(int value) {
      // value is the number of milliseconds since midnight; it widens to long here.
      org.joda.time.DateTime time = new org.joda.time.DateTime(value, org.joda.time.DateTimeZone.UTC);
      return time.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
    }

(The above is adapted from the `getObject()` method in the `TimeVector` class.)
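Since a Drill time is just milliseconds since midnight, it can also be read as a zone-free time of day. The following sketch assumes the standard `java.time` API instead of Joda, and the class and method names are hypothetical.

```java
import java.time.LocalTime;

final class DrillTimeSketch {
    // Converts milliseconds since midnight into a time of day with no time zone attached.
    // Throws java.time.DateTimeException if the value lies outside a single day.
    static LocalTime toLocalTime(int value) {
        return LocalTime.ofNanoOfDay(value * 1_000_000L);
    }
}
```

For instance, a holder value of 3,600,000 corresponds to 01:00.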
Drill's decimal types are experimental and somewhat awkward and broken. Drill provides an encoding of decimal values, but provides no public API to decode the values.

Decimals are defined by two additional properties: precision and scale. In this case, the `TYPE` provided by the holder is incorrect: it does not have the precision and scale set. Instead, the values are provided in each holder:

    public final class Decimal18Holder implements ValueHolder {
      public int scale;
      public int precision;
      // Other members omitted.
    }

The values are essentially fixed for each use of your UDF. Precision is the total number of digits in the decimal number and will be no greater than the length given in the name (such as `Decimal9` or `Decimal18`). The scale indicates the number of those digits that are past the decimal point.
The `DECIMAL9` holder provides the value as an `int`. You can convert the value to a Java `BigDecimal` as follows:

    public static BigDecimal decimal9ToBigDecimal(int value, int scale) {
      return new BigDecimal(BigInteger.valueOf(value), scale);
    }

(The above is adapted from the `getObject()` method in the `Decimal9Vector` class.)
The `DECIMAL18` holder provides the value as a `long`. You can convert the value to a Java `BigDecimal` as follows:

    public static BigDecimal decimal18ToBigDecimal(long value, int scale) {
      return new BigDecimal(BigInteger.valueOf(value), scale);
    }

(The above is adapted from the `getObject()` method in the `Decimal18Vector` class.)
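Going the other way, from a `BigDecimal` to the unscaled integer that a holder stores, is not covered above. The sketch below is one plausible approach and not Drill's own code: the class name, method names, and the choice of `RoundingMode.HALF_UP` are all assumptions, and it does not check the result against the holder's precision.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class DecimalEncodeSketch {
    // Rescales the input to the holder's scale, then returns the unscaled digits as an int.
    // HALF_UP rounding is an assumption; intValueExact() throws if the digits do not fit.
    static int toDecimal9(BigDecimal input, int scale) {
        return input.setScale(scale, RoundingMode.HALF_UP).unscaledValue().intValueExact();
    }

    // Same idea for DECIMAL18, where the unscaled digits are held in a long.
    static long toDecimal18(BigDecimal input, int scale) {
        return input.setScale(scale, RoundingMode.HALF_UP).unscaledValue().longValueExact();
    }
}
```

With a scale of 2, the decimal 123.45 is stored as the integer 12345, which is the inverse of the `BigDecimal` conversions shown earlier.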
The other decimal types have complex encodings that are not documented here because the Drill decimal support is incomplete, buggy and subject to revision. If you must use these types, consult the `getObject()` methods in the corresponding vector class to learn how to convert the encoded value to a Java `BigDecimal` object.
The interval types express a duration in time represented as a set of fields. To perform operations on an interval, it is convenient to convert it to a Joda `Period` object as shown for each type. Unfortunately, the code that does these conversions is deep inside the interval vectors and is not available as an API, so you'll have to roll your own.
The `INTERVALDAY` type provides a holder with the following fields:

    public int days;
    public int milliseconds;

Convert these to a `Period` type as follows:

    public static Period intervalDayToPeriod(int days, int millis) {
      final Period p = new Period();
      return p.plusDays(days).plusMillis(millis);
    }

(The above is adapted from the `getObject()` method in the `IntervalDayVector` class.)
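If you prefer to avoid Joda, the same two fields fit a `java.time.Duration`. This is a sketch under that assumption, with hypothetical class and method names. Note that `Duration` treats a day as exactly 24 hours, which may or may not match what your function needs.

```java
import java.time.Duration;

final class IntervalDaySketch {
    // Combines the holder's days and milliseconds fields into one fixed-length duration.
    static Duration toDuration(int days, int milliseconds) {
        return Duration.ofDays(days).plusMillis(milliseconds);
    }
}
```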
The `INTERVALYEAR` type provides a holder with a single field that gives the duration as a number of months:

    public int value;

Convert these to a `Period` type as follows:

    public static Period intervalYearToPeriod(int value) {
      final int years = (value / org.apache.drill.exec.expr.fn.impl.DateUtility.yearsToMonths);
      final int months = (value % org.apache.drill.exec.expr.fn.impl.DateUtility.yearsToMonths);
      final Period p = new Period();
      return p.plusYears(years).plusMonths(months);
    }

(The above is adapted from the `getObject()` method in the `IntervalYearVector` class.)
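The split into years and months can also be done without Drill's `DateUtility` or Joda. The sketch below assumes the standard `java.time.Period` class (not the Joda class of the same name) and a hypothetical wrapper; `normalized()` performs the same divide-by-12 split.

```java
import java.time.Period;

final class IntervalYearSketch {
    // Converts a month count into years and months, using 12 months per year.
    static Period toPeriod(int value) {
        return Period.ofMonths(value).normalized();
    }
}
```

A holder value of 14 months becomes a period of 1 year and 2 months.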
The `INTERVAL` type provides a holder with the following fields:

    public int months;
    public int days;
    public int milliseconds;

Convert these to a `Period` type as follows:

    public static Period intervalToPeriod(int months, int days, int millis) {
      final Period p = new Period();
      return p.plusMonths(months).plusDays(days).plusMillis(millis);
    }

(The above is adapted from the `getObject()` method in the `IntervalVector` class.)
Drill provides four unsigned types: `UINT1`, `UINT2`, `UINT4` and `UINT8`. As noted in the table, no readers use these vectors, though they are used internally within Drill. (The `UINT1` type stores the `isSet` flags for a nullable vector; the `UINT4` vector stores the offsets for a repeated or variable-width vector.)

Drill maps the values of the field into a Java primitive type of the same width as shown in the table. But Java provides only signed integers. So, for example, an unsigned `UINT1` value of 255 becomes a `byte` value of -1. The exception is that Drill somewhat abuses the `char` type (an unsigned 16-bit value) to store a `UINT2`.
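Should you ever need the true unsigned value, the Java standard library can recover it from the signed representation. This is a small sketch using only `java.lang` methods; the class and method names are hypothetical.

```java
final class UnsignedSketch {
    // UINT1 arrives as a signed byte; widen it to an int in the range 0..255.
    static int uint1(byte value) {
        return Byte.toUnsignedInt(value);
    }

    // UINT2 arrives as a char, which is already unsigned; widening to int keeps 0..65535.
    static int uint2(char value) {
        return value;
    }

    // UINT4 arrives as a signed int; widen it to a long in the range 0..4294967295.
    static long uint4(int value) {
        return Integer.toUnsignedLong(value);
    }

    // UINT8 has no wider primitive, so render it as an unsigned decimal string.
    static String uint8(long value) {
        return Long.toUnsignedString(value);
    }
}
```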
Fortunately, since no readers use these types, you will probably never have to use them in your UDFs.
The `GENERIC_OBJECT` type is used for system tables. The holder for this type is marked as deprecated, so it is not clear whether one can write a UDF that works with objects. (Perhaps someone can experiment and report back so we can add details here.)
Drill supports three variable-width types: `VARCHAR`, `VAR16CHAR` and `VARBINARY`. The `VAR16CHAR` type is unused, so you really only need to work with the other two. `VARCHAR` represents text and so is very common. `VARBINARY` represents raw values from HBase and MapR-DB binary, which is handy if you want to write your own function to convert a binary value to some other type.
Because of the complexity of working with these types, they are discussed on a separate page.