UDFs Background Information

Paul Rogers edited this page Jan 6, 2018 · 31 revisions

Introduction

Drill provides documentation about how to create a User Defined Function (UDF). The information is procedural and walks you through the steps. While this is a great start, some people would like to know what is happening "behind the scenes." That is the topic of this page.

If there is only one message you take away from this page, let it be:


Drill UDFs are NOT Java


Instead, they use a Drill-specific domain-specific language (DSL) that happens to be expressed in a subset of Java. Use only those Java constructs that Drill specifically allows.

The material here describes the theory of Drill's UDF support so you know what is going on behind the scenes. We then present a simple framework to make UDFs easier to develop and suggest debugging strategies. Finally, we present a troubleshooting guide to the many things that can go wrong, what they mean, and how to correct the problems.

To avoid excessive duplication, this page assumes you are familiar with the existing documentation. We'll touch on some sections to offer simpler alternatives, but mostly rely on the Drill documentation for the basics, such as setting up a Maven project.

Topics

At first glance, Drill UDFs appear to have an odd structure. After all, Java supports functions and that is all a UDF is, really. But, it seems that Drill UDFs evolved from Hive UDFs, then the design was adjusted to fit the code generation model used within Drill's own operators. The result is a complex interface unique to Drill.

Argument and Return Semantics

The above may seem a bit odd: why are we declaring fields in the class to pass in values to a function? Two reasons.

First, the above structure is overkill for a true function such as this one, but is necessary when we look at aggregate functions.

Second, Drill generates Java code to call each function. Presumably this structure is simpler than a true function call because of the way that Drill optimizes function calls. (More on this topic later.)

For now, let's just remember to use the argument structure.

Parameter and Return Value Types

In general, a UDF is a function of zero or more arguments that returns a single value:

y = sin(x)

Although Drill is schema-free at the level of input files, it turns out Drill is strongly typed internally. As a result, the arguments and return value above must have a declared type. (We'll see later how Drill matches types between your function and the Drill columns stored in value vectors.) So, we really need something more like:

double y = sin(double x)

Systems such as JavaScript or Groovy use introspection to get this type information directly from Java.

Drill uses introspection, but with hints based on a set of Drill-defined annotations.
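
To make the idea of "introspection with annotation hints" concrete, here is a minimal sketch in plain Java reflection. It is not Drill's actual mechanism: the @Param and @Output annotations and the Float8Holder class below are hypothetical stand-ins, modeled on the Drill names but defined locally so the example compiles on its own. SignatureSketch and describe() are likewise invented for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Drill's annotations, defined here so the
// sketch is self-contained. They are not the real Drill classes.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Param { }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Output { }

// Stand-in holder with a single public field.
class Float8Holder {
  public double value;
}

// A function class that declares its argument and return value as fields.
class SinFunction {
  @Param Float8Holder x;
  @Output Float8Holder out;
}

class SignatureSketch {
  // Recover a typed signature from the annotated fields of a class.
  static String describe(Class<?> fn) {
    List<String> params = new ArrayList<>();
    String output = "?";
    for (Field f : fn.getDeclaredFields()) {
      String entry = f.getType().getSimpleName() + " " + f.getName();
      if (f.isAnnotationPresent(Param.class)) {
        params.add(entry);
      } else if (f.isAnnotationPresent(Output.class)) {
        output = entry;
      }
    }
    return output + " = f(" + String.join(", ", params) + ")";
  }

  public static void main(String[] args) {
    System.out.println(describe(SinFunction.class));
  }
}
```

The point of the sketch is only that the annotations tell the reader of the class which fields are inputs and which is the output; the field types then supply the "double y = sin(double x)" style of type information.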

Holders

Next, we note that the arguments are of a type called Float8Holder rather than a Java double. There are three reasons for this, which we will explore in more depth later:

  • The holder structure is convenient for code generation.
  • The holders can store not just the value but also whether the value is null.
  • Some types (such as VARCHAR) require more than just a simple value.

Different holder types exist for each Drill data type and cardinality (nullable, non-nullable or repeated). Here is the (abbreviated) definition of the Float8Holder:

public final class Float8Holder implements ValueHolder {

  public static final MajorType TYPE = Types.required(MinorType.FLOAT8);
  public static final int WIDTH = 8;

  public double value;

  // ... remaining members omitted
}

The class tells us three things. TYPE gives Drill's internal notation for a required (that is, "non-nullable") FLOAT8. WIDTH tells us that the data values are 8 bytes wide. And, most importantly, value gives us the data as a Java double. (There are no getters or setters for the value; code generation does not use them.)

So, this looks pretty simple: we get our input value from x.value and we put our return value into out.value. Not quite as easy as using Java semantics, but not hard.
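
The second reason in the list above (holders can carry null state) is easier to see with a nullable holder. The sketch below uses a locally defined stand-in for NullableFloat8Holder with an isSet flag next to the value; it is not the real Drill class. NullableSinSketch and its static eval() are hypothetical helpers used only to show the null check.

```java
// Stand-in for a nullable holder: the value plus a flag saying whether
// the value is present. Defined locally; not the real Drill class.
final class NullableFloat8Holder {
  public int isSet;     // 1 = value present, 0 = SQL NULL
  public double value;
}

final class NullableSinSketch {
  // Hypothetical helper: propagate NULL, otherwise compute sin(x).
  static void eval(NullableFloat8Holder x, NullableFloat8Holder out) {
    if (x.isSet == 0) {
      out.isSet = 0;    // NULL in, NULL out; out.value is left untouched
      return;
    }
    out.isSet = 1;
    out.value = Math.sin(x.value);
  }

  public static void main(String[] args) {
    NullableFloat8Holder x = new NullableFloat8Holder();
    NullableFloat8Holder out = new NullableFloat8Holder();

    x.isSet = 0;        // simulate a NULL input
    eval(x, out);
    System.out.println("isSet after NULL input: " + out.isSet);

    x.isSet = 1;        // simulate a real input
    x.value = 0.0;
    eval(x, out);
    System.out.println("isSet after real input: " + out.isSet);
  }
}
```

A plain Java double has no way to say "no value here", which is why a holder with public fields is used instead of a primitive.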

Function Class Methods

Back when we asked the IDE to create our function class, it created two empty methods:

  @Override
  public void setup() { }

  @Override
  public void eval() { }

Because we are writing a simple function (one value in, one value out), we can ignore the setup() method for now. We will instead focus on the one that does the real work: eval(). Let's implement our sin function:

  @Override
  public void eval() {
    out.value = Math.sin(x.value);
  }

We cheated: we just let Java do the real work.
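
To see how the pieces fit together, here is a rough, self-contained simulation of a caller driving the function: fill x.value, call eval(), read out.value. Everything in it is an assumption made for illustration. SimpleFunc and Float8Holder are local stand-ins rather than Drill classes, and CallerSketch is a hypothetical driver, not the code Drill generates. In particular, the sketch creates the holders itself only because nothing else supplies them here.

```java
// Local stand-in for the holder; not the real Drill class.
final class Float8Holder {
  public double value;
}

// Local stand-in for the interface that supplies setup() and eval().
interface SimpleFunc {
  void setup();
  void eval();
}

final class SinFunction implements SimpleFunc {
  // Created here only so the sketch runs on its own.
  Float8Holder x = new Float8Holder();
  Float8Holder out = new Float8Holder();

  @Override
  public void setup() { }

  @Override
  public void eval() {
    out.value = Math.sin(x.value);
  }
}

final class CallerSketch {
  // Hypothetical driver: one setup() call, then one eval() per input value.
  static double[] applyToColumn(double[] column) {
    SinFunction fn = new SinFunction();
    fn.setup();
    double[] result = new double[column.length];
    for (int i = 0; i < column.length; i++) {
      fn.x.value = column[i];     // pass the argument in through the holder
      fn.eval();                  // no arguments, no return value
      result[i] = fn.out.value;   // read the result back from the holder
    }
    return result;
  }

  public static void main(String[] args) {
    for (double y : applyToColumn(new double[] {0.0, Math.PI / 2})) {
      System.out.println(y);
    }
  }
}
```

Note that eval() takes no parameters and returns nothing: all data moves through the fields, which is the argument structure described earlier.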
