UDFs Background Information

Paul Rogers edited this page Jan 5, 2018 · 31 revisions

Introduction

Drill provides documentation about how to create a User Defined Function (UDF). The information is procedural and walks you through the steps. While this is a great start, some people would like to know what is happening "behind the scenes." That is the topic of this page.

If there is only one message you take away from this page, let it be:


Drill UDFs are NOT Java


Instead, they use a Drill-specific domain-specific language (DSL) that happens to be expressed in a subset of Java. Use only those Java constructs that Drill specifically allows.

The material here describes the theory of Drill's UDF support so you know what is going on behind the scenes. We then present a simple framework to make UDFs easier to develop and suggest debugging strategies. Finally we present a troubleshooting guide of the many things that will go wrong, what they mean, and how to correct the problems.

To avoid excessive duplication, this page assumes you are familiar with the existing documentation. We'll touch on some sections to offer simpler alternatives, but mostly count on the Drill documentation for the basics of setting up a Maven project, etc.

Topics

At first glance, Drill UDFs appear to have an odd structure. After all, Java supports functions and that is all a UDF is, really. But, it seems that Drill UDFs evolved from Hive UDFs, then the design was adjusted to fit the code generation model used within Drill's own operators. The result is a complex interface unique to Drill.

A drawback of Drill's interface is that UDFs are very hard to unit test. (We all unit test our code before bolting it onto Drill, don't we? Good, I thought so.)

Debugging a UDF

The documentation explains how to create a project external to Drill to hold your UDF. This is certainly the form you want to use once your code works. But, to debug your UDF, and to look at the source code referenced here, we have to use an alternative structure temporarily.

Drill provides no API in the normal sense. Instead, Drill provides all sources (it is open source.) Drill assumes that each developer (of a UDF, or storage plugin, etc.) will use the sources needed for that project.

Drill also provides testing tools that we will want to use. But, because of the way Drill has been set up to work with Maven, those tools are available only if your code lives within Drill's java-exec package. (Yes, Drill could use some work to improve its API. Volunteers?)

So, for our function, we will create the following new Java package within java-exec: org.apache.drill.contrib.udfExample. Here is how:

  • Download and build Drill as explained in the documentation.
  • Using your favorite Git tool, create a new branch from master called udf-example.
  • Use mvn clean install to build Drill from sources.
  • Load Drill into your favorite IDE (IntelliJ or Eclipse.)
  • Within drill-java-exec, under src/main/java, create the org.apache.drill.contrib.udfExample package.
  • Within drill-java-exec, under src/test/java, also create the org.apache.drill.contrib.udfExample package.

Why have we done this? So we can now follow good test-driven development (TDD) practice and start with a test. Let's deviate from TDD a bit and create a test that passes using the test framework.

In Eclipse:

  • Select the test package we just created.
  • Choose New → JUnit Test Case.
  • Name: ExampleUdfTest.
  • Superclass: org.apache.drill.test.ClusterTest.
  • Click Finish.

You now have a blank test case. We need to do two things to get started.

First, we must put the Apache copyright at the top of the file. Just pick any other Java file in Drill and copy the copyright notice. (If you forget to do that, Drill's build will fail when next you build from Maven.)

Then, we need a magic bit of code that will start an embedded Drillbit for us. Later we may want to set config options as shown in org.apache.drill.test.ExampleTest, but for now we'll use the defaults:

public class ExampleUdfTest extends ClusterTest {
  
  @ClassRule
  public static final BaseDirTestWatcher dirTestWatcher = new BaseDirTestWatcher();

  @BeforeClass
  public static void setup() throws Exception {
    startCluster(ClusterFixture.builder(dirTestWatcher));
  }
}

(The need for the dirTestWatcher may be removed in an upcoming commit; you can use the one in a super class.)

Next, let's create a demo test:

  @Test
  public void demoTest() {
    String sql = "SELECT * FROM `cp`.`employee.json` LIMIT 3";
    client.queryBuilder().sql(sql).printCsv();
  }

Run this test as a JUnit test and verify that it does, in fact, print three lines of output. If so, you have verified that you have a working Drill environment. Also, we now have a handy fixture to use to exercise our UDF as we build it.

Testing the Declaration

Debugging UDFs can often feel like working with a black box. Sometimes things work and sometimes they don't. It can be hard to know where to look for the problem. One way to reduce the frustration is to test early and often. Doing so both verifies our code and builds our confidence that we are, in fact, on the right path.

Here we will test the annotation just created. This lets us look at the function the way Drill does.

  @Test
  public void testAnnotation() {
    Class<? extends DrillSimpleFunc> fnClass = SinFunction.class;
    FunctionTemplate fnDefn = fnClass.getAnnotation(FunctionTemplate.class);
    assertNotNull(fnDefn);
    assertEquals("sin", fnDefn.name());
    assertEquals(FunctionScope.SIMPLE, fnDefn.scope());
    assertEquals(NullHandling.NULL_IF_NULL, fnDefn.nulls());
  }

The code grabs the class we just created, fetches the annotation, and verifies the three values we set. You can use a variation on this theme to use your debugger (or print statements) to look at the annotation fields.
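If you want to experiment with the reflection calls used above without a Drill build on hand, here is a minimal sketch. The `Template` annotation, the `Scope` enum and `SinStandIn` are hypothetical stand-ins defined only for this example; they mimic the shape of a class-level annotation but are not Drill's types.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// All nested types are hypothetical stand-ins, defined here only so the
// example runs without Drill on the class path.
class AnnotationProbe {

  enum Scope { SIMPLE, POINT_AGGREGATE }

  // RUNTIME retention is what makes the annotation visible to getAnnotation().
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface Template {
    String name();
    Scope scope();
  }

  @Template(name = "sin", scope = Scope.SIMPLE)
  static class SinStandIn { }

  // Returns "name/scope" for an annotated class, or null if the
  // annotation is absent.
  static String describe(Class<?> cls) {
    Template t = cls.getAnnotation(Template.class);
    return t == null ? null : t.name() + "/" + t.scope();
  }

  public static void main(String[] args) {
    System.out.println(describe(SinStandIn.class));
    System.out.println(describe(String.class));
  }
}
```

A null result is the case to watch for: it means the annotation is missing or is not retained at runtime.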

Parameter and Return Value Types

In general, a UDF is a function of zero or more arguments that returns a single value:

y = sin(x)

Although Drill is schema-free at the level of input files, it turns out Drill is strongly typed internally. As a result, the arguments and return value above must have a declared type. (We'll see later how Drill matches types between your function and the Drill columns stored in value vectors.) So, we really need something more like:

double y = sin(double x)

Systems such as JavaScript or Groovy use introspection to get the information directly from Java.

Drill uses introspection, but with hints based on a set of Drill-defined annotations.
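To make "introspection with hints" concrete, here is a small sketch that recovers a typed signature from annotated fields. The `In` and `Out` annotations and `DoubleHolder` are hypothetical stand-ins invented for this example, and the sketch uses plain java.lang.reflect; how Drill gathers the same information internally may well differ.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

class SignatureProbe {

  // Hypothetical marker annotations for an input and an output field.
  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
  @interface In { }

  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
  @interface Out { }

  // Hypothetical holder type carrying a double.
  static class DoubleHolder { public double value; }

  static class SinStandIn {
    @In public DoubleHolder x;
    @Out public DoubleHolder out;
  }

  // Builds "ReturnType name(ParamType param, ...)" from the annotated fields.
  static String signature(String name, Class<?> cls) {
    List<String> params = new ArrayList<>();
    String result = "void";
    for (Field f : cls.getDeclaredFields()) {
      if (f.isAnnotationPresent(In.class)) {
        params.add(f.getType().getSimpleName() + " " + f.getName());
      } else if (f.isAnnotationPresent(Out.class)) {
        result = f.getType().getSimpleName();
      }
    }
    return result + " " + name + "(" + String.join(", ", params) + ")";
  }

  public static void main(String[] args) {
    System.out.println(signature("sin", SinStandIn.class));
  }
}
```

The point is only that the annotations carry the roles (input versus output) while the field types carry the data types.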

Argument and Return Semantics

The above may seem a bit odd: why are we declaring fields in the class to pass in values to a function? Two reasons.

First, the above structure is overkill for a true function such as this one, but is necessary when we look at aggregate functions.

Second, Drill generates Java code to call each function. Presumably this structure is simpler than a true function call because of the way that Drill optimizes function calls. (More on this topic later also.)

For now, let's just remember to use the argument structure.

Holders

Next we note that the arguments are something called a Float8Holder rather than a Java double. The reason for this is three-fold (which we will explore deeper later):

  • The holder structure is convenient for code generation.
  • The holders can store not just the value but also whether the value is null.
  • Some types (such as VARCHAR) require more than just a simple value.

Different holder types exist for each Drill data type and cardinality (nullable, non-nullable or repeated.) Here is the (abbreviated) definition of the Float8Holder:

public final class Float8Holder implements ValueHolder {

  public static final MajorType TYPE = Types.required(MinorType.FLOAT8);
  public static final int WIDTH = 8;

  public double value;

The class tells us Drill's internal notation for a required (that is, "non-nullable") FLOAT8. It tells us that the data values are 8 bytes wide. And, most importantly, it gives us the value as a Java double. (There are no getters or setters for the value; code generation does not use them.)

So, this looks pretty simple: we get our input value from x.value and we put our return value into out.value. Not quite as easy as using Java semantics, but not hard.
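The second bullet above (a holder can also record whether the value is null) is easier to see in code. The sketch below uses two hypothetical stand-in holders; the `isSet` flag, with 1 meaning "has a value" and 0 meaning NULL, is an assumption made for this example rather than something shown in the Float8Holder listing above.

```java
class HolderDemo {

  // Stand-in for a required (non-nullable) holder: just the value.
  static final class RequiredDouble {
    public double value;
  }

  // Stand-in for a nullable holder. Assumed convention: isSet == 1 means
  // a value is present, isSet == 0 means SQL NULL.
  static final class NullableDouble {
    public int isSet;
    public double value;
  }

  // Hand-written "NULL in, NULL out" behavior for a sin function.
  static void sinNullable(NullableDouble in, NullableDouble out) {
    if (in.isSet == 0) {
      out.isSet = 0;      // propagate NULL, leave value untouched
      return;
    }
    out.isSet = 1;
    out.value = Math.sin(in.value);
  }

  public static void main(String[] args) {
    NullableDouble in = new NullableDouble();
    NullableDouble out = new NullableDouble();
    sinNullable(in, out);
    System.out.println("isSet after NULL input: " + out.isSet);
  }
}
```

With NULL_IF_NULL declared on the function, we do not write this check ourselves, which is why our eval() can work with the simple required holders.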

Function Class Methods

Back when we asked the IDE to create our function class, it created two empty methods:

  @Override
  public void setup() { }

  @Override
  public void eval() { }

Because we are writing a simple function (one value in, one value out), we can ignore the setup() method for now. We will instead focus on the one that does the real work: eval(). Let's implement our sin function:

  @Override
  public void eval() {
    out.value = Math.sin(x.value);
  }

We cheated: we just let Java do the real work.

Test the Function

Next we go back to our test class and add a test for the function itself, calling it as Drill does (again, this is not really what Drill does, but hold onto that thought):

  @Test
  public void testFn() {
    SinFunction sinFn = new SinFunction();
    sinFn.setup();
    sinFn.x = new Float8Holder();
    sinFn.out = new Float8Holder();

    sinFn.x.value = Math.PI/2;
    sinFn.eval();
    assertEquals(1.0D, sinFn.out.value, 0.001D);
  }

Simplified Testing

The above is perfectly fine, but tedious. What if we want to test ten different values? To make life easier, we can add test-only methods:

import com.google.common.annotations.VisibleForTesting;
...
  @VisibleForTesting
  public static SinFunction instance() {
    SinFunction fn = new SinFunction();
    fn.x = new Float8Holder();
    fn.out = new Float8Holder();
    fn.setup();
    return fn;
  }
  
  @VisibleForTesting
  public double call(double x) {
    this.x.value = x;
    eval();
    return out.value;
  }

Our test now becomes much simpler:

  @Test
  public void testFn() {
    SinFunction sinFn = SinFunction.instance();

    assertEquals(0D, sinFn.call(0), 0.001D);
    assertEquals(1.0D, sinFn.call(Math.PI/2), 0.001D);
    assertEquals(0, sinFn.call(Math.PI), 0.001D);
    assertEquals(-1.0D, sinFn.call(3 * Math.PI/2), 0.001D);
  }

Much better: we can now easily test all interesting cases.

(If you are following along, you should now experience the beauty of this form of testing: we are always just seconds away from running our next test.)
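Once the instance() and call() helpers exist, a table of input and expected pairs scales better than one assertion per line. The following is a sketch only: `SinStandIn` and `DoubleHolder` are hypothetical stand-ins that copy the shape of the function class so the example runs without Drill.

```java
class SinTableTest {

  // Hypothetical stand-in for a holder.
  static final class DoubleHolder { public double value; }

  // Hypothetical stand-in for the function class, with the two test helpers.
  static final class SinStandIn {
    DoubleHolder x;
    DoubleHolder out;

    static SinStandIn instance() {
      SinStandIn fn = new SinStandIn();
      fn.x = new DoubleHolder();
      fn.out = new DoubleHolder();
      fn.setup();
      return fn;
    }

    void setup() { }

    void eval() { out.value = Math.sin(x.value); }

    double call(double arg) {
      x.value = arg;
      eval();
      return out.value;
    }
  }

  // Runs every {input, expected} row and returns the number of mismatches.
  static int failures(double[][] cases, double tolerance) {
    SinStandIn fn = SinStandIn.instance();
    int failed = 0;
    for (double[] c : cases) {
      if (Math.abs(fn.call(c[0]) - c[1]) > tolerance) {
        failed++;
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    double[][] cases = {
      {0, 0}, {Math.PI / 2, 1}, {Math.PI, 0}, {3 * Math.PI / 2, -1}
    };
    System.out.println("failures: " + failures(cases, 0.001));
  }
}
```

Adding a new case is then a one-line change to the table.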

Testing With Drill

The next step is to test the function with Drill itself. Because our code is within Drill, and we are using a test framework that starts the server, we need only add a test:

Troubleshooting Notes

The section above mentioned that Drill finds UDFs by their annotations and interface. This is only part of the picture. Drill performs a class scan to locate the classes, but scans only selected packages. To identify which, look for this line in the Drill log file:

13:11:07.064 [main] INFO  o.a.d.common.scanner.BuildTimeScan - 
Loaded prescanned packages [org.apache.drill.exec.store.mock,
org.apache.drill.common.logical, 
org.apache.drill.exec.expr,
...] 
from locations [file:/.../drill/exec/java-exec/target/classes/META-INF/drill-module-scan/registry.json,
...]

The above tells us that Drill looks in only selected packages. Thus, when we are experimenting with functions within Drill source code, we must use one of the above. We do not want to attempt to modify the registry.json files as they are very complex and appear to be generated. Drill provides no configuration options to extend the above programmatically. (As we will see, in a production system, we add our own packages to the list. But, we cannot use those mechanisms here.)

The best solution is to use the package org.apache.drill.exec.expr.contrib as a temporary location.
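Before hunting through the scanner itself, it can save time to check a class name against the package list copied from the log line. Here is a tiny sketch of such a check. It assumes that a listed package also covers its sub-packages (which is what using org.apache.drill.exec.expr.contrib relies on); the helper name is made up for this example.

```java
import java.util.Arrays;
import java.util.List;

class ScanCheck {

  // True if className lives in one of the listed packages or in a
  // sub-package of one (assumed prefix semantics).
  static boolean isInScannedPackage(String className, List<String> scannedPackages) {
    for (String pkg : scannedPackages) {
      if (className.startsWith(pkg + ".")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Package names copied from the log excerpt above (list abbreviated).
    List<String> scanned = Arrays.asList(
        "org.apache.drill.exec.store.mock",
        "org.apache.drill.common.logical",
        "org.apache.drill.exec.expr");
    System.out.println(isInScannedPackage(
        "org.apache.drill.exec.expr.contrib.SinFunction", scanned));
    System.out.println(isInScannedPackage(
        "org.apache.drill.contrib.udfExample.SinFunction", scanned));
  }
}
```

Note that the second class name, in the package we created at the start of this page, does not match any of the three packages shown in the excerpt.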

Debugging Class Path Scan Issues

Unfortunately, Drill does not provide logging for the class path scan mechanism. However, if you are using source code, you can insert your own debug code. The place to start is in ClassPathScanner:

    @Override
    public void scan(final Object cls) {
      final ClassFile classFile = (ClassFile)cls;
      System.out.println(classFile.getName()); // Add me

The result will be a long list of all the classes that Drill will scan for function definitions. Check that your class appears. If not, check the package name of the class against the list of scanned packages shown above.

It is also helpful to see which classes and functions are actually being loaded. Again, there is no logging, but you can insert debug statements in LocalFunctionRegistry:

  public List<String> validate(String jarName, ScanResult scanResult) {
    ...
    System.out.println("\n\n-- Function Classes\n\n"); // Add me
    for (AnnotatedClassDescriptor func : providerClasses) {
      System.out.println(func.getClassName()); // Add me
      ...
        for (String name : names) {
          String functionName = name.toLowerCase();
          System.out.println(functionName); // Add me

Forcing a Class Scan

Drill has a very clever mechanism to register functions that builds up a list of functions at build time. Unfortunately, that step is done only by Maven, not by your IDE. So, for your function to be seen by Drill, you must disable class path caching in your tests:

((Need example)).

Add Source to Class Path

Drill uses your function source, not the compiled class file, to run your function. You must ensure that your source code is on the class path, along with the class files. This is a particular problem in Eclipse. You will see an error such as:

((Need example))

To solve it:

  • Run → Debug configurations...
  • Pick your test configuration
  • Classpath tab
  • Select the User Entries node
  • Click Advanced...
  • Select Add Folders and click OK
  • Find the drill-java-exec project in the list
  • Expand this node and select src/main/java
  • Repeat and select target/generated-sources

Non-Annotated Fields Not Allowed

Suppose you were to create a field in your class:

public class Log2Function implements DrillSimpleFunc {
  private double LOG_2;

  public void setup() {
    LOG_2 = Math.log(2.0D);
  }

Both of the above will fail to read your function and will report the following error in the log:

org.apache.drill.exec.expr.udfExample.Log2Function
00:27:16.706 [main] WARN  org.reflections.Reflections - could not scan file /Users/paulrogers/git/drill/exec/java-exec/target/classes/org/apache/drill/exec/expr/udfExample/Log2Function.class with scanner AnnotationScanner
org.reflections.ReflectionsException: could not create class file from Log2Function.class
	at org.reflections.scanners.AbstractScanner.scan(AbstractScanner.java:30) ~[reflections-0.9.8.jar:na]
	at org.reflections.Reflections.scan(Reflections.java:217) [reflections-0.9.8.jar:na]
...
Caused by: java.lang.NullPointerException: null
	at org.apache.drill.common.scanner.ClassPathScanner$AnnotationScanner.getAnnotationDescriptors(ClassPathScanner.java:286) ~[classes/:na]
	at org.apache.drill.common.scanner.ClassPathScanner$AnnotationScanner.scan(ClassPathScanner.java:278) ~[classes/:na]
	at org.reflections.scanners.AbstractScanner.scan(AbstractScanner.java:28) ~[reflections-0.9.8.jar:na]

The cause is DRILL-xxxx, which triggers a null-pointer exception (NPE) if your class has a member variable that does not have an annotation. The NPE is thrown during the annotation scan (driven by the reflections library) on access to the unannotated field.

If you see this error, the likely reason is that you omitted the required @Workspace annotation. The correct form of the above code is:

public class Log2Function implements DrillSimpleFunc {
  @Workspace private double LOG_2;
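Because the failure shows up only as a scanner warning in the log, it may be worth catching unannotated fields in a unit test instead. The sketch below shows one way to do that with plain reflection. `Scratch` is a hypothetical stand-in for @Workspace, and the approach assumes the real annotations are retained at runtime so that reflection can see them.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

class FieldAudit {

  // Hypothetical stand-in for a field annotation such as @Workspace.
  @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
  @interface Scratch { }

  static class Good { @Scratch private double log2; }

  static class Bad { private double log2; }

  // Returns the names of declared fields that carry no annotation at all.
  static List<String> unannotatedFields(Class<?> cls) {
    List<String> names = new ArrayList<>();
    for (Field f : cls.getDeclaredFields()) {
      if (f.isSynthetic()) {
        continue;   // skip compiler-generated fields
      }
      if (f.getDeclaredAnnotations().length == 0) {
        names.add(f.getName());
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println("Good: " + unannotatedFields(Good.class));
    System.out.println("Bad:  " + unannotatedFields(Bad.class));
  }
}
```

In a real test you would assert that the returned list is empty for each of your function classes.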

Displaying Log Output when Debugging

((Show example used to capture logging.))

Static fields (Constants) Not Supported

The following compiles, but does not work:

public class Log2Function implements DrillSimpleFunc {
  @Workspace public static final double LOG_2 = Math.log(2.0D);

This is classic good programming: declare a constant to hold your special numbers. The above works just fine if you test the function outside of Drill. But, when run within Drill, the LOG_2 constant is not set; it defaults to 0, causing the function to return the wrong results.

Alternatives:

  • Place constants in a separate non-function class.
  • Put the values in-line in your code.
  • Use temporary variables in place of constants:

public class Log2Function implements DrillSimpleFunc {
  @Workspace private double LOG_2;

  public void setup() {
    LOG_2 = Math.log(2.0D);
  }
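For completeness, here is a self-contained sketch of the third alternative with an eval() added, so the whole setup-then-eval sequence can be exercised outside Drill. `Log2StandIn` and `DoubleHolder` are hypothetical stand-ins, and the Drill annotations are left out so that the code runs on its own. Skipping setup() in the sketch mimics the symptom described above: the divisor stays at its default of 0.

```java
class Log2Demo {

  // Hypothetical stand-in for a holder.
  static final class DoubleHolder { public double value; }

  // Hypothetical stand-in for the function class; in a real UDF the
  // log2 field would carry the @Workspace annotation.
  static final class Log2StandIn {
    DoubleHolder x = new DoubleHolder();
    DoubleHolder out = new DoubleHolder();
    private double log2;          // defaults to 0.0 until setup() runs

    void setup() { log2 = Math.log(2.0D); }

    void eval() { out.value = Math.log(x.value) / log2; }
  }

  // Evaluates log2(arg), optionally skipping setup() to show the failure.
  static double evalOnce(double arg, boolean callSetup) {
    Log2StandIn fn = new Log2StandIn();
    if (callSetup) {
      fn.setup();
    }
    fn.x.value = arg;
    fn.eval();
    return fn.out.value;
  }

  public static void main(String[] args) {
    System.out.println("with setup:    " + evalOnce(8.0, true));
    System.out.println("without setup: " + evalOnce(8.0, false));
  }
}
```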

No Same-Package References

You might decide that implementing code in a Drill function is too much of a hassle. Why not simply do the "real work" in a separate class?

public class FunctionImpl {
  private static final double LOG_2 = Math.log(2.0D);
  
  public static final double log2(double x) {
    return Math.log(x) / LOG_2;
  }
}

Then, the Drill function class need only be a thin wrapper:

@FunctionTemplate(
    name = "log2w",
    scope = FunctionScope.SIMPLE,
    nulls = NullHandling.NULL_IF_NULL)

public class Log2Wrapper implements DrillSimpleFunc {

  @Param public Float8Holder x;
  @Output public Float8Holder out;

  @Override
  public void setup() { }

  @Override
  public void eval() {
    out.value = FunctionImpl.log2(x.value);
  }
}

The problem is that Drill does not execute your code. Instead, Drill rewrites your source code. In so doing, Drill moves your code from the package that it was in into Drill's own package for generated code:

package org.apache.drill.exec.test.generated;

...

public class ProjectorGen0 {

    ...

    public void doSetup(FragmentContext context, RecordBatch incoming, RecordBatch outgoing)
        throws SchemaChangeException
    {
        ...
                 
Log2Wrapper_eval: {
    out.value = FunctionImpl.log2(x.value);
}
 
       ...

There is our source code, plunked down inside Drill's generated code. Our reference to FunctionImpl, which was originally in the same package as our code, is not available in Drill's package. The result is a runtime failure:

01:17:39.680 [25b0bf82-1dfc-1c92-8460-3bcb6db74f7c:frag:0:0] ERROR o.a.d.e.r.AbstractSingleRecordBatch - Failure during query
org.apache.drill.exec.exception.SchemaChangeException: Failure while attempting to load generated class
...
Caused by: org.apache.drill.exec.exception.ClassTransformationException: java.util.concurrent.ExecutionException: org.apache.drill.exec.exception.ClassTransformationException: Failure generating transformation classes for value: 

The error is accompanied by several hundred lines of further messages, including multiple copies of the generated code.
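To see why the simple name stops resolving, it helps to model the rewrite as plain text substitution. The following is a toy model only, not Drill's code generator: the `generate` helper and the `doEval` method name are made up for this example, while the package and class names are taken from the listing above.

```java
class InlineDemo {

  // Pastes the text of an eval() body into a class that lives in the
  // generated-code package. Nothing from the original file (its package
  // declaration, its imports) comes along for the ride.
  static String generate(String label, String evalBody) {
    return "package org.apache.drill.exec.test.generated;\n\n"
         + "public class ProjectorGen0 {\n"
         + "  public void doEval() {\n"
         + "    " + label + ": {\n"
         + "      " + evalBody + "\n"
         + "    }\n"
         + "  }\n"
         + "}\n";
  }

  public static void main(String[] args) {
    System.out.println(
        generate("Log2Wrapper_eval", "out.value = FunctionImpl.log2(x.value);"));
  }
}
```

In the printed result, FunctionImpl is a bare name inside org.apache.drill.exec.test.generated, where no class of that name exists, so compiling the generated source cannot succeed.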

No Imports

We might think the workaround is to put our implementations and wrappers in separate packages:

package org.apache.drill.exec.expr.udfExample;
...
public class Log2Wrapper implements DrillSimpleFunc {

With the implementation one level down:

package org.apache.drill.exec.expr.udfExample.impl;

public class FunctionImpl {
...

We would hope that Drill would copy the imports. But, we'd be wrong; Drill won't do so and the same error as the above will appear.

The only solution is to always use fully qualified class names for all classes except for those in the Java JDK or Drill:

  @Override
  public void eval() {
    out.value = org.apache.drill.exec.expr.udfExample.impl.FunctionImpl.log2(x.value);
  }

Presumably, all this extra work for the developer pays off in slightly faster runtime because, again presumably, Drill can generate better code than Java can (a highly dubious proposition, but there we have it.)