
Optimization of Filter Performance


Building on the work done to optimize sort performance, this page looks at optimizing filter performance using the same techniques.

Baseline

The test uses the same mock data generator (DRILL-5152, PR 708) to produce 50 million rows with one integer column each. Values are uniformly distributed across the Java int range. We use a WHERE clause to eliminate roughly half the rows:

SELECT id_i FROM `mock`.employee_50M WHERE id_i > 0

The table name encodes the data shape: the column suffix (_i) requests an integer column, and the table suffix (_50M) requests 50 million rows. As it turns out, using only 10 million rows produces run times too short to analyze reliably.

We use the test framework from DRILL-5126, PR 710 to run timed tests using Oracle Java 8 on a MacBook Pro with an Intel i5 processor. The mock data source produces batches of about 32K bytes (adjusted down from the default 256K), giving about 8K records per batch and about 6,100 batches passed to the filter. All tests were run from within Eclipse using the Run option (not Debug) against an embedded Drillbit. Parallelism is set to 1 so that we run only one fragment per query.
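For reference, a timed run with the DRILL-5126 test framework looks roughly like the sketch below. The builder and query methods shown are assumptions about that framework's API and may differ in detail.

    // Sketch only: assumes the DRILL-5126 ClusterFixture test framework.
    FixtureBuilder builder = ClusterFixture.builder()
        .maxParallelization(1);      // one fragment per query, as noted above
    try (ClusterFixture cluster = builder.build();
         ClientFixture client = cluster.clientFixture()) {
      String sql = "SELECT id_i FROM `mock`.employee_50M WHERE id_i > 0";
      client.queryBuilder().sql(sql).run();  // timing capture omitted here
    }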

The baseline runs the test five times, takes the average of all but the first, and prints details for the second-to-last:

Read  24992460 records in 6104 batches.
...
Avg run time: 1784
Op: 0 Screen
  Setup:   0 - 0%, 0%
  Process: 11 - 0%, 0%
  Wait:    26
Op: 1 Project
  Setup:   0 - 0%, 0%
  Process: 6 - 0%, 0%
Op: 2 SelectionVectorRemover
  Setup:   1 - 25%, 0%
  Process: 189 - 12%, 12%
Op: 3 Filter
  Setup:   3 - 75%, 0%
  Process: 575 - 38%, 38%
Op: 4 Scan
  Setup:   0 - 0%, 0%
  Process: 709 - 47%, 47%
Total:
  Setup:   4
  Process: 1490

In the above, times are in ms. The two percentages are the share within the category (setup or process) and the share of total run time; for example, the filter's 575 ms is 38% of the 1490 ms total process time. The average time is measured from the client. The setup + process total is the time spent executing the query's single fragment.

Observations:

  • The selection vector remover takes 12% of the run time.
  • Filter takes 38% of the run time.
  • We are already fairly efficient because the mock scan (the bare minimum data source, no disk I/O) takes 47% of the run time.

Easy First Steps

The first thing to do is to apply the lessons already learned: switch to "plain Java" code generation using the JDK compiler.
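For context, opting into plain Java generation looks roughly like the following in FilterRecordBatch. This is a sketch based on the DRILL-5052 plain-Java support; treat the exact method names as assumptions that may differ across Drill versions.

    // Sketch: mark the generated class as plain-Java capable and ask for
    // JDK compilation instead of Drill's default byte-code manipulation.
    final CodeGenerator<Filterer> codeGen = CodeGenerator.get(
        Filterer.TEMPLATE_DEFINITION2, context.getFunctionRegistry());
    codeGen.plainJavaCapable(true);  // template avoids byte-code-only tricks
    codeGen.preferPlainJava(true);   // compile the generated source with the JDK
    final Filterer filter = context.getImplementationClass(codeGen);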

Results:

Read  25003415 records in 6104 batches.
...
Avg run time: 1724
...
Op: 2 SelectionVectorRemover
  Setup:   1 - 25%, 0%
  Process: 178 - 12%, 12%
Op: 3 Filter
  Setup:   3 - 75%, 0%
  Process: 498 - 36%, 35%
Op: 4 Scan
  Setup:   0 - 0%, 0%
  Process: 690 - 49%, 49%
Total:
  Setup:   4
  Process: 1382

This got us a very small gain: about 60 ms, or about 3%. The more important reason for the switch is that it allows us to hand-optimize the filter code.

Generated Code

Next, let's use the "plain Java" feature mentioned above to capture the generated filter code (with extra fluff removed), along with the outer loop from the template:

public abstract class FilterTemplate2 implements Filterer {
...
  private void filterBatchNoSV(int recordCount) throws SchemaChangeException {
    int svIndex = 0;
    for (int i = 0; i < recordCount; i++) {
      if (doEval(i, 0)) {
        outgoingSelectionVector.setIndex(svIndex, (char) i);
        svIndex++;
      }
    }
    outgoingSelectionVector.setRecordCount(svIndex);
  }
...
}

public class FiltererGen0 extends FilterTemplate2 {

  public boolean doEval(int inIndex, int outIndex) throws SchemaChangeException {
    IntHolder out3 = new IntHolder();
    out3.value = vv0.getAccessor().get(inIndex);
    BitHolder out6 = new BitHolder();
    final BitHolder out = new BitHolder();
    IntHolder left = out3;
    IntHolder right = constant5;

    GCompareIntVsInt$GreaterThanIntVsInt_eval: {
      out.value = left.value > right.value ? 1 : 0;
    }
    out6 = out;
    return (out6.value == 1);
  }

  public void doSetup(FragmentContext context, RecordBatch incoming, RecordBatch outgoing)
      throws SchemaChangeException {
    int[] fieldIds1 = new int[1];
    fieldIds1[0] = 0;
    Object tmp2 = (incoming).getValueAccessorById(IntVector.class, fieldIds1).getValueVector();
    if (tmp2 == null) {
      throw new SchemaChangeException("Failure while loading vector vv0 with id: TypedFieldId [fieldIds=[0], remainder=null].");
    }
    vv0 = ((IntVector) tmp2);
    IntHolder out4 = new IntHolder();
    out4.value = 0;
    constant5 = out4;
    IntHolder right = constant5;  // dead store left over from code generation
  }
}

The above suggests we can play the same tricks as for the sort operator:

  • Replace vector code and holders with direct calls to the underlying buffers.
  • Combine the template filterBatchNoSV with the generated doEval methods.

Preparing the Experiment

We now create a hand-crafted version of the generated code by copying it into the Drill project and using it in place of the generated class in FilterRecordBatch:

//      final Filterer filter = context.getImplementationClass(codeGen);
      final Filterer filter = new FilterExp();

A quick sanity check shows no change in run time, as we'd expect: 1770 ms.
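The hand-crafted class itself is nothing more than the captured code pasted under a new name. A sketch (FilterExp is the name used in the snippet above; the method bodies start as verbatim copies):

    // Sketch: the experimental copy of the generated filter, ready to edit.
    public class FilterExp extends FilterTemplate2 {

      private IntVector vv0;        // vector for the id_i column
      private IntHolder constant5;  // the literal 0 from the WHERE clause

      // doSetup() and doEval() begin as verbatim copies of the FiltererGen0
      // code shown earlier; each experiment below edits them in place.
    }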

Remove Holders

The above code creates temporary "holder" objects. We saw evidence earlier that the JVM optimizes the holders away, and Drill's normal byte-code manipulation path performs "scalar replacement" to remove them as well. Still, let's try. The new body of doEval():

      return vv0.getAccessor().get(inIndex) > 0;

Results:

Read  24997276 records in 6104 batches.
Avg run time: 1795
...
Op: 2 SelectionVectorRemover
  Setup:   1 - 25%, 0%
  Process: 176 - 12%, 12%
Op: 3 Filter
  Setup:   3 - 75%, 0%
  Process: 540 - 37%, 37%
Op: 4 Scan
  Setup:   0 - 0%, 0%
  Process: 694 - 48%, 48%
Total:
  Setup:   4
  Process: 1427

Average run time is essentially unchanged (allowing for noise). This suggests that the JVM does, in fact, perform scalar replacement. No savings, but it puts us in a better position for the next trick: working directly with the data buffer.

Directly Access the Data Buffer

The next step is to access the value vector's memory directly, cutting out the middlemen of the value vector, the accessor, and the DrillBuf, and removing six function calls (counting those involved in bounds checking).

    private long addr0;

    public void doSetup(...) {
      // Capture the raw address of the vector's data buffer once per batch.
      addr0 = vv0.getBuffer().memoryAddress();
      ...
    }

    public boolean doEval(int inIndex, int outIndex) {
      // Read the int directly from off-heap memory: 4 bytes per value.
      int value = PlatformDependent.getInt(addr0 + (inIndex << 2));
      return value > 0;
    }

This saves about 170 ms, or 9% overall. Time in the filter itself drops by 39%. (The filter time is measured from the one specific run shown; the overall run time is an average.)

Read  24996095 records in 6104 batches.
Avg run time: 1619
...
Op: 2 SelectionVectorRemover
  Setup:   1 - 25%, 0%
  Process: 201 - 15%, 15%
Op: 3 Filter
  Setup:   3 - 75%, 0%
  Process: 330 - 25%, 25%
Op: 4 Scan
  Setup:   0 - 0%, 0%
  Process: 755 - 57%, 57%
Total:
  Setup:   4
  Process: 1307

Combine Evaluation and Loop

The above code still calls doEval() for every value: all 50 million of them. Let's move the loop into the (hand-crafted) generated code:

      for (int i = 0; i < recordCount; i++) {
        int value = PlatformDependent.getInt(addr0 + (i << 2));
        if (value > 0) {
          outgoingSelectionVector.setIndex(svIndex, (char) i);
          svIndex++;
        }
      }
      outgoingSelectionVector.setRecordCount(svIndex);

Surprisingly, this saves no time. On reflection, it is understandable: the doEval() method is simple enough that the JVM inlines it automatically, so doing the merge by hand adds nothing. Note, however, that in production code we must still combine the two functions: the JVM can inline this call only because the template ever sees a single implementation; the optimization will likely fail once multiple subclasses are in use.

Read  25001261 records in 6104 batches.
Avg run time: 1611
...
Op: 3 Filter
  Setup:   3 - 75%, 0%
  Process: 324 - 24%, 24%

Direct Access to Selection Vector Memory

We can play the same trick with the selection vector as with the data vector: work directly with the memory backing the selection vector.

    protected void filterBatchNoSV(int recordCount) throws SchemaChangeException {
//      long addrSV = outgoingSelectionVector.getBuffer().memoryAddress();
      long addrSV = outgoingSelectionVector.getDataAddr();
      int svIndex = 0;
      for (int i = 0; i < recordCount; i++) {
        if (doEval(i, 0)) {
          // Write the 2-byte selection index directly into the SV2 buffer.
          PlatformDependent.putShort(addrSV + (svIndex << 1), (short) i);
          svIndex++;
        }
      }
      outgoingSelectionVector.setRecordCount(svIndex);
    }

Note a "gotcha" in the above. For value vectors, getBuffer() simply returns the DrillBuf. But, for some silly reason, the same method on selection vectors clears the buffer before returning it, hence the getDataAddr() method used instead. Fun for the whole family tracking down that error...
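For the curious, the trap and the workaround look roughly like this inside SelectionVector2 (paraphrased; getDataAddr() is the accessor added for this experiment, not a stock method):

    // Paraphrased: the no-arg getBuffer() clears the vector's current data
    // before returning the underlying DrillBuf.
    public DrillBuf getBuffer() {
      return getBuffer(true);        // true means "clear existing data"
    }

    // The workaround: expose the raw buffer address without clearing.
    public long getDataAddr() {
      return buffer.memoryAddress(); // address of the SV2's backing memory
    }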

The result: run time drops another 121 ms, or 8%.

Read  24996154 records in 6104 batches.
Avg run time: 1490
...
Op: 2 SelectionVectorRemover
  Setup:   1 - 25%, 0%
  Process: 182 - 15%, 15%
Op: 3 Filter
  Setup:   3 - 75%, 0%
  Process: 263 - 22%, 22%
Op: 4 Scan
  Setup:   0 - 0%, 0%
  Process: 714 - 60%, 60%
Total:
  Setup:   4
  Process: 1175

Final Tweaks

The above was done with the doEval() method still in place. Let's again merge in the doEval() code and make a number of other minor optimizations:

    @Override
    protected void filterBatchNoSV(int recordCount) throws SchemaChangeException {
      long addr0 = vv0.getBuffer().memoryAddress();
      long addrSV = outgoingSelectionVector.getDataAddr();
      int svIndex = 0;
      for(int i = 0; i < recordCount; i++) {
        int value = PlatformDependent.getInt(addr0 + (i<<2));
        if(value > 0) {
          PlatformDependent.putShort(addrSV + (svIndex<<1), (short) i);
          svIndex++;
        }
      }
      outgoingSelectionVector.setRecordCount(svIndex);
    }

This final change provides no real benefit:

Read  25000601 records in 6104 batches.
Avg run time: 1653
...
Op: 2 SelectionVectorRemover
  Setup:   1 - 33%, 0%
  Process: 214 - 15%, 15%
Op: 3 Filter
  Setup:   2 - 66%, 0%
  Process: 273 - 20%, 20%
Op: 4 Scan
  Setup:   0 - 0%, 0%
  Process: 825 - 61%, 61%
Total:
  Setup:   3
  Process: 1338

The lack of benefit is likely, again, because the JVM had already inlined the doEval() method. Experiments replacing shifts with additions also showed no benefit: CPUs are quite good at shifts. Such are the joys of performance tuning. Still, for the reasons cited above, the final code is what we want the code generator to produce.
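For reference, the shifts-versus-additions experiment amounted to keeping a running address rather than recomputing the offset on each iteration. A sketch, not necessarily the exact code tried:

      // Sketch: replace addr0 + (i << 2) with a pointer-style running address.
      long addr = addr0;             // start of the int data
      for (int i = 0; i < recordCount; i++, addr += 4) {
        int value = PlatformDependent.getInt(addr);
        if (value > 0) {
          PlatformDependent.putShort(addrSV + (svIndex << 1), (short) i);
          svIndex++;
        }
      }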

Conclusion

The above provided the following savings:

  • Original run time: 1784 ms.
  • Best new run time: 1490 ms.
  • Overall savings: ~300 ms, or about 17%.

The filter itself showed a greater improvement:

  • Original filter time: 575 ms.
  • Best new filter time: 263 ms.
  • Overall savings: ~310 ms, or 54%.

The filter's share of run time dropped from 38% (of the original total) to 20% (of the smaller total).

This did, however, push the selection vector remover from 12% to 15% of run time, so that operator might benefit from its own round of optimization.

The test was run on the simplest possible query: rows containing only a single required integer. Nullable ints would show greater savings because of the extra level of indirection (first check the null bit vector, then the data vector). Compound conditions (x > 10 AND x < 20) would also show greater benefit, since more time would be spent in the generated code relative to the rest of query execution.
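To see why, here is what the direct-access test might look like for a nullable int, where Drill keeps the null flags in a separate one-byte-per-value "bits" vector. This is a hypothetical sketch: addrBits and addrData stand for addresses captured in doSetup().

      // Hypothetical nullable-int inner loop: two buffer reads per row.
      for (int i = 0; i < recordCount; i++) {
        // Bits vector: 1 = value present, 0 = SQL NULL (never matches).
        if (PlatformDependent.getByte(addrBits + i) == 1
            && PlatformDependent.getInt(addrData + (i << 2)) > 0) {
          PlatformDependent.putShort(addrSV + (svIndex << 1), (short) i);
          svIndex++;
        }
      }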

This experiment again suggests that we should look for opportunities to improve code generation across the board to eliminate unnecessary method calls.
