Optimization of Filter Performance
Building on the work done to optimize sort performance, this page looks at optimizing filter performance using the same techniques.
The test uses the same mock data generator (DRILL-5152, PR 708) to produce 50 million rows with one integer column each. Data is uniformly randomly distributed over the Java int range. We use a WHERE clause to eliminate roughly half the rows:
```sql
SELECT id_i FROM `mock`.employee_50M WHERE id_i > 0
```
The table name tells the mock data source to create an integer column (the `_i` suffix) in a table of 50 million rows (the `_50M` suffix). As it turns out, using only 10 million rows produces run times too short to analyze reliably.
We use the test framework from DRILL-5126, PR 710 to run timed tests using Oracle Java 8 on a MacBook Pro with an Intel i5 processor. The mock data source produces batches of about 32K bytes (adjusted down from the default 256K), which gives about 8K records per batch and about 6100 batches passed to the filter. All tests were run from within Eclipse using the Run option (not Debug) against an embedded Drillbit. Parallelism is set to 1 so that the query runs in a single fragment.
The baseline runs the test five times, takes the average of all but the first, and prints details for the second-to-last:
```text
Read 24992460 records in 6104 batches.
...
Avg run time: 1784
Op: 0 Screen
  Setup: 0 - 0%, 0%
  Process: 11 - 0%, 0%
  Wait: 26
Op: 1 Project
  Setup: 0 - 0%, 0%
  Process: 6 - 0%, 0%
Op: 2 SelectionVectorRemover
  Setup: 1 - 25%, 0%
  Process: 189 - 12%, 12%
Op: 3 Filter
  Setup: 3 - 75%, 0%
  Process: 575 - 38%, 38%
Op: 4 Scan
  Setup: 0 - 0%, 0%
  Process: 709 - 47%, 47%
Total:
  Setup: 4
  Process: 1490
```
In the above, times are in ms. The two percentages are the percentage of the category (setup or process) and the percentage of total run time. The average time is measured from the client; the setup + process total is the time spent executing the query's one fragment.
Observations:
- The selection vector remover takes 12% of the run time.
- Filter takes 38% of the run time.
- We are already fairly efficient because the mock scan (the bare minimum data source, no disk I/O) takes 47% of the run time.
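As a cross-check on how those percentages are derived, a small sketch computes them from the raw process times in the baseline profile. (The class and method names here are illustrative, not Drill code; the constants are copied from the profile above, and integer division reproduces the report's truncation.)

```java
public class ProfilePercent {
    // Total process time (ms) across all operators, from the baseline profile.
    static final int TOTAL_PROCESS = 1490;

    // Percentage of the process category, truncated as the report does.
    static int pctOfProcess(int opTimeMs) {
        return opTimeMs * 100 / TOTAL_PROCESS;
    }

    public static void main(String[] args) {
        System.out.println("Filter:     " + pctOfProcess(575) + "%"); // 38%
        System.out.println("SV remover: " + pctOfProcess(189) + "%"); // 12%
        System.out.println("Scan:       " + pctOfProcess(709) + "%"); // 47%
    }
}
```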
The first thing to do is to apply the lessons already learned. Let's switch to plain Java code generation with the JDK.
Results:
```text
Read 25003415 records in 6104 batches.
...
Avg run time: 1724
...
Op: 2 SelectionVectorRemover
  Setup: 1 - 25%, 0%
  Process: 178 - 12%, 12%
Op: 3 Filter
  Setup: 3 - 75%, 0%
  Process: 498 - 36%, 35%
Op: 4 Scan
  Setup: 0 - 0%, 0%
  Process: 690 - 49%, 49%
Total:
  Setup: 4
  Process: 1382
```
This got us a small gain: about 60 ms, or roughly 3%. The more important reason for the switch is that it allows us to hand-optimize the filter code.
Next, let's use the "plain Java" feature mentioned above to capture the generated filter code (with extra fluff removed), and the outer loop in the template:
```java
public abstract class FilterTemplate2 implements Filterer {
  ...
  private void filterBatchNoSV(int recordCount) throws SchemaChangeException {
    int svIndex = 0;
    for (int i = 0; i < recordCount; i++) {
      if (doEval(i, 0)) {
        outgoingSelectionVector.setIndex(svIndex, (char) i);
        svIndex++;
      }
    }
    outgoingSelectionVector.setRecordCount(svIndex);
  }
  ...
}

public class FiltererGen0 extends FilterTemplate2 {

  public boolean doEval(int inIndex, int outIndex) throws SchemaChangeException {
    IntHolder out3 = new IntHolder();
    out3.value = vv0.getAccessor().get(inIndex);
    BitHolder out6 = new BitHolder();
    final BitHolder out = new BitHolder();
    IntHolder left = out3;
    IntHolder right = constant5;
    GCompareIntVsInt$GreaterThanIntVsInt_eval: {
      out.value = left.value > right.value ? 1 : 0;
    }
    out6 = out;
    return (out6.value == 1);
  }

  public void doSetup(FragmentContext context, RecordBatch incoming, RecordBatch outgoing)
      throws SchemaChangeException {
    int[] fieldIds1 = new int[1];
    fieldIds1[0] = 0;
    Object tmp2 = (incoming).getValueAccessorById(IntVector.class, fieldIds1).getValueVector();
    if (tmp2 == null) {
      throw new SchemaChangeException("Failure while loading vector vv0 with id: TypedFieldId [fieldIds=[0], remainder=null].");
    }
    vv0 = ((IntVector) tmp2);
    IntHolder out4 = new IntHolder();
    out4.value = 0;
    constant5 = out4;
    IntHolder right = constant5;
  }
}
```
The above suggests we can play the same tricks as for the sort operator:
- Replace vector code and holders with direct calls to the underlying buffers.
- Combine the template `filterBatchNoSV` with the generated `doEval` method.
We now create a hand-crafted version of the generated code by copying the generated code into the Drill project and using it instead of the generated code in `FilterRecordBatch`:

```java
// final Filterer filter = context.getImplementationClass(codeGen);
final Filterer filter = new FilterExp();
```
A quick sanity check shows no change in run time, as we'd expect: 1770 ms.
The above code creates temporary "holder" objects. We saw earlier evidence that the JVM optimizes the holders away, and the normal Drill byte-code manipulation path performs "scalar replacement" to remove them. Still, let's try removing them by hand. The new code for `doEval()`:

```java
return vv0.getAccessor().get(inIndex) > 0;
```
Results:
```text
Read 24997276 records in 6104 batches.
Avg run time: 1795
...
Op: 2 SelectionVectorRemover
  Setup: 1 - 25%, 0%
  Process: 176 - 12%, 12%
Op: 3 Filter
  Setup: 3 - 75%, 0%
  Process: 540 - 37%, 37%
Op: 4 Scan
  Setup: 0 - 0%, 0%
  Process: 694 - 48%, 48%
Total:
  Setup: 4
  Process: 1427
```
Average run time is basically unchanged (allowing for noise). This suggests that Java does, in fact, do scalar replacement. No savings, but it puts us in a better position for the next trick: working directly with the data buffer.
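To see why scalar replacement makes the two forms equivalent, here is a standalone comparison. (This is a sketch: `IntHolder` here is a simplified stand-in for Drill's holder class, and a plain array stands in for the value vector.)

```java
public class HolderElision {
    // Simplified stand-in for Drill's IntHolder.
    static class IntHolder { int value; }

    // Stand-in for the value vector vv0.
    static final int[] vv0 = { 7, -3, 0, 42, -1 };

    // Holder-based form, shaped like the generated doEval().
    static boolean evalWithHolders(int inIndex) {
        IntHolder out3 = new IntHolder();
        out3.value = vv0[inIndex];
        IntHolder right = new IntHolder();  // plays the role of constant5
        right.value = 0;
        return out3.value > right.value;
    }

    // Direct form after removing the holders by hand; the JIT's
    // scalar replacement reduces the form above to this one.
    static boolean evalDirect(int inIndex) {
        return vv0[inIndex] > 0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < vv0.length; i++) {
            if (evalWithHolders(i) != evalDirect(i)) {
                throw new AssertionError("mismatch at " + i);
            }
        }
        System.out.println("forms agree"); // prints: forms agree
    }
}
```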
The next step is to directly access the value vector memory, cutting out the middlemen of the value vector, the accessor, and the `DrillBuf`, removing six function calls (counting those involved in bounds checking).
```java
private long addr0;

public void doSetup(...) {
  addr0 = vv0.getBuffer().memoryAddress();
  ...
}

public boolean doEval(int inIndex, int outIndex) {
  int value = PlatformDependent.getInt(addr0 + (inIndex << 2));
  return value > 0;
}
```
This saves about 170 ms, or 9% overall. Time in the filter itself drops by 39%. (The filter time is measured from the one specific run shown; the overall run time is an average.)
```text
Read 24996095 records in 6104 batches.
Avg run time: 1619
...
Op: 2 SelectionVectorRemover
  Setup: 1 - 25%, 0%
  Process: 201 - 15%, 15%
Op: 3 Filter
  Setup: 3 - 75%, 0%
  Process: 330 - 25%, 25%
Op: 4 Scan
  Setup: 0 - 0%, 0%
  Process: 755 - 57%, 57%
Total:
  Setup: 4
  Process: 1307
```
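Drill reaches the raw memory through Netty's `PlatformDependent`; the same offset arithmetic can be illustrated without Netty using a direct `ByteBuffer`. (A sketch: the buffer stands in for the vector's `DrillBuf`, and `i << 2` is the same 4-byte-int offset computation used in the hand-written `doEval()`.)

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DirectFilter {
    // Count values > 0, reading 4-byte ints at offset i << 2,
    // analogous to PlatformDependent.getInt(addr0 + (i << 2)).
    static int countPositive(ByteBuffer buf, int recordCount) {
        int count = 0;
        for (int i = 0; i < recordCount; i++) {
            int value = buf.getInt(i << 2);
            if (value > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] data = { 5, -2, 0, 9, -7, 1 };
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length * 4)
                                   .order(ByteOrder.nativeOrder());
        for (int i = 0; i < data.length; i++) {
            buf.putInt(i << 2, data[i]);
        }
        System.out.println(countPositive(buf, data.length)); // prints: 3
    }
}
```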
The hand-crafted code still calls `doEval()` for every value: all 50 million of them. Let's move the loop into the (hand-crafted) generated code:
```java
for (int i = 0; i < recordCount; i++) {
  int value = PlatformDependent.getInt(addr0 + (i << 2));
  if (value > 0) {
    outgoingSelectionVector.setIndex(svIndex, (char) i);
    svIndex++;
  }
}
```
Surprisingly, this saves no time. On reflection, that is understandable: the `doEval()` method is simple enough that the JVM can perform this optimization automatically, so doing it by hand adds nothing. Note, however, that in production code we must still combine the two functions: the JVM can optimize this call only because the template ever calls one implementation; the optimization will likely fail once multiple subclasses are in use.
```text
Read 25001261 records in 6104 batches.
Avg run time: 1611
...
Op: 3 Filter
  Setup: 3 - 75%, 0%
  Process: 324 - 24%, 24%
```
We can play the same trick with the selection vector as with the data vector: work directly with the memory backing the selection vector.
```java
protected void filterBatchNoSV(int recordCount) throws SchemaChangeException {
  // long addrSV = outgoingSelectionVector.getBuffer().memoryAddress();
  long addrSV = outgoingSelectionVector.getDataAddr();
  int svIndex = 0;
  for (int i = 0; i < recordCount; i++) {
    if (doEval(i, 0)) {
      PlatformDependent.putShort(addrSV + (svIndex << 1), (short) i);
      svIndex++;
    }
  }
}
```
Note a "gotcha" in the above. For value vectors, `getBuffer()` simply returns the `DrillBuf`. But, for some silly reason, the same method on selection vectors clears the buffer before returning it. Fun for the whole family tracking down that error...
The result: run time drops another 121 ms, or 8%.
```text
Read 24996154 records in 6104 batches.
Avg run time: 1490
...
Op: 2 SelectionVectorRemover
  Setup: 1 - 25%, 0%
  Process: 182 - 15%, 15%
Op: 3 Filter
  Setup: 3 - 75%, 0%
  Process: 263 - 22%, 22%
Op: 4 Scan
  Setup: 0 - 0%, 0%
  Process: 714 - 60%, 60%
Total:
  Setup: 4
  Process: 1175
```
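The two-byte selection-vector layout can be illustrated the same way. (A sketch: a direct `ByteBuffer` stands in for the selection vector's backing memory, with entries written at offset `svIndex << 1` exactly as `PlatformDependent.putShort` does against the raw address.)

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SelectionVectorSketch {
    // Write the index of every matching row as a 2-byte entry;
    // returns the number of selected rows.
    static int filter(int[] values, ByteBuffer sv) {
        int svIndex = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > 0) {
                sv.putShort(svIndex << 1, (short) i);
                svIndex++;
            }
        }
        return svIndex;
    }

    public static void main(String[] args) {
        int[] values = { 3, -1, 5, 0, 2 };
        ByteBuffer sv = ByteBuffer.allocateDirect(values.length * 2)
                                  .order(ByteOrder.nativeOrder());
        int count = filter(values, sv);
        for (int i = 0; i < count; i++) {
            System.out.print(sv.getShort(i << 1) + " "); // prints: 0 2 4
        }
    }
}
```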
The above was done with the `doEval()` method still in place. Let's again merge in the `doEval()` code and make a number of other minor optimizations:
```java
@Override
protected void filterBatchNoSV(int recordCount) throws SchemaChangeException {
  long addr0 = vv0.getBuffer().memoryAddress();
  long addrSV = outgoingSelectionVector.getDataAddr();
  int svIndex = 0;
  for (int i = 0; i < recordCount; i++) {
    int value = PlatformDependent.getInt(addr0 + (i << 2));
    if (value > 0) {
      PlatformDependent.putShort(addrSV + (svIndex << 1), (short) i);
      svIndex++;
    }
  }
  outgoingSelectionVector.setRecordCount(svIndex);
}
```
This final change provides no real benefit:
```text
Read 25000601 records in 6104 batches.
Avg run time: 1653
...
Op: 2 SelectionVectorRemover
  Setup: 1 - 33%, 0%
  Process: 214 - 15%, 15%
Op: 3 Filter
  Setup: 2 - 66%, 0%
  Process: 273 - 20%, 20%
Op: 4 Scan
  Setup: 0 - 0%, 0%
  Process: 825 - 61%, 61%
Total:
  Setup: 3
  Process: 1338
```
The lack of benefit is likely, again, because the JVM had already inlined the `doEval()` method. Experiments replacing shifts with additions also showed no benefit: CPUs are quite good at shifts. Such are the joys of performance tuning. Still, for the reasons cited above, the final code is what we want the code generator to produce.
The above provided the following savings:
- Original run time: 1784 ms.
- Best new run time: 1490 ms.
- Overall savings: ~300 ms, or about 17%.
The filter itself showed a greater improvement:
- Original filter time: 575 ms.
- Best new filter time: 263 ms.
- Overall savings: ~310 ms, or about 54%.
The filter's share dropped from 38% of the original run time to 20% of the (smaller) new run time.
This did, however, push the selection vector remover from 12% to 15% of run time and so might benefit from its own round of optimization.
The test was run on the simplest possible query: a row containing only a required integer. Nullable ints would show greater savings because of the extra level of indirection (first check the null-bit vector, then the data vector). Compound conditions (e.g., `x > 10 AND x < 20`) would also show greater benefit, as more time would be spent in the generated code relative to the rest of the query execution.
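That extra level of indirection for nullable ints can be sketched as follows. (A sketch: the `bits` array stands in for Drill's null-bit vector, one byte per value with 1 meaning "set", alongside a plain array for the data vector.)

```java
public class NullableFilter {
    // Required column: one check per value.
    static int countPositive(int[] values) {
        int count = 0;
        for (int v : values) {
            if (v > 0) count++;
        }
        return count;
    }

    // Nullable column: first consult the bits vector (1 = value set),
    // then the data vector -- the extra indirection described above.
    static int countPositiveNullable(byte[] bits, int[] values) {
        int count = 0;
        for (int i = 0; i < values.length; i++) {
            if (bits[i] == 1 && values[i] > 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] values = { 5, -2, 7, 9 };
        byte[] bits  = { 1,  1, 0, 1 };  // third value is NULL
        System.out.println(countPositive(values));               // prints: 3
        System.out.println(countPositiveNullable(bits, values)); // prints: 2
    }
}
```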
This experiment again suggests we may want to find the opportunity to improve code generation across the board to eliminate unnecessary method calls.