EVF Tutorial Special Columns
The log reader handles two special columns:
- `_unmatched_rows`: Value of any unmatched row.
- `_raw`: The entire unparsed row.
These columns also introduce some interesting semantics:
- The special columns do not appear in a wildcard (`*`) query. The user must request them explicitly.
- If the `SELECT` clause contains no fields and does not contain the `_raw` field, then don't save the matched rows.
- If the `SELECT` clause contains the `_unmatched_rows` column, then save those rows, else skip them.
- However, if the `SELECT` clause contains nothing (because we're processing a `SELECT COUNT(*)` query), then do save the (empty) matched row.
The semantics here are unique to the log reader, but they do allow us to show how to use EVF to handle such odd cases. Let's start with the basics.
Prior logic had to fit these into the array of columns, which was a bit awkward. We can revise this logic to exploit the EVF. We simply always define our "special" columns, storing their writers directly:
private static final String RAW_LINE_COL_NAME = "_raw";
private static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows";
...
private ScalarWriter rawColWriter;
private ScalarWriter unmatchedColWriter;
...
private TupleMetadata defineSchema() {
  ...
  SchemaBuilder builder = new SchemaBuilder();
  ...
  builder.addNullable(RAW_LINE_COL_NAME, MinorType.VARCHAR);
  builder.addNullable(UNMATCHED_LINE_COL_NAME, MinorType.VARCHAR);
  TupleMetadata schema = builder.buildSchema();
  // Exclude special columns from wildcard expansion
  schema.metadata(RAW_LINE_COL_NAME).setBooleanProperty(
      ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
  schema.metadata(UNMATCHED_LINE_COL_NAME).setBooleanProperty(
      ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
  return schema;
}
Some things to note:
- We add the special columns to the schema builder after the regex columns. This ensures that the column indexes for the regex columns are the same in both the `columns` array and the schema.
- We always define the special columns, relying on EVF projection to materialize them only when needed.
- Once the schema is built, we retrieve the metadata for the special columns and set the `EXCLUDE_FROM_WILDCARD` property, which tells EVF not to include these columns in a wildcard query.
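To make the effect of the wildcard-exclusion property concrete, here is a minimal plain-Java model of wildcard expansion. It does not use Drill classes: `WildcardModel`, `Col`, `expandWildcard()`, and the field names `year` and `message` are hypothetical names invented for illustration, and EVF's real projection logic is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of wildcard expansion; an illustration, not Drill code. */
final class WildcardModel {

  /** A column name plus a flag standing in for the EXCLUDE_FROM_WILDCARD property. */
  static final class Col {
    final String name;
    final boolean excludeFromWildcard;

    Col(String name, boolean excludeFromWildcard) {
      this.name = name;
      this.excludeFromWildcard = excludeFromWildcard;
    }
  }

  /** Returns the names a SELECT * would project: every column not flagged as excluded. */
  static List<String> expandWildcard(List<Col> schema) {
    List<String> projected = new ArrayList<>();
    for (Col col : schema) {
      if (!col.excludeFromWildcard) {
        projected.add(col.name);
      }
    }
    return projected;
  }

  public static void main(String[] args) {
    // "year" and "message" are made-up regex fields.
    List<Col> schema = Arrays.asList(
        new Col("year", false),
        new Col("message", false),
        new Col("_raw", true),
        new Col("_unmatched_rows", true));
    System.out.println(expandWildcard(schema));
  }
}
```

In this model the special columns are always part of the schema, yet a wildcard sees only the regex fields. A query that names `_raw` explicitly would bypass the expansion step entirely.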
Next we must revise our draft `nextLine()` method to include the special columns. First, for the `_raw` column:
  if (lineMatcher.matches()) {
    rowWriter.start();
    rawColWriter.setString(line);
    loadVectors(lineMatcher);
    rowWriter.save();
  }
  ...
  rowWriter.start();
  unmatchedColWriter.setString(line);
  rowWriter.save();
  return true;
}
We leverage the "dummy" feature of writers: if a special column is not projected, writing to it is a no-op.
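The "dummy" idea can be illustrated with a small plain-Java model. This is only a sketch of the pattern, not EVF's implementation: `DummyWriterModel`, `ValueWriter`, `VectorWriter`, and `DummyWriter` are hypothetical names, and a list stands in for a value vector.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of projected vs. dummy column writers; an illustration, not Drill code. */
final class DummyWriterModel {

  /** Minimal stand-in for a scalar column writer. */
  interface ValueWriter {
    void setString(String value);
    boolean isProjected();
  }

  /** Writer for a projected column: values land in a backing list (the "vector"). */
  static final class VectorWriter implements ValueWriter {
    final List<String> values = new ArrayList<>();

    @Override
    public void setString(String value) {
      values.add(value);
    }

    @Override
    public boolean isProjected() {
      return true;
    }
  }

  /** Writer for an unprojected column: accepts the call and discards the value. */
  static final class DummyWriter implements ValueWriter {
    @Override
    public void setString(String value) {
      // Intentionally empty: there is no vector to write to.
    }

    @Override
    public boolean isProjected() {
      return false;
    }
  }

  /** Reader-side code writes unconditionally, with no projection check. */
  static void writeLine(ValueWriter rawCol, String line) {
    rawCol.setString(line);
  }
}
```

The point of the pattern is that `writeLine()` looks the same in both cases. The decision about whether a value is stored is made once, when the writer is created, rather than on every row.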
As an aside, we could have written the following instead:
writer.scalar(UNMATCHED_LINE_COL_NAME).setString(line);
The above form is handy if you must work with columns by name rather than position, for example if working with JSON. In the log reader, however, we cache the writer for a slight performance gain. Use whichever works best for your plugin.
The above code for `_unmatched_rows` is not quite right. According to the semantics identified earlier, we want to save the unmatched row only if the user requests it. This is easy enough to add: the `isProjected()` method on a writer tells us whether it has an actual materialized vector (it is projected) or is a dummy, unprojected column:
if (unmatchedColWriter.isProjected()) {
  rowWriter.start();
  unmatchedColWriter.setString(line);
  rowWriter.save();
}
The log reader has one more requirement: save matched rows only under the following conditions:
Fields | `_raw` | `_unmatched_rows` | Save Matched Row? |
---|---|---|---|
Yes | N/A | N/A | Yes |
No | Yes | N/A | Yes |
No | No | Yes | No |
No | No | No | Yes |
The first three columns indicate whether the query projects at least one field, the `_raw` column, or the `_unmatched_rows` column. The fourth column tells us whether to save the matched rows. The last row may be surprising: if nothing is projected at all, we must still start and save the matched rows so that we can count them.
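The table collapses to a single boolean expression, which the following sketch spells out and checks against each row. The class and method names (`SaveMatchedRule`, `saveMatchedRows()`) are illustrative only; the parameters correspond to the first three table columns.

```java
/** Encodes the save-matched-row table as one boolean expression; names are illustrative. */
final class SaveMatchedRule {

  static boolean saveMatchedRows(boolean anyFieldProjected,
                                 boolean rawProjected,
                                 boolean unmatchedProjected) {
    // Rows 1 and 2: a field or _raw is projected, so matched rows carry data.
    // Row 4: nothing is projected, so matched rows must still be counted.
    // Row 3 is the only "No": _unmatched_rows alone is projected.
    return anyFieldProjected || rawProjected || !unmatchedProjected;
  }

  public static void main(String[] args) {
    boolean[] options = {true, false};
    System.out.println("fields raw   unmatched -> save");
    for (boolean fields : options) {
      for (boolean raw : options) {
        for (boolean unmatched : options) {
          System.out.printf("%-6b %-5b %-9b -> %b%n",
              fields, raw, unmatched, saveMatchedRows(fields, raw, unmatched));
        }
      }
    }
  }
}
```

Running `main()` prints all eight input combinations, which also covers the "N/A" cells in the table: whenever a field or `_raw` is projected, the remaining inputs do not change the result.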
To record our decision, we add a flag, `saveMatchedRows`. The exact logic is specific to the log reader; we just want to see how we can use EVF features to implement our rules.
If we did not have these complex rules, we could find out whether we are in a `COUNT(*)` case by calling the `isProjectionEmpty()` method on the `SchemaNegotiator` passed to the batch reader `open()` method. When the projection is empty, we just need to pass along a row count; we don't need to actually write any values. This is useful to know if we have a shortcut for getting the row count, such as a file header or footer. See the `isProjectionEmpty()` Javadoc for details.
We can extend our `bindColumns()` function to use `isProjected()` to compute `saveMatchedRows`:
private void bindColumns(RowSetLoader writer) {
  for (int i = 0; i < capturingGroups; i++) {
    columns[i].bind(writer);
    saveMatchedRows |= columns[i].colWriter.isProjected();
  }
  rawColWriter = writer.scalar(RAW_LINE_COL_NAME);
  saveMatchedRows |= rawColWriter.isProjected();
  unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME);
  // If no match-case columns are projected, and the unmatched
  // column is unprojected, then we want to count (matched)
  // rows.
  saveMatchedRows |= !unmatchedColWriter.isProjected();
}
Finally, we can modify the `nextLine()` function to conditionally save matched lines:
if (saveMatchedRows) {
  rowWriter.start();
  rawColWriter.setString(line);
  loadVectors(lineMatcher);
  rowWriter.save();
}
At this point, the special columns should work, and we should save rows per our complex requirements.
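As a sanity check on the combined rules, the following plain-Java simulation mimics the row-saving decisions without any Drill dependencies. It is a sketch under stated assumptions: `LogReaderModel` and `read()` are hypothetical names, the regex and sample lines are made up, and any projected name other than the two special columns is treated as a regex field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Toy model of the log reader's row-saving rules; an illustration, not Drill code. */
final class LogReaderModel {
  static final String RAW = "_raw";
  static final String UNMATCHED = "_unmatched_rows";

  /**
   * Returns one entry per saved row, "matched" or "unmatched", for the
   * given set of projected column names.
   */
  static List<String> read(Pattern pattern, Set<String> projected, List<String> lines) {
    boolean rawProjected = projected.contains(RAW);
    boolean unmatchedProjected = projected.contains(UNMATCHED);
    int specialCount = (rawProjected ? 1 : 0) + (unmatchedProjected ? 1 : 0);
    boolean anyField = projected.size() > specialCount;

    // Same result that the tutorial's bindColumns() accumulates with |=.
    boolean saveMatchedRows = anyField || rawProjected || !unmatchedProjected;

    List<String> rows = new ArrayList<>();
    for (String line : lines) {
      if (pattern.matcher(line).matches()) {
        if (saveMatchedRows) {
          rows.add("matched");
        }
      } else if (unmatchedProjected) {
        rows.add("unmatched");
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    Pattern pattern = Pattern.compile("(\\d+) (.*)");
    List<String> lines = Arrays.asList("1 ok", "garbage", "2 ok");
    // Empty projection, as in SELECT COUNT(*): matched rows are still saved.
    System.out.println(read(pattern, Collections.<String>emptySet(), lines));
    // Only _unmatched_rows projected: matched rows are skipped.
    System.out.println(read(pattern, Collections.singleton(UNMATCHED), lines));
  }
}
```

With the sample input, the empty projection yields two matched rows (so the count is 2), while projecting only `_unmatched_rows` yields the single unmatched line.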
Next: Type Conversion