
Updates to "Learning Apache Drill"

Charles Givre and I wrote the book "Learning Apache Drill", published by O'Reilly and based on the then-current Drill 1.13 release. Since then, the amazing Drill team has continued to advance the project. Recorded below are a few of the most important changes since the book was published. See the official Drill release notes for details.

Table Schema

Chapter 8, "Data Engineering with Drill" explains how to work with messy data (which, unfortunately, often turns out to be what is actually available.) The chapter points out some caveats when working with Drill's "schema on read" approach to data files. Many of these deal with how Drill handles (or does not handle) ambiguities in your data files, and how you can work around these issues.

We are happy to report that the Drill community has responded with a wonderful new feature called (awkwardly) a "provided schema", created by the CREATE OR REPLACE SCHEMA command. Drill 1.16 introduced the feature, in experimental fashion, for the text (CSV) reader. Drill 1.17 promises to make the feature available more widely.

In a nutshell, a provided schema:

  • Allows you to create a special ".drill.schema" file in your table directory that tells Drill explicitly how to handle the schema for that table, resolving ambiguities that otherwise might occur.
  • Provides a solution for schema evolution: you can identify columns that appear in some (newer or older) files, but not in other (older or newer) files. Drill will automatically fill in a default value you provide.
  • Provides automatic type conversion from the text in a text file to one of Drill's other types. For structured fields (dates, times, etc.) you can provide a format to describe how the data appears in your file.
  • "Strict" or "lenient" mode that lets you tell Drill to either ignore all column except those in the schema ("strict" mode) or to start with the provided columns and add any additional columns that may appear in the file (the default "lenient" mode.)

The Drill provided schema works a bit like the Hive metastore (HMS) used by Hive, Impala and Presto (and optionally by Drill). With HMS, you must install and maintain an HMS server, and you must define a schema for every file before you query it. Drill's provided schema is optional, and is stored in the file system. This makes provided schemas lightweight; you add them only when they provide value. Otherwise, you simply point Drill at your file and run a query as in previous Drill versions.

Drill 1.17 will add additional features (currently available for preview if you build Drill from the master branch):

  • Table properties that let you customize storage plugin properties directly in the schema file.

Normally Drill uses file extensions to locate a storage plugin config, and uses properties of the storage plugin config to determine how to handle that particular file type. (See ???). Provided schema table properties add another level of control: you can override storage config properties. For example, you might have ".csv" files both with and without headers. In Drill 1.16 and before, you had to rename the files with headers to ".csvh" so that they would trigger the proper storage plugin config. Starting in Drill 1.17, you can leave the files named ".csv", but use provided schema table properties to tell Drill that these particular files have headers. You can do the same with all text file properties (delimiters, quote characters, comment character, etc.)
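
For example, to declare that the ".csv" files in one particular directory have headers, you could attach a text-format property to that table's provided schema. This is a sketch only: the property key is an assumption based on the 1.17-era text plugin, and the path and column are placeholders, so consult the Drill documentation for the exact names:

CREATE OR REPLACE SCHEMA (custName VARCHAR)
FOR TABLE dfs.data.`csv/withHeaders`              -- hypothetical path
PROPERTIES ('drill.text.extractHeader' = 'true'); -- assumed property key; verify in the docs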

Specific Revisions

Provided schema allows us to revise certain passages in Chapter 8.

File Format Variations

However, because CSV has so much variation, you might find a situation in which some of your CSV files have headers whereas others do not, yet both variations end with .csv.

The book describes how to create storage plugin configs for each variation. Starting in Drill 1.17, you can also use "provided schema" table properties to handle the variation.

The new rule is:

  • Drill uses the file extension to obtain the storage plugin config.
  • Drill obtains text format properties from the storage plugin config.
  • Drill will override text format properties with those defined specifically for that table using the CREATE OR REPLACE SCHEMA command.
  • Drill will override all of the above properties with those provided in the query if you use table properties within your query, as in the sketch below. (Yes, at present you cannot override the above properties; you replace all of them. This is seen more as a bug than a feature.)
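
As a sketch of query-level properties, Drill's table function syntax lets you set format properties for a single query (the file path here is hypothetical):

SELECT * FROM table(dfs.`data/cust.csv` (  -- hypothetical file path
    type => 'text',
    fieldDelimiter => ',',
    extractHeader => true));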

Schema Inference Overview

The Schema Inference Overview section explains how Drill works out the schema for each file within your table.

The conclusion is that Drill as a whole is very flexible in dealing with multiple schemas per table, but various operators (such as the sort operator discussed earlier) and clients (such as JDBC and ODBC) have inherent restrictions that limit your ability to exploit divergent schemas. If your tables have a single, consistent schema, everything just works—but if your schema varies or is ambiguous, you need to know some tricks to make things work. We explore those tricks a bit later.

One of the tricks you now have available is the provided schema, which resolves the schema ambiguities described in this section. Think of this section as an explanation of the issues you can now use provided schemas to solve.

At the time of this writing, the Drill community was beginning to discuss adding a schema mechanism to Drill.

Indeed, the mechanism hinted at in the above passage is now starting to become available.

Schema Inference for Delimited Data, CSV with header

In addition to the standard inference described in this section, the provided schema adds additional control. The provided schema is matched to the file schema via column names: the column name in the provided schema must match (ignoring case) the name that appears in the file header. When a match occurs, Drill uses the data type from the provided schema. If the column appears in the file, Drill converts the file data to the desired type. If the column does not appear in the file, Drill still uses the desired type, and fills the column with the optional default value.

Explicit projection

Drill allows you to request columns that don't actually appear in your table or file, which can occur if your table schema evolves over time. Prior to Drill 1.17, a "missing" column in the text reader was created with type nullable INT. Starting in Drill 1.17 (and in Drill 1.16 with exec.storage.enable_v3_text_reader set to true), Drill creates "missing" text columns with type non-nullable VARCHAR. When using a provided schema, Drill creates missing columns using the type provided in the schema, along with the default value defined in the schema.
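
To make this concrete, consider a query that names a column absent from the file (the path and the extra column c are hypothetical):

SELECT a, b, c FROM `csv/example.csv`;  -- c does not appear in the file

Through Drill 1.16 with the old reader, c comes back as a nullable INT filled with NULLs; in Drill 1.17 it comes back as a non-nullable VARCHAR filled with empty strings; with a provided schema that declares c, it comes back with the declared type and default value.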

Casts to specify types

The provided schema mechanism is an alternative to using explicit CAST operators in your query, or to defining a view that contains the casts. The provided schema performs the conversion at data read time, which (when supported in the JSON reader) can resolve type conflicts that are otherwise not resolvable with the CAST operator.
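
A quick sketch of the contrast (paths and columns are illustrative):

-- Per-query conversion, repeated in every query or hidden in a view:
SELECT custId, custName, CAST(balance AS DOUBLE) AS balance
FROM dfs.data.`csvh/cust.csvh`;

-- With a provided schema, the conversion is declared once and happens at read time:
CREATE OR REPLACE SCHEMA (balance DOUBLE)
FOR TABLE dfs.data.`csvh`;

SELECT custId, custName, balance FROM dfs.data.`csvh`;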

CSV without a header row

This section describes how Drill uses the `columns` array column to hold data when your CSV file has no headers. With provided schema support, you now have a new option: define your columns in the schema. When you do this, Drill no longer uses the `columns` column, but instead splits your data into individual columns, just as if your file had column headers. See DRILL-7279 for details.

Using the provided schema makes your queries easier to read because you use column names rather than indexes. It also makes your queries more robust: you don't have to change them if column order changes.
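
Concretely (the paths are illustrative; the column names follow the book's running example):

-- Without a schema: positional access through the columns array
SELECT columns[0] AS custId, columns[1] AS custName
FROM dfs.data.`csv/cust.csv`;

-- With a provided schema attached to the enclosing directory table:
SELECT custId, custName FROM dfs.data.`csv`;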

Schema Inference for JSON

Provided schema support is not (yet) available for JSON. Once it is, this section will be updated to show how to use the feature to resolve the schema issues discussed in this section.

Aligning Schemas Across Files

The provided schema mechanism is the preferred solution to align schemas across files. It is the only reliable way to share information across distributed readers so that all readers produce a consistent schema.

Errata in Chapter 8, Data Engineering with Drill

Last-minute changes to the example directory were, unfortunately, not reflected in the book text.

Page 138: In the storage plugin definition, the correct path is:

"/Users/arina/drillbook/data"

Query at the bottom of the page should be:

SELECT * FROM `local`.`data`.`csv/cust.csv`;

The semicolon is needed if you try the query in SQLLine.

Page 139: Top-most query: same change as above.

Page 141: Top-most query: same change as above.

Page 145: Query at the top of the page: to try this yourself, use a sample data file. For example:

SELECT t.a, t.b FROM (
  SELECT columns[0] AS a, columns[1] AS b
  FROM `local`.`data`.`csv/cust.csv`
  ORDER BY columns[0]) t
WHERE a < '124';

However, if you use the sample data, the plan you see will be simpler than the one in the book because there is not enough data to require an exchange. This means all the boxes will be a single color because they all run in a single fragment.

Pages 146-147: The csvh format is already defined if you reuse Drill's standard storage plugin configurations. Add the csvh format definition only if you create your own DFS-based storage plugin configuration.

Page 150: Semicolon missing from the "AVG" query at the bottom of the page.

Page 152: As of Drill 1.17, the sqlTypeOf() function is broken (DRILL-7499). Rather than the CHARACTER VARYING result shown in the book, you will see the following incorrect results:

+-----------+-----------+
| cols_type | cols_mode |
+-----------+-----------+
| ARRAY     | ARRAY     |
+-----------+-----------+

Page 152: Add `AS loyalty` to the end of the `CASE` expression in the bottom-most query:

SELECT custId, custName, balance,
    CASE WHEN loyalty IS NULL THEN '' ELSE loyalty END AS loyalty
FROM (
  SELECT columns[0] AS custId,
      columns[1] AS custName, 
      columns[2] AS balance,
      columns[3] AS loyalty
  FROM `csv/cust.csv`);

Page 158: The book refers to the file gen/70kmissing.json, but the file is not in the book's GitHub repo. Instead, you must generate the file:

export BOOK_HOME=/directory/of/book
cd $BOOK_HOME/gen/scripts
python gen70Kmissing.py
python gen70Kempty.py
python gen70Knull.py
python genLargeCsv.py
python partitions.py

You will now have the $BOOK_HOME/data/gen directory:

70kempty.json  70kmissing.json  70knulls.json  large.csv

And you will have the partitioned $BOOK_HOME/data/sales directory.

Page 157: Before starting the "Missing values" section, reset the all_text_mode session option:

ALTER SESSION RESET `store.json.all_text_mode`;

Page 158: Bottom query. The first line of the table shows that Drill 1.17 uses the NULL type for missing values, where Drill 1.14 used INT. The second line is incorrect in the book. In Drill 1.17, the table will appear as:

+-------+-----------+-------------------+----------+
|   a   |     b     |      b_type       |  b_mode  |
+-------+-----------+-------------------+----------+
| 1     | null      | NULL              | NULLABLE |
| 70001 | hi there! | CHARACTER VARYING | NULLABLE |
+-------+-----------+-------------------+----------+

Page 159: The bug mentioned in the text, DRILL-6359, is still not fixed in Drill 1.17.

Page 159: The query was based on an earlier version of the target file. To try this query:

cd $BOOK_HOME/data/gen
cp 70kmissing.json 70kdouble.json
vi 70kdouble.json

Change the last line to:

{a: 70001, b: 10.5}

Then, change the query to match:

SELECT a, CAST(b AS DOUBLE) AS b
FROM `gen/70kdouble.json`
WHERE b IS NOT NULL ORDER BY a;

Page 161: The files checked in to the book's GitHub repo already reflect the suggested file rename. Before trying the queries, restore the file to its original name:

cd $BOOK_HOME/json/missing
mv file3.json file1.json

The schema mismatch problem will appear randomly since Drill orders file scans randomly. You can see this by including the filename column in the query:

SELECT a, b, filename FROM `json/missing` ORDER BY a;
+---+------+------------+
| a |  b   |  filename  |
+---+------+------------+
| 1 | null | file1.json |
| 2 | foo  | file2.json |
+---+------+------------+

SELECT a, b, filename FROM `json/missing` ORDER BY a;
+---+------+------------+
| a |  b   |  filename  |
+---+------+------------+
| 1 | null | file3.json |
| 2 | foo  | file2.json |
+---+------+------------+

One way I found to force the error is to create a subdirectory and move file2.json there:

cd $BOOK_HOME/data/json/missing
mkdir sub
mv file2.json sub

Page 163: The sqlTypeOf() function reports the wrong name for a MAP in Drill 1.17 (see DRILL-7501):

+-----------+
| name_type |
+-----------+
| STRUCT    |
+-----------+

Page 167: On case-sensitive file systems (such as Linux), file names are case sensitive. The name 70Kempty.json should be 70kempty.json.

Writing a Format Plug-in

Chapter 12, Writing a Format Plug-in, explains how to write a plugin to read a custom data file format. The chapter uses the regex "log" reader as an example. Since the chapter was written, the Drill team has introduced significant new functionality to help you write better storage plugins. The information in the chapter still works, but is now considered a bit "old school." The same mechanisms that allow the text reader to support schemas (as described above) are available to you for your own plugins.

Much of this work is still being completed and will be final in Drill 1.17. You can find the tutorial for the new "Enhanced Vector Framework" (EVF) right here in this wiki. The tutorial section, in particular, is written from the perspective of someone who has read Chapter 12, has built a plugin using that material, and now wants to convert that plugin to use EVF. The tutorial also focuses on the regex plugin.

If you are writing a new plugin, you can follow along in both Chapter 12 and the tutorial to see how to use EVF. Not ideal, but the best available until Chapter 12 itself can be updated.
