
BH Easy Format Plugin

Paul Rogers edited this page Jan 15, 2018 · 1 revision

One of Drill's compelling features is the ability for users to create storage and format plugins. As with any product, the first cut of code focuses on making things work. The result is that first-cut APIs can be complex, and this is certainly true of Drill's storage and format plugin APIs. The developer must know Drill internals, including memory management, the planner, vectors, physical plans, and more.

The original version of Drill attempted to address this challenge with something called the "easy format plugin". Format plugins apply to files. The Easy framework provides wrappers and defaults for most common operations.

Many of Drill's own readers use this framework. In particular, the CSV and JSON readers upgraded in this project are both Easy plugin implementations.

Requirements

The project needed to upgrade the CSV and JSON readers, while leaving other readers unchanged. This leads to a set of requirements for revisions to the Easy plugin:

  • Allow the Easy plugin to use the original "legacy" RecordReader-based readers.
  • Allow newer plugins to use the new ManagedReader-based structure.
  • Allow Easy format plugins to create readers incrementally rather than as an up-front list.
  • Provide additional simplifications as needed to avoid redundant code.

EasyFormatConfig

The existing EasyFormat plugin probably evolved over time. At present, the constructor takes 11 arguments. Most would consider this a bit cumbersome. Since this project needed to create a new constructor (for the new implementation), adding more parameters seemed a bad idea.

As it turns out, most parameters simply provide configuration settings: does this plugin allow projection push down? What is the default name? What extensions are supported?

This leads to an easy simplification: pull the configuration options into a new class, EasyFormatConfig (nested inside the Easy format plugin class).

For backward compatibility, the existing constructor will build the EasyFormatConfig from arguments. New code can just build the EasyFormatConfig directly.
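The idea can be sketched in a few lines of Java. This is only an illustration, not Drill's actual code: the class EasyPluginSketch, the three field names, and the shortened three-argument "legacy" constructor are all assumptions chosen to mirror the settings mentioned above (push-down support, default name, extensions).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the Easy format plugin class; not Drill's actual code.
class EasyPluginSketch {

  // Configuration holder nested inside the plugin class.
  // The field names are illustrative assumptions.
  static class EasyFormatConfig {
    String defaultName;
    boolean supportsProjectPushdown;
    List<String> extensions = Collections.emptyList();
  }

  private final EasyFormatConfig config;

  // New-style constructor: the caller builds the config directly.
  EasyPluginSketch(EasyFormatConfig config) {
    this.config = config;
  }

  // Legacy-style constructor, shortened to three arguments for the sketch.
  // It packs its arguments into a config and delegates to the new constructor.
  EasyPluginSketch(String defaultName, boolean supportsProjectPushdown,
                   List<String> extensions) {
    this(toConfig(defaultName, supportsProjectPushdown, extensions));
  }

  private static EasyFormatConfig toConfig(String defaultName,
      boolean supportsProjectPushdown, List<String> extensions) {
    EasyFormatConfig config = new EasyFormatConfig();
    config.defaultName = defaultName;
    config.supportsProjectPushdown = supportsProjectPushdown;
    config.extensions = extensions;
    return config;
  }

  EasyFormatConfig config() {
    return config;
  }

  public static void main(String[] args) {
    // Existing plugins keep calling the long constructor.
    EasyPluginSketch legacy =
        new EasyPluginSketch("text", true, Arrays.asList("csv", "tsv"));

    // New plugins fill in only the options they care about.
    EasyFormatConfig cfg = new EasyFormatConfig();
    cfg.defaultName = "json";
    cfg.extensions = Arrays.asList("json");
    EasyPluginSketch modern = new EasyPluginSketch(cfg);

    System.out.println(legacy.config().defaultName + " " + modern.config().defaultName);
  }
}
```

With this shape, adding a new option means adding a field to the config class rather than yet another constructor parameter.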

ScanBatchCreator

The legacy version of the Easy format plugin uses the original ScanRecordBatch. The revised version uses the scan operator and framework described elsewhere in this wiki. How does the Easy format plugin handle both without complex if-statements? The answer is to factor the logic out into a helper class. The interface ScanBatchCreator defines the services. The ClassicScanBatchCreator class creates the original scan batch, while the ScanFrameworkCreator implementation creates the new version.

A method, scanBatchCreator(), returns one or the other. For backward compatibility, it returns the legacy version by default. Plugins that support the new framework return the new version instead, and use it to create and configure a scan operator with the proper framework and options.
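A minimal sketch of this arrangement follows. The names ScanBatchCreator, ClassicScanBatchCreator, ScanFrameworkCreator and scanBatchCreator() come from the text; everything else is an assumption made for illustration, in particular the buildScan() method (which returns a description string instead of a real operator) and the EasyPlugin and NewStylePlugin classes.

```java
// Hypothetical sketch; the real Drill classes take planner and context arguments.
class ScanCreatorSketch {

  // Defines the service the plugin needs: build a scan.
  // buildScan() and its String result are placeholders, not Drill's signature.
  interface ScanBatchCreator {
    String buildScan();
  }

  // Creates the original scan batch.
  static class ClassicScanBatchCreator implements ScanBatchCreator {
    @Override
    public String buildScan() {
      return "legacy scan batch";
    }
  }

  // Creates and configures the new scan operator with its framework.
  static class ScanFrameworkCreator implements ScanBatchCreator {
    @Override
    public String buildScan() {
      return "scan operator with framework";
    }
  }

  // Stand-in for the Easy format plugin: legacy behavior by default.
  static class EasyPlugin {
    ScanBatchCreator scanBatchCreator() {
      return new ClassicScanBatchCreator();
    }

    // The plugin itself never checks which kind of scan it is building.
    String createScan() {
      return scanBatchCreator().buildScan();
    }
  }

  // A plugin that opts in to the new framework overrides a single method.
  static class NewStylePlugin extends EasyPlugin {
    @Override
    ScanBatchCreator scanBatchCreator() {
      return new ScanFrameworkCreator();
    }
  }

  public static void main(String[] args) {
    System.out.println(new EasyPlugin().createScan());
    System.out.println(new NewStylePlugin().createScan());
  }
}
```

The choice between old and new lives in one overridable method, so existing plugins need no changes and the shared plugin code contains no if-statements about the scan type.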

Create the Reader

The legacy version of the scan operator requires an instance of RecordReader. The Easy plugin already provides the getRecordReader() method, which each plugin overrides to create its own custom reader.

The newer version requires an instance of ManagedReader. However, the readers are created on the fly via a helper class created when setting up the scan framework. So, there is no equivalent method in the Easy format plugin to create the reader. For plugins that use the new framework, getRecordReader() is simply ignored.
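The difference between the two styles can be illustrated with a small sketch. The RecordReader and ManagedReader interfaces below are empty stand-ins for Drill's own (the single file() method is invented for the example), and the helper class ReaderFactory and its next() method are hypothetical names, not the framework's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReaderCreationSketch {

  // Stand-ins for Drill's reader interfaces; file() is invented for the sketch.
  interface RecordReader {
    String file();
  }

  interface ManagedReader {
    String file();
  }

  // Legacy style: every reader is created up front and handed over as a list.
  static List<RecordReader> legacyReaders(List<String> files) {
    List<RecordReader> readers = new ArrayList<>();
    for (String f : files) {
      readers.add(() -> f);
    }
    return readers;
  }

  // New style: a hypothetical helper that creates one reader at a time,
  // only when the scan asks for it.
  static class ReaderFactory {
    private final Iterator<String> files;
    int created;

    ReaderFactory(List<String> files) {
      this.files = files.iterator();
    }

    // Returns the next reader, or null when no files remain.
    ManagedReader next() {
      if (!files.hasNext()) {
        return null;
      }
      String f = files.next();
      created++;
      return () -> f;
    }
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("a.csv", "b.csv");
    System.out.println("legacy readers created up front: " + legacyReaders(files).size());

    ReaderFactory factory = new ReaderFactory(files);
    for (ManagedReader r = factory.next(); r != null; r = factory.next()) {
      System.out.println("created on demand for " + r.file());
    }
  }
}
```

Creating readers on demand means that no resources are tied up in readers for files the scan has not yet reached.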

Future Improvements

Since the Easy format plugin was not the focus of this project, the changes described above were designed to be as unintrusive as possible. Clearly it would be cleaner to split the Easy format plugin into three classes: an abstract base class, and base classes for the legacy and revised scan frameworks. This is left as an exercise for later, once we gain more experience with the new framework.

Also, the mechanism for creating the new framework is still a bit experimental. (First make it work, then make it fancy.) There are probably opportunities to streamline the process a bit.
