BH Easy Format Plugin
One of Drill's compelling features is the ability for users to create storage and format plugins. As with any product, the first cut of code focuses on making things work. The result is that first-cut APIs can be complex, and this is certainly true of the Drill storage and format plugin APIs. The developer must know Drill internals, including memory management, the planner, vectors, physical plans and more.
The original version of Drill attempted to address this challenge with something called the "easy format plugin". Format plugins apply to files. The Easy framework provides wrappers and defaults for most common operations.
Many of Drill's own readers use this framework. In particular, the CSV and JSON readers upgraded in this project are both Easy plugin implementations.
The project needed to upgrade the CSV and JSON readers, while leaving other readers unchanged. This leads to a set of requirements for revisions to the Easy plugin:
- Allow the Easy plugin to use the original "legacy" RecordReader-based readers.
- Allow newer plugins to use the new ManagedReader-based structure.
- Allow Easy format plugins to create readers incrementally rather than as an up-front list.
- Provide additional simplifications as needed to avoid redundant code.
The existing EasyFormat plugin probably evolved over time. At present, the constructor takes 11 arguments. Most would consider this a bit cumbersome. Since this project needed to create a new constructor (for the new implementation), adding more parameters seemed a bad idea.
As it turns out, most parameters simply provide configuration settings: does this plugin allow projection push down? What is the default name? What extensions are supported?
This leads to an easy simplification: pull the configuration options into a new class, EasyFormatConfig (nested inside the Easy format plugin class).
For backward compatibility, the existing constructor will build the EasyFormatConfig from arguments. New code can just build the EasyFormatConfig directly.
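As a rough illustration of the idea, the sketch below shows a config object that carries the settings, plus a legacy-style constructor that packs its arguments into that object. The class names (EasyFormatConfigSketch, EasyPluginSketch), the field names, and the shortened three-argument constructor are all hypothetical stand-ins chosen for this example; they are not Drill's actual signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the nested EasyFormatConfig class.
// Field names are illustrative; only the three settings mentioned in
// the text (default name, projection push-down, extensions) are modeled.
class EasyFormatConfigSketch {
    String defaultName;
    boolean supportsProjectPushdown;
    List<String> extensions = new ArrayList<>();
}

// Hypothetical, much-reduced plugin showing the two construction paths.
class EasyPluginSketch {
    private final EasyFormatConfigSketch config;

    // New-style constructor: the caller builds the config directly.
    EasyPluginSketch(EasyFormatConfigSketch config) {
        this.config = config;
    }

    // Legacy-style constructor (shortened here from the real, much longer
    // argument list): packs its arguments into a config, then delegates.
    EasyPluginSketch(String defaultName, boolean supportsProjectPushdown,
                     List<String> extensions) {
        this(toConfig(defaultName, supportsProjectPushdown, extensions));
    }

    private static EasyFormatConfigSketch toConfig(String defaultName,
            boolean supportsProjectPushdown, List<String> extensions) {
        EasyFormatConfigSketch c = new EasyFormatConfigSketch();
        c.defaultName = defaultName;
        c.supportsProjectPushdown = supportsProjectPushdown;
        c.extensions.addAll(extensions);
        return c;
    }

    EasyFormatConfigSketch config() {
        return config;
    }
}
```

Either path ends with the plugin holding a single config object, so adding a new setting later means adding a field rather than another constructor parameter.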
The legacy version of the Easy format plugin uses the original ScanRecordBatch. The revised version uses the scan operator and framework described here. How does the Easy format plugin handle both without complex if-statements? The answer is to factor the logic out into a helper class. The interface ScanBatchCreator defines the services. The ClassicScanBatchCreator class creates the original scan batch, while the ScanFrameworkCreator implementation creates the new version.
A method, scanBatchCreator(), returns one or the other. For backward compatibility, it returns the legacy version by default. Plugins that support the new framework return the new version instead, and use it to create and configure a scan operator with the proper framework and options.
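The shape of this factoring can be sketched as follows. This is a simplified model, not Drill's code: the interface and class names carry a Sketch suffix to mark them as hypothetical, buildScan() is an invented method, and plain strings stand in for the scan operators the real creators would build.

```java
// Hypothetical, simplified stand-in for the ScanBatchCreator interface.
// A real creator would build an operator; a String stands in for it here.
interface ScanBatchCreatorSketch {
    String buildScan(String fileName);
}

// Stand-in for ClassicScanBatchCreator: builds the original scan batch.
class ClassicCreatorSketch implements ScanBatchCreatorSketch {
    @Override
    public String buildScan(String fileName) {
        return "legacy scan batch for " + fileName;
    }
}

// Stand-in for ScanFrameworkCreator: builds the new scan operator.
class FrameworkCreatorSketch implements ScanBatchCreatorSketch {
    @Override
    public String buildScan(String fileName) {
        return "scan framework operator for " + fileName;
    }
}

// Hypothetical base plugin: the shared code path has no if-statements,
// it simply asks the creator to do the work.
class EasyPluginBaseSketch {
    // Returns the legacy creator by default, for backward compatibility.
    protected ScanBatchCreatorSketch scanBatchCreator() {
        return new ClassicCreatorSketch();
    }

    final String createScan(String fileName) {
        return scanBatchCreator().buildScan(fileName);
    }
}

// A plugin that opts in to the new framework overrides one method.
class NewStylePluginSketch extends EasyPluginBaseSketch {
    @Override
    protected ScanBatchCreatorSketch scanBatchCreator() {
        return new FrameworkCreatorSketch();
    }
}
```

Existing plugins that never override scanBatchCreator() keep their old behavior, which is the point of defaulting to the legacy creator.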
The legacy version of the scan operator requires an instance of RecordReader. The Easy plugin already provides the getRecordReader() method, which each plugin overrides to create its own custom reader.
The newer version requires an instance of ManagedReader. However, the readers are created on the fly via a helper class created when setting up the scan framework. So, there is no equivalent method in the Easy format plugin to create the reader. For plugins that use the new format, the getRecordReader() method is simply ignored.
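To make "created on the fly" concrete, here is a minimal sketch of a helper that hands out one reader at a time instead of building a list of readers up front. Everything in it is an assumption made for illustration: ManagedReaderSketch, ReaderFactorySketch, nextReader() and the created counter are hypothetical names, and the reader returns a string rather than filling vectors.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a ManagedReader; the real interface differs.
interface ManagedReaderSketch {
    String read();
}

// Hypothetical helper, set up once with the scan, that creates readers
// incrementally as the scan asks for them.
class ReaderFactorySketch {
    private final Iterator<String> files;
    int created = 0; // counts how many readers exist so far

    ReaderFactorySketch(List<String> files) {
        this.files = files.iterator();
    }

    // Returns a reader for the next file, or null when no files remain.
    ManagedReaderSketch nextReader() {
        if (!files.hasNext()) {
            return null;
        }
        String file = files.next();
        created++;
        return () -> "rows from " + file;
    }
}

class ReaderFactoryDemo {
    public static void main(String[] args) {
        ReaderFactorySketch factory =
            new ReaderFactorySketch(List.of("a.csv", "b.csv"));
        ManagedReaderSketch reader;
        // Each reader is created only when the previous one is finished.
        while ((reader = factory.nextReader()) != null) {
            System.out.println(reader.read());
        }
    }
}
```

Because the reader for a file does not exist until the scan reaches that file, a scan over many files never holds more than one reader's worth of state at a time.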
Since the Easy format plugin was not the focus of this project, the changes described above were designed to be as unintrusive as possible. Clearly it would be cleaner to split the Easy format plugin into three classes: an abstract base class, and base classes for the legacy and revised scan frameworks. This is left as an exercise for later, once we gain more experience with the new framework.
Also, the mechanism for creating the new framework is still a bit experimental. (First make it work, then make it fancy.) There are probably opportunities to streamline the process a bit.