DRILL-6820: Msgpack format reader #1500
base: master
Conversation
add support for msgpack extended types; fix issue with columns of INT which then encounter a BIGINT
fixed bug in reader (array of array); new test case that requires a schema; added useSchema property to turn off schema utilization
After doing some performance tests, I concluded that throwing exceptions for 1 out of 10,000 records has no significant impact on performance and makes the code much easier to understand. Also consolidated the record reader and count reader into a single reader class that can do either a count or an actual read of records; again, the code is much easier to understand this way.
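As a hedged illustration of that consolidation (the class and method names below are hypothetical, not the PR's actual code), a single reader class can serve both modes by skipping value materialization when only a count is needed:

import java.util.Iterator;

// Hypothetical sketch: one reader that either counts records or fully
// materializes them, selected by a flag, instead of two reader classes.
class MsgpackRecordProcessor {
  private final boolean countOnly;

  MsgpackRecordProcessor(boolean countOnly) {
    this.countOnly = countOnly;
  }

  int processBatch(Iterator<byte[]> records, int maxRecords) {
    int count = 0;
    while (count < maxRecords && records.hasNext()) {
      byte[] record = records.next();
      if (!countOnly) {
        writeRecord(record); // decode msgpack and write into value vectors
      }
      count++;
    }
    return count;
  }

  private void writeRecord(byte[] record) {
    // Decoding and vector writes elided in this sketch.
  }
}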
@jcmcote could you add a corresponding JIRA as a prefix in the title of the pull request? Refer to the format of other pull requests here: https://github.com/apache/drill/pulls
coercing values into target schema types
Very cool addition. MsgPack, like Parquet, should provide the additional schema information to avoid the messy reality of JSON.
This is a partial review with an initial batch of comments as I learned the code. Will follow up with the remaining files, then probably take a deeper second pass.
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.commons.logging.Log;
Turns out Drill requires the use of Logback logging; the build should have complained about illegal imports of commons-logging.
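For reference, a minimal sketch of the logger declaration Drill expects, using the slf4j API backed by Logback (the class name here is illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MsgpackRecordReader {
  // Drill convention: slf4j logger (Logback backend), not commons-logging.
  private static final Logger logger = LoggerFactory.getLogger(MsgpackRecordReader.class);
}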
* A {@link Compressor} based on the snappy compression algorithm.
* http://code.google.com/p/snappy/
*
* jccote !!!DID NOT TEST THIS CLASS, JUST RENAMED SNAPPY FOR ZSTD!!!
Please identify the source of this class: GitHub URL or the like.
contrib/format-msgpack/pom.xml
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
Nit: maybe indent this block so it looks like:
<build>
  <plugins>
    <plugin>
      <groupId>...
  return parseErrorCount + runningRecordCount + recordCount + 1;
}

public void handleAndRaise(String suffix, Exception e) throws UserException {
Here's where you'd use UserException and its builders rather than a custom version.
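For illustration, a hedged sketch of that suggestion; the message text and context values here are made up, not taken from the PR:

import org.apache.drill.common.exceptions.UserException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MsgpackErrorHandling {
  private static final Logger logger = LoggerFactory.getLogger(MsgpackErrorHandling.class);

  // Raise a data read error through Drill's UserException builder
  // instead of a custom exception type.
  public static void handleAndRaise(String suffix, Exception e, long recordNumber) {
    throw UserException.dataReadError(e)
        .message("Msgpack reader failed: %s", suffix)
        .addContext("Record number", String.valueOf(recordNumber))
        .build(logger);
  }
}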
valueWriterMap.put(ValueType.BOOLEAN, new BooleanValueWriter());
valueWriterMap.put(ValueType.STRING, new StringValueWriter());
valueWriterMap.put(ValueType.BINARY, new BinaryValueWriter());
valueWriterMap.put(ValueType.EXTENSION, new ExtensionValueWriter());
Somewhat confused here; some comments may help.
Presumably, the file can contain any number of FLOAT fields, including 0. Each field needs its own state (reader, parser, whatever). Each has its own value vector. How does it work to have one "ValueWriter" per type that takes no parameters to say which field or vector is being worked on?
I'll put more comments in the code. Basically, this is a switch implemented using an EnumMap. In the ComplexValueWriter I use this switch to look up which class will handle writing a value type. Here's the line of code from the writeElement method:
valueWriterMap.get(value.getValueType()).write(value, mapWriter, fieldName, listWriter, selection, schema);
So based on the value type I get the corresponding writer class to use.
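To make that concrete, here is a simplified sketch of the EnumMap dispatch; the ValueWriter interface is abbreviated (the PR's version also takes the map/list writers, selection, and schema):

import java.util.EnumMap;
import java.util.Map;
import org.msgpack.value.Value;
import org.msgpack.value.ValueType;

interface ValueWriter {
  void write(Value value, String fieldName);
}

class ComplexValueWriter {
  // One writer instance per msgpack value type. Per-field state travels
  // in the arguments, so a single stateless instance per type suffices.
  private final Map<ValueType, ValueWriter> valueWriterMap = new EnumMap<>(ValueType.class);

  void writeElement(Value value, String fieldName) {
    valueWriterMap.get(value.getValueType()).write(value, fieldName);
  }
}

This also answers the question above: the writers hold no per-field state; the field name (and, in the real code, the vector writers) are passed in on every call.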
//
//
// private void ensure(final int length) {
Not clear that this reader is one that needs an off-heap work buffer.
@jcmcote, HADOOP-13578 added ZStandard compression to the Hadoop library. I think it would be better to collaborate with the existing, well-tested implementation instead of introducing a custom one.
Agreed. When will Drill pick up the new version of Hadoop? Is upgrading the Hadoop version a big deal?
@jcmcote There is a Jira ticket for the Hadoop libs version update: DRILL-6540.
@jcmcote, is it possible to split this pull request into two parts: leave here only the changes connected with the Msgpack format reader, and continue work on the compression codecs in the scope of a separate Jira after the upgrade of the Hadoop library is done?
@vvysotskyi Sure, I can split them up. Should be easy to do.
refactored the extension mechanism
avoid creating objects and copying data into byte arrays; instead of decoding field names, do a map lookup on the raw bytes
it now uses a hashmap to avoid re-creating Strings for the map keys
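A hedged sketch of that optimization (the PR's actual code may differ): key the cache on the raw bytes so the UTF-8 decode and String allocation happen only once per distinct field name.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class FieldNameCache {
  // ByteBuffer's equals/hashCode compare content, so wrapping the raw
  // key bytes gives a content-based map key without decoding them.
  private final Map<ByteBuffer, String> names = new HashMap<>();

  String lookup(byte[] rawKey) {
    ByteBuffer key = ByteBuffer.wrap(rawKey);
    String name = names.get(key);
    if (name == null) {
      name = new String(rawKey, StandardCharsets.UTF_8);
      names.put(key, name);
    }
    return name;
  }
}

Note that if the reader reuses the underlying buffer, the bytes would need to be copied before being stored as a key.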
Hey @paul-rogers, I've made many code-review fixes and improvements to the msgpack reader. Could you have another look at it? I would very much like to have it approved and made part of the main code base. Thanks!
@jcmcote taking into account that there is ongoing work to provide schema using a file (https://issues.apache.org/jira/browse/DRILL-6835), you might consider waiting for those changes to be published so you can use the common approach of reading and writing schema files.
okay sounds good
Hi @jcmcote If you haven't seen this, here's a link to the tutorial by @paul-rogers: https://github.com/paul-rogers/drill/wiki/EVF-Tutorial-Row-Batch-Reader
Hi @jcmcote, are you still interested in completing this PR?
Implementation of a msgpack format reader
Implementation of a zstandard codec