DRILL-6820: Msgpack format reader #1500

Open · wants to merge 47 commits into base: master

Conversation

jcmcote (Contributor) commented Oct 11, 2018

Implementation of a msgpack format reader

  • schema learning
  • skip over malformed records
  • skip over invalid field names
  • skip over records not matching schema
  • writing msgpack has not yet been implemented

Implementation of a ZStandard codec

  • only decompression is implemented

jcmcote and others added 23 commits September 26, 2018 16:14
add support for msgpack extended types
fix issue with columns of INT which then encounter a BIGINT
fixed bug in reader (array of array)
new test case that requires a schema
added useSchema property to turn off schema utilization
after doing some performance tests I concluded that throwing exceptions
for 1 out of 10000 records had no significant impact on performance and
makes the code much easier to understand.

also consolidated the count reader and the record reader into a single
reader class that can either count records or actually read them. Again,
much easier to understand the code this way.
vdiravka (Member) commented

@jcmcote could you add the corresponding JIRA as a prefix in the title of the pull request? Refer to the format of other pull requests here: https://github.com/apache/drill/pulls

Your Name added 2 commits October 30, 2018 12:49
coercing values into target schema types
coercing values into target schema types
jcmcote changed the title from "Msgpack format reader" to "DRILL-6820: Msgpack format reader" on Oct 31, 2018
paul-rogers (Contributor) left a comment

Very cool addition. MsgPack, like Parquet, should provide the additional schema information to avoid the messy reality of JSON.

This is a partial review with an initial batch of comments as I learned the code. I will follow up with the remaining files, then probably take a deeper second pass.

import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.commons.logging.Log;

paul-rogers (Contributor):
Turns out Drill requires the use of Logback logging; the build should have complained about illegal imports of commons-logging.
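
For reference, here is a minimal sketch of the slf4j/Logback-style declaration Drill expects in place of the commons-logging import (the class name MsgpackRecordReader is illustrative):

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class MsgpackRecordReader {
    // Drill convention: one private static slf4j logger per class, backed by Logback
    private static final Logger logger = LoggerFactory.getLogger(MsgpackRecordReader.class);

    public void next() {
      logger.debug("Reading next msgpack record batch");
    }
  }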

* A {@link Compressor} based on the snappy compression algorithm.
* http://code.google.com/p/snappy/
*
* jccote !!!DID NOT TEST THIS CLASS, JUST RENAMED SNAPPY FOR ZSTD!!!

paul-rogers (Contributor):

Please identify the source of this class: GitHub URL or the like.

<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>

paul-rogers (Contributor):

Nit: maybe indent this block so it looks like:

  <build>
    <plugins>
      <plugin>
        <groupId>...

return parseErrorCount + runningRecordCount + recordCount + 1;
}

public void handleAndRaise(String suffix, Exception e) throws UserException {

paul-rogers (Contributor):

Here's where you'd use UserException and its builders rather than a custom version.
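
For illustration, a minimal sketch of what that could look like with Drill's UserException builders; the currentRecordNumberInFile() helper is hypothetical, standing in for the record-count arithmetic shown above:

  import org.apache.drill.common.exceptions.UserException;

  public void handleAndRaise(String suffix, Exception e) {
    // dataReadError() picks the appropriate error category; build(logger) both
    // logs the error and produces the exception carried back to the user.
    throw UserException.dataReadError(e)
        .message("Error parsing msgpack file: %s", suffix)
        .addContext("Record number", String.valueOf(currentRecordNumberInFile())) // hypothetical helper
        .build(logger);
  }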

valueWriterMap.put(ValueType.BOOLEAN, new BooleanValueWriter());
valueWriterMap.put(ValueType.STRING, new StringValueWriter());
valueWriterMap.put(ValueType.BINARY, new BinaryValueWriter());
valueWriterMap.put(ValueType.EXTENSION, new ExtensionValueWriter());

paul-rogers (Contributor):

Somewhat confused here; some comments may help.

Presumably, the file can contain any number of FLOAT fields, including 0. Each field needs its own state (reader, parser, whatever). Each has its own value vector. How does it work to have one "ValueWriter" per type that takes no parameters to say which field or vector is being worked on?

jcmcote (Contributor, Author):

I'll put more comments in the code. Basically this is a switch implemented using an EnumMap. In ComplexValueWriter I use this map to look up which class handles writing a given value type. Here's the line of code from the writeElement method:

  valueWriterMap.get(value.getValueType()).write(value, mapWriter, fieldName, listWriter, selection, schema);

So based on the value type I get the corresponding writer class to use.
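
To make the pattern concrete, here is a simplified sketch of that EnumMap dispatch (illustrative only; per the line quoted above, the real write signature also carries mapWriter, listWriter, selection, and schema):

  import java.util.EnumMap;
  import java.util.Map;
  import org.msgpack.value.Value;
  import org.msgpack.value.ValueType;

  interface ValueWriter {
    void write(Value value, String fieldName);
  }

  class ComplexValueWriter {
    // EnumMap acting as a switch: one stateless writer per msgpack value type;
    // the field/vector being written travels in the call arguments, not in the writer.
    private final Map<ValueType, ValueWriter> valueWriterMap = new EnumMap<>(ValueType.class);

    ComplexValueWriter() {
      valueWriterMap.put(ValueType.BOOLEAN, (value, field) -> { /* write to a bit vector */ });
      valueWriterMap.put(ValueType.STRING, (value, field) -> { /* write to a varchar vector */ });
      // ... one entry per ValueType
    }

    void writeElement(Value value, String fieldName) {
      // dispatch on the runtime msgpack type instead of a switch statement
      valueWriterMap.get(value.getValueType()).write(value, fieldName);
    }
  }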


//
//
// private void ensure(final int length) {

paul-rogers (Contributor):

Not clear that this reader is one that needs an off-heap work buffer.

vvysotskyi (Member) commented

@jcmcote, HADOOP-13578 added ZStandard compression to the Hadoop library. I think it would be better to rely on that existing, well-tested implementation instead of introducing a custom one.

jcmcote (Contributor, Author) commented Nov 6, 2018

Agreed. When will Drill pick up the new version of Hadoop? Is it a big deal to upgrade the version of Hadoop used?
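
For illustration, a minimal sketch of delegating to Hadoop's codec machinery once Drill is on a version that includes HADOOP-13578; it assumes ZStandardCodec is registered with the CompressionCodecFactory, which then maps the ".zst" extension to it:

  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;

  public class ZstdReadSketch {
    // Resolve the codec from the file extension and wrap the raw stream;
    // fall back to the raw stream for uncompressed files.
    public static InputStream open(FileSystem fs, Path file) throws IOException {
      CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
      CompressionCodec codec = factory.getCodec(file); // ".zst" -> ZStandardCodec
      InputStream raw = fs.open(file);
      return codec == null ? raw : codec.createInputStream(raw);
    }
  }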

vdiravka (Member) commented Nov 7, 2018

@jcmcote There is a Jira ticket for the Hadoop libraries version update: DRILL-6540.
There is an issue related to commons-logging; see the ticket for details.
My "work in progress" branch is also linked in the ticket.

vvysotskyi (Member) commented

@jcmcote, is it possible to split this pull request into two parts: keep only the changes connected with the Msgpack format reader here, and continue the work on compression codecs in a separate Jira once the Hadoop library upgrade is done?

jcmcote (Contributor, Author) commented Nov 7, 2018

@vvysotskyi Sure, I can split them up. That should be easy to do.

jcmcote (Contributor, Author) commented Jan 10, 2019

Hey @paul-rogers, I've made many code review fixes and improvements to the msgpack reader. Could you have another look at it? I would very much like to have it approved and made part of the main code base. Thanks!

arina-ielchiieva (Member) commented

@jcmcote taking into account that there is ongoing work to provide schema via a file (https://issues.apache.org/jira/browse/DRILL-6835), you might consider waiting for those changes to be published so you can use the common approach of reading and writing schema files.

jcmcote (Contributor, Author) commented Jan 10, 2019 via email

cgivre added the enhancement label (PRs that add new functionality to Drill) on Jan 24, 2019
cgivre (Contributor) commented Jul 21, 2019

Hi @jcmcote
Are you still interested in completing this PR? Recently, the enhanced vector format PRs were committed and could make this better and easier.

If you haven't seen it, here's a link to the tutorial by @paul-rogers: https://github.com/paul-rogers/drill/wiki/EVF-Tutorial-Row-Batch-Reader.

cgivre (Contributor) commented Sep 17, 2019

Hi @jcmcote Are you still interested in completing this PR?

Labels: enhancement (PRs that add new functionality to Drill)
6 participants