Adding support for CSV files #46
base: release/0.5
Conversation
 * Regular seek. Called once per offset (block).
 */
override def seek(offset: Long): Unit = {
  // @TODO - need to better optimize. not to re-init on every seek
So I guess this is only for random fetches, right? The performance will be impractical for batch...
But then again, why not just keep the reader open? Am I missing anything?
Currently, the requested use case is only fetches, but sure, I would like to solve for both. I started with just something that works.
In CSV we have only the offset; the sub-offset is almost always zero.
Let's say we want to read every other row: then we need to understand when it is better to just skip a row and when we want to "re-init" from a new offset.
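To make the skip-versus-re-init trade-off concrete, here is a minimal sketch, not the code in this PR. The class name `CsvOffsetReader`, the `maxSkipBytes` threshold, and the `reinits` counter are all hypothetical. It assumes every offset points at the start of a row and that rows end with `\n` with no quoted newlines. The reader stays open between seeks; a short forward gap is consumed by reading and discarding rows, and anything else reopens the stream at the requested offset.

```scala
import java.io.{BufferedInputStream, ByteArrayOutputStream, FileInputStream}
import java.nio.charset.StandardCharsets

// Hypothetical sketch: names and the threshold are assumptions, not this PR's API.
final class CsvOffsetReader(path: String, maxSkipBytes: Long) {
  private var in: BufferedInputStream = null
  private var pos: Long = 0L   // byte position of the next unread row
  var reinits: Int = 0         // how many times the stream was (re)opened

  // Expensive path: close the current stream and reopen at the given byte offset.
  private def reinit(offset: Long): Unit = {
    if (in != null) in.close()
    val fis = new FileInputStream(path)
    fis.getChannel.position(offset)
    in = new BufferedInputStream(fis)
    pos = offset
    reinits += 1
  }

  def seek(offset: Long): Unit = {
    val gap = offset - pos
    if (in == null || gap < 0 || gap > maxSkipBytes) {
      reinit(offset)
    } else {
      // Cheap path: the target is a short distance ahead, so discard rows until we reach it.
      while (pos < offset && readRow().isDefined) ()
      // Offsets are assumed to be row starts; if we overshot, fall back to re-init.
      if (pos != offset) reinit(offset)
    }
  }

  // Reads one row (without the trailing '\n'); None at end of file.
  def readRow(): Option[String] = {
    if (in == null) reinit(pos)
    val buf = new ByteArrayOutputStream()
    var b = in.read()
    if (b == -1) None
    else {
      while (b != -1 && b != '\n'.toInt) {
        buf.write(b)
        pos += 1
        b = in.read()
      }
      if (b != -1) pos += 1 // account for the consumed newline
      Some(new String(buf.toByteArray, StandardCharsets.UTF_8))
    }
  }

  def close(): Unit = if (in != null) in.close()
}
```

With a threshold of a few rows, reading every other row would take the cheap path on each seek, while a backward or far-forward seek would still re-init. The right threshold is an open question and would need measuring.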
OK, I got it... In the Parquet implementation we have basically the same thing, right? I mean just one offset, when the row group is 128MB or more.
Summary
Adding support for indexing CSV files.
How was it tested?
Added unit tests.