Conversation

@uzadude (Collaborator) commented Nov 2, 2021

Summary

Adding support for indexing CSV files.

How was it tested?

Added unit tests.

```scala
/**
 * Regular seek. Called once per offset (block).
 */
override def seek(offset: Long): Unit = {
  // @TODO - need to better optimize. not to re-init on every seek
```
Collaborator:

So I guess this is only for random fetches, right? The performance will be impractical for batch...

Collaborator:

But then again, why not just keep the reader open? Am I missing anything?

uzadude (Author):

Currently, the requested use case is only point fetches, but sure, I'd like to solve for both; I started with just something that works.
In CSV we only have the offset; the sub-offset is almost always zero.
Say we want to read every other row: then we need to figure out when it's cheaper to just skip a row versus re-initializing from a new offset. A rough sketch of that trade-off is below.
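A minimal sketch of the skip-vs-re-init heuristic being discussed, in Scala. Everything here is illustrative: `initReaderAt`, `skipRow`, `currentOffset`, and the byte threshold are assumptions for the sketch, not names or values from this PR.

```scala
// Sketch only, not this PR's implementation. A CSV-style seek that keeps
// the underlying reader open and skips rows for small forward jumps,
// re-initializing only on backward seeks or large jumps. All names and
// the threshold are illustrative assumptions.
class SeekSketch(initReaderAt: Long => Unit, // hypothetical: (re)open the reader at a byte offset
                 skipRow: () => Long) {      // hypothetical: skip one row, return the new byte offset
  private var currentOffset: Long = 0L
  private val ReinitThresholdBytes: Long = 1L << 20 // tune: row-skip cost vs. re-open cost

  def seek(offset: Long): Unit = {
    val delta = offset - currentOffset
    if (delta >= 0 && delta < ReinitThresholdBytes) {
      // Small forward jump: cheaper to skip rows on the already-open reader.
      while (currentOffset < offset) currentOffset = skipRow()
    } else {
      // Backward seek or a big jump: re-initialize at the new offset.
      initReaderAt(offset)
      currentOffset = offset
    }
  }
}
```

The threshold is the knob: below it, skipping rows on the open reader avoids the re-initialization cost; above it, or on any backward seek, re-opening at the target offset wins.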

Collaborator:

OK, I got it... In the Parquet implementation we have basically the same thing, right? I mean, just one offset when the row group is 128MB or more.
