Feature/spark dataframes #646
base: main
Conversation
```scala
def writeParquet(url: String, overwrite: Boolean = false)(implicit ev: Out =:= Record): Unit =
  writeParquetJava(url, overwrite, preferDataset = false)

def writeParquetAsDataset(url: String, overwrite: Boolean = true)(implicit ev: Out =:= Record): Unit =
```
Is it nicer for the API here if writeParquet gets a preferDataset parameter that is false by default?
This would allow removing writeParquetAsDataset altogether.
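A minimal sketch of what the merged method could look like, assuming writeParquetJava keeps the parameter list shown in the diff above (the preferDataset flag name just follows this suggestion):

```scala
// Sketch of the unified write API: one entry point with an opt-in flag,
// which would let writeParquetAsDataset be removed.
def writeParquet(url: String,
                 overwrite: Boolean = false,
                 preferDataset: Boolean = false)(implicit ev: Out =:= Record): Unit =
  writeParquetJava(url, overwrite, preferDataset)
```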
```scala
 * @param projection the projection, if any
 * @return [[DataQuantaBuilder]] for the file
 */
def readParquetAsDataset(url: String, projection: Array[String] = null): UnarySourceDataQuantaBuilder[UnarySourceDataQuantaBuilder[_, Record], Record] =
```
The same thought applies here: we could use readParquet with the flag as well.
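Sketched against the builder API, the flag could branch inside a single readParquet; the two *Impl helpers below are placeholder names for the existing method bodies, not real identifiers from this PR:

```scala
// Sketch only: folds readParquetAsDataset into readParquet via a flag.
def readParquet(url: String,
                projection: Array[String] = null,
                preferDataset: Boolean = false)
    : UnarySourceDataQuantaBuilder[UnarySourceDataQuantaBuilder[_, Record], Record] =
  if (preferDataset) readParquetAsDatasetImpl(url, projection) // placeholder name
  else readParquetRddImpl(url, projection)                     // placeholder name
```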
```scala
 * @param projection the projection, if any
 * @return [[DataQuanta]] of [[Record]]s backed by a Spark Dataset when executed on Spark
 */
def readParquetAsDataset(url: String, projection: Array[String] = null): DataQuanta[Record] =
```
The same thought applies here: we could use readParquet with the flag as well.
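From the caller's perspective, the unified entry point would then read like this (hypothetical call sites; planBuilder and the file path stand in for whatever the application already has):

```scala
// Hypothetical call sites after the merge: same method, the flag picks the backend.
val rddBacked: DataQuanta[Record] =
  planBuilder.readParquet("hdfs://data/events.parquet", projection = Array("id", "ts"))
val datasetBacked: DataQuanta[Record] =
  planBuilder.readParquet("hdfs://data/events.parquet",
                          projection = Array("id", "ts"),
                          preferDataset = true)
```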
Good catch - yeah, good idea. It's a bit duplicated after some months of coding :)
…t the unified interface.
Summary
- Docs updated (README.md, guides/spark-datasets.md).

Next steps / follow-ups
- Some operators still use double[] / Double RDDs. We should extend them to use DatasetChannels once schema handling is in place.