@novatechflow
Member

Summary

  • Add Spark Dataset/DataFrame plumbing: Parquet source/sink flag, channel conversions, optimizer cost hints.
  • Document how to build dataset-backed pipelines (README.md, guides/spark-datasets.md).

Next steps / follow-ups

  • ML4All pipelines still emit/consume raw double[]/Double RDDs. We should extend them to use DatasetChannels once schema handling is in place.
  • Text/Object sources currently produce RDD channels. A Record-backed variant (or a conversion helper) would allow dataset output without extra user code.
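
As a rough illustration of the conversion helper mentioned above, the sketch below wraps text lines into records. `Record` here is a hypothetical stand-in case class rather than Wayang's actual type, and `TextToRecord`, `lineToRecord`, and `delimitedToRecord` are illustrative names that do not exist in this PR.

```scala
// Hypothetical stand-in for Wayang's Record type, so the sketch is self-contained.
final case class Record(values: Any*)

// Hypothetical helper; the names are illustrative only.
object TextToRecord {
  // Wrap a whole text line in a single-field record.
  def lineToRecord(line: String): Record = Record(line)

  // Split a delimited line into one trimmed field per column.
  def delimitedToRecord(line: String, delimiter: Char = ','): Record =
    Record(line.split(delimiter).map(_.trim).toIndexedSeq: _*)
}
```

A text source could then map each line through one of these functions before handing the result to a dataset-backed sink. Typed schemas are out of scope for this sketch.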

def writeParquet(url: String, overwrite: Boolean = false)(implicit ev: Out =:= Record): Unit =
  writeParquetJava(url, overwrite, preferDataset = false)

def writeParquetAsDataset(url: String, overwrite: Boolean = true)(implicit ev: Out =:= Record): Unit =
Contributor

Would the API be nicer if writeParquet took a preferDataset parameter that defaults to false?
That would allow removing writeParquetAsDataset altogether.
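
A minimal sketch of what that merged method could look like, assuming the existing `writeParquetJava(url, overwrite, preferDataset)` helper keeps its shape. `DataQuantaSketch`, the stand-in `Record` case class, and the `calls` log are hypothetical and exist only so the example runs on its own. Note that the current `writeParquetAsDataset` defaults `overwrite` to `true`, so callers relying on that default would have to pass it explicitly after the merge.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for Wayang's Record type.
final case class Record(values: Any*)

// Hypothetical miniature of DataQuanta; only the Parquet sink is modeled.
class DataQuantaSketch[Out] {
  // Records each sink request so the effect of the flags is observable.
  val calls: ListBuffer[String] = ListBuffer.empty

  // Single entry point: preferDataset defaults to false, matching today's writeParquet.
  def writeParquet(url: String, overwrite: Boolean = false, preferDataset: Boolean = false)
                  (implicit ev: Out =:= Record): Unit =
    writeParquetJava(url, overwrite, preferDataset)

  // Stand-in for the real sink wiring, which this sketch does not reproduce.
  private def writeParquetJava(url: String, overwrite: Boolean, preferDataset: Boolean): Unit =
    calls += s"url=$url overwrite=$overwrite preferDataset=$preferDataset"
}

object WriteParquetDemo extends App {
  val records = new DataQuantaSketch[Record]
  // Same behavior as the current writeParquet.
  records.writeParquet("file:///tmp/out.parquet")
  // Replaces writeParquetAsDataset; overwrite is now explicit.
  records.writeParquet("file:///tmp/out.parquet", overwrite = true, preferDataset = true)
  records.calls.foreach(println)
}
```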

* @param projection the projection, if any
* @return [[DataQuantaBuilder]] for the file
*/
def readParquetAsDataset(url: String, projection: Array[String] = null): UnarySourceDataQuantaBuilder[UnarySourceDataQuantaBuilder[_, Record], Record] =
Contributor

The same thought goes for this.

We could use readParquet with the flag here as well.

* @param projection the projection, if any
* @return [[DataQuanta]] of [[Record]]s backed by a Spark Dataset when executed on Spark
*/
def readParquetAsDataset(url: String, projection: Array[String] = null): DataQuanta[Record] =
Contributor

The same thought goes for this.

We could use readParquet with the flag here as well.

Member Author

Good catch, and a good idea. It's a bit duplicated after some months of coding :)
