Feature/spark dataframes #646

novatechflow · 2025-12-15T15:05:40Z

Summary

Add Spark Dataset/DataFrame plumbing: Parquet source/sink flag, channel conversions, optimizer cost hints.
Document how to build dataset-backed pipelines (README.md, guides/spark-datasets.md).

Next steps / follow-ups

ML4All pipelines still emit/consume raw double[]/Double RDDs. We should extend them to use DatasetChannels once schema handling is in place.
Text/Object sources currently produce RDD channels. A Record-backed variant (or a conversion helper) would allow dataset output without extra user code.
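The "conversion helper" idea above could look roughly like the following. This is only a sketch under assumptions: `Record` here is a local stand-in rather than Wayang's own class, and `RecordConversions`, `wrap`, and `fromDelimited` are hypothetical names that do not exist in the PR.

```scala
import java.util.regex.Pattern

// Stand-in for Wayang's Record type so the sketch compiles on its own (assumption).
final case class Record(fields: Vector[Any])

// Hypothetical helper object; none of these names come from the PR.
object RecordConversions {

  // Wraps any single value (e.g. an element of an Object source) as a one-field record.
  def wrap[T](value: T): Record = Record(Vector(value))

  // Splits one delimited text line into a record of string fields.
  // The delimiter is quoted so characters such as '|' are not treated as regex syntax,
  // and the -1 limit keeps trailing empty fields.
  def fromDelimited(line: String, delimiter: Char = ','): Record =
    Record(line.split(Pattern.quote(delimiter.toString), -1).toVector)
}
```

With something like this mapped over a text source, the downstream operators would see `Record`s and could in principle be routed to a Dataset channel; whether that is done in a source variant or a helper is the open question in the follow-up.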

juripetersen · 2025-12-16T07:00:15Z

wayang-api/wayang-api-scala-java/src/main/scala/org/apache/wayang/api/DataQuanta.scala

+  def writeParquet(url: String, overwrite: Boolean = false)(implicit ev: Out =:= Record): Unit =
+    writeParquetJava(url, overwrite, preferDataset = false)
+
+  def writeParquetAsDataset(url: String, overwrite: Boolean = true)(implicit ev: Out =:= Record): Unit =


Would it be nicer for the API if writeParquet took a preferDataset parameter that defaults to false?
That would let us remove writeParquetAsDataset altogether.
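A minimal sketch of what the single-method shape might look like. `DataQuantaSketch`, `ParquetWrite`, and the stand-in `Record` are assumptions made so the example is self-contained; only the names `writeParquet`, `writeParquetJava`, `overwrite`, and `preferDataset` come from the diff, and the delegate here merely records the request instead of writing anything.

```scala
import scala.collection.mutable

// Stand-in types (assumptions), not the Wayang classes.
final case class Record(values: Vector[Any])
final case class ParquetWrite(url: String, overwrite: Boolean, preferDataset: Boolean)

class DataQuantaSketch[Out] {
  private val requests = mutable.ArrayBuffer.empty[ParquetWrite]

  // Exposes the recorded sink requests for inspection.
  def recorded: List[ParquetWrite] = requests.toList

  // Single entry point: the Dataset-backed path is opt-in through a defaulted flag.
  def writeParquet(url: String,
                   overwrite: Boolean = false,
                   preferDataset: Boolean = false)(implicit ev: Out =:= Record): Unit =
    writeParquetJava(url, overwrite, preferDataset)

  // Placeholder for the delegate called in the diff; it only logs the request.
  private def writeParquetJava(url: String, overwrite: Boolean, preferDataset: Boolean): Unit =
    requests += ParquetWrite(url, overwrite, preferDataset)
}
```

Callers would then write `dq.writeParquet(url)` for the existing behaviour and `dq.writeParquet(url, preferDataset = true)` for the Dataset path. One detail to settle when merging: in the diff, writeParquet defaults `overwrite` to false while writeParquetAsDataset defaults it to true, so a unified method has to pick one default.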

juripetersen · 2025-12-16T07:00:52Z

wayang-api/wayang-api-scala-java/src/main/scala/org/apache/wayang/api/JavaPlanBuilder.scala

+    * @param projection the projection, if any
+    * @return [[DataQuantaBuilder]] for the file
+    */
+  def readParquetAsDataset(url: String, projection: Array[String] = null): UnarySourceDataQuantaBuilder[UnarySourceDataQuantaBuilder[_, Record], Record] =


The same thought goes for this.

We could use readParquet with the flag here as well.

juripetersen · 2025-12-16T07:01:20Z

wayang-api/wayang-api-scala-java/src/main/scala/org/apache/wayang/api/PlanBuilder.scala

+   * @param projection the projection, if any
+   * @return [[DataQuanta]] of [[Record]]s backed by a Spark Dataset when executed on Spark
+   */
+  def readParquetAsDataset(url: String, projection: Array[String] = null): DataQuanta[Record] =


The same thought goes for this.

We could use readParquet with the flag here as well.
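For the read side, the flag-based variant might be sketched as below. `PlanBuilderSketch` and `ParquetSourceSpec` are hypothetical stand-ins that only describe the source that would be created; the actual signature of the existing readParquet is not shown in this diff, so the parameter list is an assumption modelled on readParquetAsDataset.

```scala
// Describes the source a builder would create; a stand-in (assumption), not a Wayang type.
final case class ParquetSourceSpec(url: String,
                                   projection: Option[Seq[String]],
                                   preferDataset: Boolean)

class PlanBuilderSketch {

  // One method instead of readParquet plus readParquetAsDataset.
  // `projection` keeps the null default used in the diff; null is normalised to None.
  def readParquet(url: String,
                  projection: Array[String] = null,
                  preferDataset: Boolean = false): ParquetSourceSpec =
    ParquetSourceSpec(url, Option(projection).map(_.toSeq), preferDataset)
}
```

A call such as `readParquet(url, preferDataset = true)` would replace `readParquetAsDataset(url)`. For JavaPlanBuilder, note that Scala default arguments are not usable from Java call sites, so Java users would have to pass all three arguments unless explicit overloads are provided.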

novatechflow · 2025-12-16

Good catch, and a good idea. It's a bit duplicated after some months of coding :)


novatechflow added 3 commits December 15, 2025 15:44

Spark DataFrames support / Optimizer load profiles

fabc628

Update readme / add documentation

5f42f3c

add license header

599508d

novatechflow requested a review from juripetersen December 15, 2025 15:06

novatechflow mentioned this pull request Dec 15, 2025

Enumeration is non-deterministic #634

Open

juripetersen reviewed Dec 16, 2025


Add Dataset flag to read/write Parquet APIs and update docs to reflect the unified interface.

7b5d3b1
