Utility function to get a setup & cleanup function for mapping each partition #456

Open · wants to merge 13 commits into master
Conversation

@squito (Contributor) commented Feb 8, 2013

Often when mapping some RDD, you want to do a bit of setup before processing each partition, followed by cleanup at the end of the partition; this adds utility functions to make that easier.
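For concreteness, here is a minimal sketch of how the utility might be used, given an RDD[String] called lines (the PartitionMapper and mapWithSetupAndCleanup names appear in this PR's diffs; the exact member signatures are my assumption):

```scala
import spark.RDD.PartitionMapper

// Hypothetical usage: setup() runs before the first element of each partition,
// cleanup() after the last, and map() transforms each element in between.
val lengths = lines.mapWithSetupAndCleanup(new PartitionMapper[String, Int] {
  def setup() { println("before this partition's first element") }
  def map(s: String): Int = s.length
  def cleanup() { println("after this partition's last element") }
})
```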

I felt this is worth including because it's a little tricky to get right -- I needed to add a "CleanupIterator", and I have an example in the unit tests of how this fails without it. OTOH, I wasn't sure whether this necessarily belongs in the Spark API itself (e.g., do we also add a version of foreach with this?)
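To illustrate why this is tricky, here is a minimal sketch of what such a CleanupIterator has to do (an assumed shape, not the PR's exact code): cleanup must fire exactly once, and only after the underlying iterator is exhausted.

```scala
// Wraps an iterator and runs `cleanup` once, after the last element is consumed.
class CleanupIterator[T](sub: Iterator[T], cleanup: () => Unit) extends Iterator[T] {
  private var cleaned = false
  def hasNext: Boolean = {
    if (sub.hasNext) {
      true
    } else {
      if (!cleaned) { cleaned = true; cleanup() }  // fire exactly once, at exhaustion
      false
    }
  }
  def next(): T = sub.next()
}
```

Note the failure mode: if the consumer abandons the iterator early, or an exception is thrown mid-partition, cleanup never runs -- exactly the gap discussed later in this thread.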

We find it a useful addition, so we thought others might too.

@@ -4,6 +4,8 @@ import scala.collection.mutable.HashMap
import org.scalatest.FunSuite
import spark.SparkContext._
import spark.rdd.{CoalescedRDD, PartitionPruningRDD}
import spark.RDD.PartitionMapper
import collection._
Review comment from a Member on the diff above:

This style of import doesn't match the convention used in the rest of the codebase. It should probably be replaced with an import of scala.collection.mutable.Set, since that appears to be the only class that it's importing.
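That is, presumably the single needed import would be:

```scala
import scala.collection.mutable.Set
```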

@squito (Contributor, Author) commented Feb 8, 2013

I cleaned up the style issues -- sorry about that.
I left that "failing" test in there; happy to remove it if you want me to. I just wanted to check whether you'd like me to document the trouble with mapPartitions somewhere.

@stephenh (Contributor) commented Feb 9, 2013

Hi Imran,

Have you seen the TaskContext and addOnCompleteCallback? That is what HadoopRDD uses to close the FileStream after all of the lines in a Hadoop file have been read.

You might be able to do what you're doing with just a custom RDD that did something like:

override def compute(s: Split, context: TaskContext): Iterator[(T, U)] = {
  setupDbConnection()
  context.addOnCompleteCallback(() => tearDownDbConnection())
  // call parent rdd or do own compute stuff
}

I believe this will achieve the same thing, as compute will be called on each partition, and you'll have start/stop hooks around the execution on each partition.

@mateiz (Member) commented Feb 9, 2013

I agree with Stephen here. The addOnCompleteCallback mechanism also makes sure to call your handler if the task throws an exception, which is important.
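For context, that guarantee comes from the task runner invoking the registered callbacks in a finally block, roughly like this (a sketch of the mechanism, not Spark's exact code; runTaskBody is a hypothetical stand-in):

```scala
// Callbacks registered via addOnCompleteCallback run whether the
// task body returns normally or throws.
val context = new TaskContext(0, 0, 0)   // stageId, splitId, attemptId
try {
  runTaskBody(context)                   // hypothetical task body
} finally {
  context.executeOnCompleteCallbacks()   // fires every registered callback
}
```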

Also, can you add a similar method in the Java API? I guess you would need to create an interface in Java for the setup/close stuff too.

@squito (Contributor, Author) commented Feb 11, 2013

Good point -- I definitely hadn't thought about ensuring cleanup with exceptions.

I've updated it to use onCompleteCallback. I also added it to the Java API -- I added separate classes for PairRDDs & DoubleRDDs; I don't know if there is a better way to do that.

@stephenh (Contributor) commented:

I wonder if this could be done with something more like decoration:

val rdd = sc.textFile(...).setupPartitions(dbSetupLogic).mapPartitions(...).cleanupPartitions(dbCloseLogic)

So there would be two new RDDs: PartitionSetupRDD, which first invoked its setup function once per partition and then called firstParent.compute, and PartitionCleanupRDD, which set up the on-complete callback for its cleanup function.

Not sure if the decoupling would lead to unintended/nonsensical use cases. But, just musing: they could then be used separately, if you only need one or the other, or without the map, which PartitionMapper currently forces you to do.

Also, I just like that this would use plain functions rather than a new "PartitionMapper" interface -- for some reason that doesn't feel quite right, but I can't think of a better name.

I see what you're trying to do though.
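A rough sketch of the two decorator RDDs being described (hypothetical classes; this assumes the era's RDD subclassing hooks -- the one-parent constructor, getSplits, and firstParent -- plus the compute/TaskContext API shown earlier):

```scala
// PartitionSetupRDD: run `setup` once per partition, before any elements flow.
class PartitionSetupRDD[T: ClassManifest](prev: RDD[T], setup: () => Unit)
    extends RDD[T](prev) {
  override def getSplits = firstParent[T].splits
  override def compute(s: Split, context: TaskContext): Iterator[T] = {
    setup()
    firstParent[T].iterator(s, context)
  }
}

// PartitionCleanupRDD: register `cleanup` to run when the task completes,
// even if it fails partway through the partition.
class PartitionCleanupRDD[T: ClassManifest](prev: RDD[T], cleanup: () => Unit)
    extends RDD[T](prev) {
  override def getSplits = firstParent[T].splits
  override def compute(s: Split, context: TaskContext): Iterator[T] = {
    context.addOnCompleteCallback(cleanup)
    firstParent[T].iterator(s, context)
  }
}
```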

 * setup & cleanup that happens before & after computing each partition
 */
def mapWithSetupAndCleanup[K, V](m: JavaPairPartitionMapper[T, K, V]): JavaPairRDD[K, V] = {
  val scalaMapper = new PartitionMapper[T, (K, V)] {
Review comment from a Member on the diff above:

Can JavaPairPartitionMapper<T, K, V> be an abstract class that extends or implements PartitionMapper<T, Tuple2<K, V>>? If you can do that, then you wouldn't have to wrap the Java PartitionMapper to convert it into its Scala counterpart.
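In other words, something like this (a sketch of the suggestion; the exact shape is an assumption):

```scala
// If the Java-facing class extends the Scala trait directly, fixing the
// result type to a pair, no wrapping into a Scala counterpart is needed.
abstract class JavaPairPartitionMapper[T, K, V]
  extends PartitionMapper[T, (K, V)]   // (K, V) is Tuple2<K, V> seen from Java
```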

@mateiz (Member) commented Feb 11, 2013

Regarding Stephen's comment -- I think it's better to keep PartitionMapper a single object instead of using separate functions, in case you need to share state among the setup, map, and cleanup methods (e.g., you open some external resource, use it in your map, then close it).
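The resource case in a sketch (hypothetical connection and URL; member signatures assumed as before):

```scala
// All three methods close over the same `conn` field -- exactly what
// separate, independent functions would make awkward.
val perPartition = new PartitionMapper[String, Int] {
  var conn: java.sql.Connection = _                                             // shared state
  def setup() { conn = java.sql.DriverManager.getConnection("jdbc:example") }   // hypothetical URL
  def map(line: String): Int = line.length                                      // would query through `conn` here
  def cleanup() { conn.close() }
}
```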

@stephenh (Contributor) commented:

Ah, good point. That makes sense.

@squito (Contributor, Author) commented Feb 12, 2013

I updated JavaPairPartitionMapper, per Josh's suggestion. (We lose the ability for map to throw an exception, but that is already the case for the basic PartitionMapper.)

I tried doing the same thing for JavaDoubleRDD, but somehow I got stuck with weird manifest errors. First it complained:

[error] found : ClassManifest[scala.Double]
[error] required: ClassManifest[java.lang.Double]

Then when I switched to explicitly using a java.lang.Double manifest, it reversed:

[error] found : spark.RDD[java.lang.Double]
[error] required: spark.RDD[scala.Double]

So I just left it as-is.

@squito (Contributor, Author) commented Feb 13, 2013

OK, I think this is ready to go now. I got rid of the need for the helper object for JavaDoubleRDD by just casting from java.lang.Double to scala.Double, and the compiler seems happy. Also, I put a throws Exception declaration on map in PartitionMapper for the Java API.
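The trick in isolation (a sketch; at runtime both types are doubles, so the cast just bridges the boxed and primitive manifests):

```scala
val boxed: java.lang.Double = 1.5
val unboxed: scala.Double = boxed.asInstanceOf[scala.Double]  // unboxes at runtime
```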

@squito (Contributor, Author) commented Mar 16, 2013

Just curious what the status is on this -- are you waiting for some additional changes here, have you decided against merging it, or just haven't gotten to it yet?

@@ -401,6 +398,54 @@ public void mapPartitions() {
}

@Test
public void mapPartitionsWithSetupAndCleanup() {
Review comment from a Member on the diff above:

Please use two spaces for the indentation in this file (it looks like maybe it's tabs, or more than two spaces).

@mateiz (Member) commented Mar 16, 2013

Sorry, just hadn't had a chance to look at it. It looks good but I made two small comments.

(Merge commit) Conflicts:
	core/src/main/scala/spark/RDD.scala
@squito (Contributor, Author) commented Mar 16, 2013

Thanks! I've updated to take those comments into account.

@squito (Contributor, Author) commented Mar 20, 2013

Wow, somehow I totally missed committing changes to one file before ... hope you didn't waste time looking at it earlier; now it's actually all there.

@AmplabJenkins commented:

Can one of the admins verify this patch?

@AmplabJenkins commented:

I'm the Jenkins test bot for the UC Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!

(1 similar comment from @AmplabJenkins)

@AmplabJenkins commented:

Thank you for your pull request. An admin will review this request soon.
