
Clean up DAGScheduler datastructures after job completes #414

Open
wants to merge 1 commit into base: master

Conversation

jhartlaub

I added some fixes to reduce heap usage and prevent a memory leak we found after running a load test (thousands of jobs). There may be more leaks, but this fixes a big one.

@mateiz
Member

mateiz commented Jan 25, 2013

Good catch; the only thing I'm not 100% sure of is whether this will work when two active jobs depend on the same stage. (This can only happen if two threads submit a job at the same time, but that may occur in a multi-user system or in Spark Streaming.) I'll look into it more carefully, but if you've thought about this, let me know.

@jhartlaub
Author

This will remove all stages submitted under a single job - if another job comes along and depends on the same stages, it will not work. In our case, we have not resubmitted the same RDDs multiple times (using Shark we create new RDDs every time).

I can now see where getShuffleMapStage does a lookup that matches a stage by an ID from the RDD. I think a per-job reference count for the stage-related maps might work; does that seem plausible?
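A minimal sketch of the per-job reference-counting idea (the class and method names here are hypothetical, not part of the patch): each stage's count is bumped when a job registers it and decremented when the job finishes, and the stage-related maps are only cleaned once no active job references the stage.

```scala
import scala.collection.mutable.HashMap

// Hypothetical sketch: track how many active jobs reference each stage,
// and only signal cleanup once the count drops to zero.
class StageRefCounter {
  private val refCounts = new HashMap[Int, Int]() // stageId -> active job count

  def register(stageId: Int): Unit =
    refCounts(stageId) = refCounts.getOrElse(stageId, 0) + 1

  // Returns true if the stage is no longer referenced and may be removed
  // from the stage-related maps.
  def release(stageId: Int): Boolean = {
    val remaining = refCounts.getOrElse(stageId, 1) - 1
    if (remaining <= 0) { refCounts -= stageId; true }
    else { refCounts(stageId) = remaining; false }
  }
}
```

With this, two concurrent jobs sharing a shuffle stage would each register it, and the first job to finish would not tear the stage down from under the second.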

Thank you for looking at this, btw. You seem very busy if the mailing list is any indication.

-Jon

@mateiz
Member

mateiz commented Jan 26, 2013

Yeah, I think a reference count would be better. We should also make a test where multiple jobs depend on the same stage. For example:

val pairs = sc.parallelize(...)
val grouped = pairs.groupByKey()
grouped.filter(f1).count()
grouped.filter(f2).count()

Here, both actions should depend on the shuffle map stage created by the groupByKey.

The other thing I've been thinking about is whether we should just use WeakHashMap for most of the data structures. This way stages will only be kept for RDDs that the user program references somehow. I think this will be a fair amount of work but it might be the right thing in the long term.
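A rough sketch of the WeakHashMap idea (the map name and value type here are illustrative): keying the stage bookkeeping off the RDD object itself, rather than an integer ID, would let entries become garbage-collectable as soon as the user program drops its last reference to the RDD.

```scala
import java.util.WeakHashMap

// Illustrative only: with the RDD object as a weak key, the entry is
// eligible for collection once the user program no longer references
// the RDD, so no explicit cleanup pass is needed.
val rddToStage = new WeakHashMap[RDD[_], Stage]()
```

The trade-off mateiz notes applies: this touches most of the scheduler's data structures, so it is more invasive than reference counting, but it removes the need to decide explicitly when cleanup is safe.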

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

I'm the Jenkins test bot for the UC Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!


@velvia
Contributor

velvia commented Jul 25, 2013

Bump -- any progress on this one?

@mateiz
Member

mateiz commented Jul 25, 2013

I think this will have to be done after Spark 0.8, because of the reference-counting issue above. However, @markhamstra has also been looking at the reference counting for some of his own work, I believe.

@AmplabJenkins

Thank you for your pull request. An admin will review this request soon.

4 participants