
If master pod is destroyed and recreated, it takes ages for worker pods to timeout #84

Open
bleggett opened this issue Oct 4, 2017 · 15 comments


@bleggett

bleggett commented Oct 4, 2017

I was testing to see what happens if the Spark master pod is killed.

What happens is that

  1. A new Master pod is created by Kube (good)
  2. The existing Worker pods notice that the old Master pod is gone and try to reconnect (good)
  3. The existing worker pods keep trying to reconnect using the old Master pod IP (bad, the new pod has a new IP)
  4. The worker pods eventually exhaust their reconnection attempts and exit after a long period of time (not great)
  5. As the worker pods slowly do this, they are restarted by Kube and are able to connect to the new Master pod (good).

Is there a way to handle steps 3 & 4 in a better way, so that a failure of the Master pod doesn't render the Spark cluster inoperable for a long period?

Or is this not an oshinko problem, but a problem with

  • Spark
  • openshift-spark

Info appreciated!

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

Hmm, connections from the worker to the master ought to be done through an openshift service, so IP shouldn't matter. The master service should still be there and the pod selector should work for the new master. I'll have to poke at this.
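For reference, one quick way to sanity-check that theory is to resolve the master service name from inside a worker pod and confirm it still points at a stable ClusterIP after the master pod is recreated. A minimal sketch, assuming the service is named `spark-master` (the actual name depends on how oshinko created the cluster):

```scala
import java.net.InetAddress

// Resolve the master *service* name rather than a pod IP. A Kubernetes/
// OpenShift service keeps a stable ClusterIP while the pods behind it
// come and go, so this address should stay the same after the master
// pod is deleted and recreated.
object ResolveMasterService {
  def main(args: Array[String]): Unit = {
    val serviceHost = "spark-master" // assumed service name
    val resolved = InetAddress.getByName(serviceHost)
    println(s"$serviceHost resolves to ${resolved.getHostAddress}")
  }
}
```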

@elmiko
Contributor

elmiko commented Oct 4, 2017

i wonder if the workers have already done their dns lookup for the IP, though, and are only trying to connect with that information. they may not attempt to re-look up the IP from DNS until they are restarted.

i feel like what is being seen in 3 & 4 is most likely originating in the Spark code. we probably need to investigate how the workers handle the dropout of a master.
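One JVM detail that fits this theory: `java.net.InetSocketAddress` resolves its hostname once, at construction time, and keeps the resulting IP, so any retry loop that reuses a previously built address keeps dialing the old pod IP. Whether Spark's Worker does exactly this is what would need checking in the Spark code; the snippet below is only a generic illustration with a placeholder hostname:

```scala
import java.net.InetSocketAddress

object StaleAddressDemo {
  def main(args: Array[String]): Unit = {
    val host = "spark-master" // placeholder; any resolvable name works

    // Resolved eagerly: the IP captured here is reused on every retry,
    // even if the name later points at a different pod.
    val builtOnce = new InetSocketAddress(host, 7077)
    println(s"resolved at construction: ${builtOnce.getAddress}")

    // Resolved per attempt: building a fresh address before each
    // reconnect would pick up whatever the name points to now.
    def freshAddress() = new InetSocketAddress(host, 7077)
    println(s"resolved just now:        ${freshAddress().getAddress}")
  }
}
```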

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

How are you killing the master pod, by the way?

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

@elmiko, ack. I was thinking of the startup sequence.

@bleggett
Author

bleggett commented Oct 4, 2017

@elmiko That's my guess; that's what it looks like from the logs.

@tmckayus Deleting the master pod through the OpenShift web UI, leaving the "kill this pod immediately without waiting for graceful termination" box checked.

@bleggett
Author

bleggett commented Oct 4, 2017

This PR seems quasi-related, and mentions (but does not address) the problem we're seeing here.

apache/spark#17821

@elmiko
Contributor

elmiko commented Oct 4, 2017

does seem related, so i guess my next question is: were you using the latest 2.2 openshift-spark image?

(i'm assuming yes if you tested with our current upstream)

@bleggett
Author

bleggett commented Oct 4, 2017

@elmiko Yes, openshift-spark:2.2. The case we are hitting here is the case he explicitly calls out as not solved by his PR, I think:

There is still one potential issue though. When a master is restarted or takes over leadership, the worker will use the address sent from the master to connect. If there is still a proxy between master and worker, the address may be wrong. However, there is no way to figure it out just in the worker.

I'm not entirely sure why the worker can't just re-resolve the DNS address of the master before attempting to reconnect.
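To illustrate what that could look like, here is a purely hypothetical reconnect loop (not Spark's actual Worker code) that re-resolves the master hostname on every attempt; `spark-master` and port 7077 are placeholders. Even this wouldn't help if, as the PR comment says, the master advertises its own pod IP rather than a name, and the JVM's DNS cache (`networkaddress.cache.ttl`) can also hold on to a stale answer unless configured otherwise:

```scala
import java.net.{InetAddress, Socket}
import scala.util.Try

// Hypothetical sketch: retry a TCP connection to the master, but look the
// hostname up again on every attempt instead of reusing the address that
// was resolved when the worker first registered.
object ReResolvingReconnect {
  def connectWithRetry(masterHost: String, port: Int, attempts: Int): Option[Socket] =
    (1 to attempts).iterator
      .map { n =>
        Try {
          val ip = InetAddress.getByName(masterHost) // fresh lookup each try
          println(s"attempt $n: $masterHost -> ${ip.getHostAddress}")
          new Socket(ip, port)
        }.toOption
      }
      .collectFirst { case Some(socket) => socket }

  def main(args: Array[String]): Unit = {
    val result = connectWithRetry("spark-master", 7077, attempts = 3)
    println(result.map(_ => "connected").getOrElse("gave up"))
  }
}
```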

@elmiko
Contributor

elmiko commented Oct 4, 2017

doh!

thanks for the highlight. hmm, i'm wondering what our best recourse is here; it sounds like the upstream may not have a good solution, and i'm not sure how efficient it would be to destroy the workers if we detect a master going down.

another question in all of this: if the spark master goes away, how much work would be lost by a driver application?

we have had internal discussions about how upgrading a cluster might occur against a driver, and i'm not sure you wouldn't just want to restart the driver if the master goes away. if the master contained state about the driver's progress and the RDDs or DataFrames couldn't be recomputed for the driver, it would all be lost. this is at the edge of my spark internals knowledge, but perhaps @willb or @erikerlandson might have a better understanding.

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

Yes, we have to map this all out. On the short list.

@willb
Member

willb commented Oct 4, 2017 via email

@rebeccaSimmonds19
Contributor

Hi, I was wondering if this is still a problem? And if so, is there a use case I can use to investigate against Spark 2.3?

@erikerlandson

cc @bleggett ^^

@bleggett
Author

I haven't retested recently, so not sure.

@rebeccaSimmonds19 If you want to investigate, you should be able to stand up a pretty normal cluster, follow the steps in the original post, and see if you can observe the same behavior with 2.3.

If not, feel free to close this issue.

@rebeccaSimmonds19
Contributor

This is still a problem in spark-2.3-latest on OpenShift 3.7 with the latest oshinko cluster.
