
If master pod is destroyed and recreated, it takes ages for worker pods to timeout #84

Open
bleggett opened this issue Oct 4, 2017 · 15 comments


@bleggett

bleggett commented Oct 4, 2017

I was testing to see what happens if the Spark master pod is killed.

What happens is that

  1. A new Master pod is created by Kube (good)
  2. The existing Worker pods notice that the old Master pod is gone and try to reconnect (good)
  3. The existing worker pods keep trying to reconnect using the old Master pod IP (bad, the new pod has a new IP)
  4. The worker pods eventually exhaust their reconnection attempts and exit after a long period of time (not great)
  5. As the worker pods slowly do this, they are restarted by Kube and are able to connect to the new Master pod (good).

Is there a way to handle steps 3 & 4 in a better way, so that a failure of the Master pod doesn't render the Spark cluster inoperable for a long period?

Or is this not an oshinko problem, but a problem with

  • Spark
  • openshift-spark

Info appreciated!

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

Hmm, connections from the worker to the master ought to be done through an openshift service, so IP shouldn't matter. The master service should still be there and the pod selector should work for the new master. I'll have to poke at this.
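For reference, one quick way to sanity-check that theory is to resolve the master service name from inside a worker pod and confirm it still points at a stable ClusterIP after the master pod is recreated. A minimal sketch, assuming the service is named `spark-master` (the actual name depends on how oshinko created the cluster):

```scala
import java.net.InetAddress

// Resolve the master *service* name rather than a pod IP. A Kubernetes/
// OpenShift service keeps a stable ClusterIP while the pods behind it
// come and go, so this address should stay the same after the master
// pod is deleted and recreated.
object ResolveMasterService {
  def main(args: Array[String]): Unit = {
    val serviceHost = "spark-master" // assumed service name
    val resolved = InetAddress.getByName(serviceHost)
    println(s"$serviceHost resolves to ${resolved.getHostAddress}")
  }
}
```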

@elmiko
Contributor

elmiko commented Oct 4, 2017

i wonder if the workers have already done their dns lookup for the IP, though, and are only trying to connect with that information. they may not attempt to re-look up the IP from DNS until they are restarted.

i feel like what is being seen in 3 & 4 is most likely originating in the Spark code. we probably need to investigate how the workers handle the dropout of a master.
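One JVM detail that fits this theory: `java.net.InetSocketAddress` resolves its hostname once, at construction time, and keeps the resulting IP, so any retry loop that reuses a previously built address keeps dialing the old pod IP. Whether Spark's Worker does exactly this is what would need checking in the Spark code; the snippet below is only a generic illustration with a placeholder hostname:

```scala
import java.net.InetSocketAddress

object StaleAddressDemo {
  def main(args: Array[String]): Unit = {
    val host = "spark-master" // placeholder; any resolvable name works

    // Resolved eagerly: the IP captured here is reused on every retry,
    // even if the name later points at a different pod.
    val builtOnce = new InetSocketAddress(host, 7077)
    println(s"resolved at construction: ${builtOnce.getAddress}")

    // Resolved per attempt: building a fresh address before each
    // reconnect would pick up whatever the name points to now.
    def freshAddress() = new InetSocketAddress(host, 7077)
    println(s"resolved just now:        ${freshAddress().getAddress}")
  }
}
```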

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

How are you killing the master pod, by the way?

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

@elmiko, ack. I was thinking of the startup sequence.

@bleggett
Author

bleggett commented Oct 4, 2017

@elmiko That's my guess; that's what it looks like from the logs.

@tmckayus Deleting the master pod through the OpenShift web UI, leaving the "kill this pod immediately without waiting for graceful termination" box checked.

@bleggett
Author

bleggett commented Oct 4, 2017

This PR seems quasi-related, and mentions (but does not address) the problem we're seeing here.

apache/spark#17821

@elmiko
Contributor

elmiko commented Oct 4, 2017

does seem related, so i guess my next question is: were you using the latest 2.2 openshift-spark image?

(i'm assuming yes if you tested with our current upstream)

@bleggett
Author

bleggett commented Oct 4, 2017

@elmiko Yes, openshift-spark:2.2. The case we are hitting here is the case he explicitly calls out as not solved by his PR, I think:

There is still one potential issue though. When a master is restarted or takes over leadership, the worker will use the address sent from the master to connect. If there is still a proxy between master and worker, the address may be wrong. However, there is no way to figure it out just in the worker.

I'm not entirely sure why the worker can't just re-resolve the DNS address of the master before attempting to reconnect.
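To illustrate what that could look like, here is a purely hypothetical reconnect loop (not Spark's actual Worker code) that re-resolves the master hostname on every attempt; `spark-master` and port 7077 are placeholders. Even this wouldn't help if, as the PR comment says, the master advertises its own pod IP rather than a name, and the JVM's DNS cache (`networkaddress.cache.ttl`) can also hold on to a stale answer unless configured otherwise:

```scala
import java.net.{InetAddress, Socket}
import scala.util.Try

// Hypothetical sketch: retry a TCP connection to the master, but look the
// hostname up again on every attempt instead of reusing the address that
// was resolved when the worker first registered.
object ReResolvingReconnect {
  def connectWithRetry(masterHost: String, port: Int, attempts: Int): Option[Socket] =
    (1 to attempts).iterator
      .map { n =>
        Try {
          val ip = InetAddress.getByName(masterHost) // fresh lookup each try
          println(s"attempt $n: $masterHost -> ${ip.getHostAddress}")
          new Socket(ip, port)
        }.toOption
      }
      .collectFirst { case Some(socket) => socket }

  def main(args: Array[String]): Unit = {
    val result = connectWithRetry("spark-master", 7077, attempts = 3)
    println(result.map(_ => "connected").getOrElse("gave up"))
  }
}
```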

@elmiko
Contributor

elmiko commented Oct 4, 2017

doh!

thanks for the highlight. hmm, i'm wondering what our best recourse is here; it sounds like the upstream may not have a good solution, and i'm not sure how efficient it would be to destroy the workers if we detect a master going down.

another question in all of this: if the spark master goes away, how much work would be lost by a driver application?

we have had internal discussions about how upgrading a cluster might occur against a driver, and i'm not sure you wouldn't just want to restart the driver if the master goes away. if the master contained state about the driver's progress and the RDDs or DataFrames couldn't be recomputed for the driver, it would all be lost. this is at the edge of my spark internals knowledge, but perhaps @willb or @erikerlandson might have a better understanding.

@tmckayus
Collaborator

tmckayus commented Oct 4, 2017

Yes, we have to map this all out. On the short list.

@willb
Member

willb commented Oct 4, 2017 via email

@rebeccaSimmonds19
Contributor

Hi, I was wondering if this is still a problem? And if so, is there a use case I can use to investigate against Spark 2.3?

@erikerlandson

cc @bleggett ^^

@bleggett
Author

I haven't retested recently, so not sure.

@rebeccaSimmonds19 If you want to investigate, you should be able to stand up a pretty normal cluster, follow the steps in the original post, and see if you can observe the same behavior with 2.3.

If not, feel free to close this issue.

@rebeccaSimmonds19
Contributor

This is still a problem in spark-2.3-latest on OpenShift 3.7 with the latest oshinko cluster.
