If master pod is destroyed and recreated, it takes ages for worker pods to timeout #84
Comments
Hmm, connections from the worker to the master ought to be done through an OpenShift service, so the IP shouldn't matter. The master service should still be there, and the pod selector should work for the new master. I'll have to poke at this.
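For illustration, here is a minimal sketch (using the kubernetes Python client; the `spark-master` service name and `myproject` namespace are assumptions, not taken from this cluster) of how the service's label selector keeps pointing at whichever pod currently matches, including a recreated master:

```python
from kubernetes import client, config

# Sketch only: service name and namespace below are placeholders.
config.load_kube_config()
v1 = client.CoreV1Api()

svc = v1.read_namespaced_service(name="spark-master", namespace="myproject")
selector = ",".join(f"{k}={v}" for k, v in svc.spec.selector.items())

# Any pod matching the selector -- the old master or a recreated one -- backs the service.
pods = v1.list_namespaced_pod(namespace="myproject", label_selector=selector)
for pod in pods.items:
    print(pod.metadata.name, pod.status.pod_ip)
```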
I wonder if the workers have already done their DNS lookup for the IP, though, and are only trying to connect with that information. They may not attempt to re-resolve the IP from DNS until they are restarted. I feel like what is being seen in steps 3 & 4 most likely originates in the Spark code; we probably need to investigate how the workers handle the loss of a master.
How are you killing the master pod, by the way?
@elmiko, ack. I was thinking of the startup sequence.
This PR seems quasi-related, and mentions (but does not address) the problem we're seeing here. |
It does seem related, so I guess my next question is: were you using the latest 2.2 openshift-spark image? (I'm assuming yes if you tested with our current upstream.)
@elmiko Yes, I was using the latest 2.2 openshift-spark image.
I'm not entirely sure why the worker can't just re-resolve the DNS address of the master before attempting to reconnect.
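As a rough illustration of that idea (in Python rather than Spark's own Scala, with a hypothetical service hostname and the default standalone port), "re-resolve before reconnecting" would look something like:

```python
import socket

def resolve_master_url(service_host: str, port: int = 7077) -> str:
    """Do a fresh DNS lookup each time instead of reusing a cached IP."""
    ip = socket.gethostbyname(service_host)  # picks up the recreated master pod's IP
    return f"spark://{ip}:{port}"

# e.g. resolve_master_url("spark-master") returns the current address even after
# the master pod behind the service has been destroyed and recreated.
```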
Doh! Thanks for the highlight. Hmm, I'm wondering what our best recourse is here. It sounds like the upstream may not have a good solution, and I'm not sure how efficient it would be to destroy the workers if we detect a master going down. Another question in all of this: if the Spark master goes away, how much work would be lost by a driver application? We have had internal discussions about how upgrading a cluster might occur against a driver, and I'm not sure you wouldn't just want to restart the driver if the master goes away. If the master contained state about the driver's progress and wasn't able to recompute the RDDs or DataFrames for the driver, it would all be lost. This is at the edge of my Spark internals knowledge, but perhaps @willb or @erikerlandson might have a better understanding.
Yes, we have to map this all out. On the short list.
One thing we could do is use Spark’s [built-in HA with a persistent volume](https://spark.apache.org/docs/latest/spark-standalone.html#single-node-recovery-with-local-file-system) (see the footnote about NFS). But that introduces some additional problems and I agree that the underlying issue should be fixed upstream in Spark.
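For reference, the two recovery properties from that doc look roughly like the sketch below; the recovery directory path here is an assumption and would need to live on a persistent (e.g. NFS-backed) volume mounted into the master pod.

```python
# Sketch of the standalone single-node recovery settings from the linked docs.
# The mount path below is illustrative, not a path used by this project.
recovery_props = {
    "spark.deploy.recoveryMode": "FILESYSTEM",
    "spark.deploy.recoveryDirectory": "/var/spark-recovery",
}

# The docs pass these to the master process via SPARK_DAEMON_JAVA_OPTS:
print("SPARK_DAEMON_JAVA_OPTS=" + " ".join(f"-D{k}={v}" for k, v in recovery_props.items()))
```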
Hi, I was wondering if this is still a problem? And if so, is there a use case I can use to investigate against Spark 2.3?
cc @bleggett ^^
I haven't retested recently, so I'm not sure. @rebeccaSimmonds19 If you want to investigate, you should be able to stand up a pretty normal cluster, follow the steps in the original post, and see if you can observe the same behavior with 2.3. If not, feel free to close this issue.
This is still a problem with spark-2.3-latest on OpenShift 3.7 with the latest oshinko cluster.
I was testing to see what happens if the Spark master pod is killed.
What happens is that
Is there a better way to handle steps 3 & 4, so that a failure of the master pod doesn't render the Spark cluster inoperable for a long period?
Or is this not an oshinko problem, but a problem with Spark itself?
Info appreciated!
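For anyone reproducing this, one way to kill the master pod programmatically is sketched below with the kubernetes Python client; the pod name and namespace are placeholders, not values from this issue.

```python
from kubernetes import client, config

# Sketch only: substitute the real master pod name and project/namespace.
config.load_kube_config()
v1 = client.CoreV1Api()

# Deleting the pod makes its controller recreate it with a new IP, which is
# what triggers the slow worker timeout described above.
v1.delete_namespaced_pod(name="spark-master-1-abcde", namespace="myproject")
```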