
Deadlock Between MesosScheduler and JobTracker #66

Open
windancer055 opened this issue Sep 11, 2015 · 7 comments

Comments

@windancer055

Output from jstack:

Attaching to process ID 22531, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
Deadlock Detection:

Found one Java-level deadlock:

"IPC Server handler 4 on 7676":
waiting to lock Monitor@0x00007f01ec11b858 (Object@0x00000000831924a0, a org/apache/hadoop/mapred/MesosScheduler),
which is held by "pool-1-thread-1"
"pool-1-thread-1":
waiting to lock Monitor@0x00007f01ec32f9d8 (Object@0x00000000830f4310, a org/apache/hadoop/mapred/JobTracker),
which is held by "IPC Server handler 4 on 7676"

Found a total of 1 deadlock.
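For what it's worth, the same cycle can also be confirmed from inside the JVM with the standard java.lang.management API; a minimal sketch (the class below is made up for illustration and is not part of hadoop-on-mesos):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative only: a small probe that reports the same monitor cycle jstack shows.
public class DeadlockProbe {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        while (true) {
            long[] ids = mx.findDeadlockedThreads(); // threads blocked waiting on monitors
            if (ids != null) {
                for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
                    System.err.println(info);        // prints held locks and stacks, like the output above
                }
            }
            Thread.sleep(10000);
        }
    }
}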

@DarinJ
Contributor

DarinJ commented Sep 24, 2015

I hit the same thing. Removing the synchronized (this.scheduler) block in MesosTracker.idleCheck removed the deadlock; that block calls methods on scheduler that are synchronized on this (this being the MesosScheduler), which is ripe for deadlock. @tarnfeld what are you trying to guard here? It looks like you're worried something could be added to the tracker between the idleCounter >= idleCheckMax check and the scheduler.killTracker call (maybe in assignTasks). I'm going to look into a better way to achieve this.

I also noticed a lot of synchronized methods that just do logging; is this necessary? It seems like a lot of unnecessary blocking.

@windancer055 I'd suggest trying one of the releases, say 0.0.9 or 0.1.0; I've had good luck with them. They don't have framework auth, but it's easy to backport that.
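To spell out the cycle in code: the heartbeat path takes the JobTracker monitor and then needs the MesosScheduler monitor (via assignTasks), while the idle check takes the MesosScheduler monitor and then needs the JobTracker monitor (via taskTrackers). A stripped-down sketch of that shape; the classes below are simplified stand-ins, not the actual hadoop-on-mesos code:

// Simplified stand-ins for JobTracker and MesosScheduler; only the lock order matters here.
public class LockOrderSketch {
    static class Tracker {                      // plays the JobTracker role
        Sched sched;
        synchronized void heartbeat() {         // holds the Tracker monitor...
            sched.assignTasks();                // ...then blocks on the Sched monitor
        }
        synchronized void taskTrackers() { }    // needs the Tracker monitor
    }
    static class Sched {                        // plays the MesosScheduler role
        Tracker tracker;
        synchronized void assignTasks() { }     // needs the Sched monitor
        void idleCheck() {
            synchronized (this) {               // holds the Sched monitor...
                tracker.taskTrackers();         // ...then blocks on the Tracker monitor
            }
        }
    }
    public static void main(String[] args) {
        Tracker t = new Tracker();
        Sched s = new Sched();
        t.sched = s;
        s.tracker = t;
        new Thread(() -> { while (true) t.heartbeat(); }, "ipc-handler").start();
        new Thread(() -> { while (true) s.idleCheck(); }, "idle-check").start();
        // Run this long enough and jstack reports the same two-way deadlock as above.
    }
}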

@RecursionTaoist

hadoop-mapreduce1-project (cdh5.3.2) & mesos-0.24.0

Output from jstack:
...

Found one Java-level deadlock:

"830282351@qtp-1012114812-14":
waiting to lock monitor 0x00002ad34c0294f8 (object 0x00000000fd74bcd8, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 3 on 7676"
"IPC Server handler 3 on 7676":
waiting to lock monitor 0x00002ad350bedec8 (object 0x00000000fd854530, a org.apache.hadoop.mapred.MesosScheduler),
which is held by "pool-1-thread-1"
"pool-1-thread-1":
waiting to lock monitor 0x00002ad34c0294f8 (object 0x00000000fd74bcd8, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 3 on 7676"

Java stack information for the threads listed above:

"830282351@qtp-1012114812-14":
at org.apache.hadoop.mapred.JobTracker.getMapTaskReports(JobTracker.java:3939)
- waiting to lock <0x00000000fd74bcd8> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.TaskGraphServlet.doGet(TaskGraphServlet.java:73)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1122)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
"IPC Server handler 3 on 7676":
at org.apache.hadoop.mapred.MesosScheduler.assignTasks(MesosScheduler.java:264)
- waiting to lock <0x00000000fd854530> (a org.apache.hadoop.mapred.MesosScheduler)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2969)
- locked <0x00000000fd74bcd8> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:483)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
"pool-1-thread-1":
at org.apache.hadoop.mapred.JobTracker.taskTrackers(JobTracker.java:2595)
- waiting to lock <0x00000000fd74bcd8> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.MesosTracker$3.run(MesosTracker.java:148)
- locked <0x00000000fd854530> (a org.apache.hadoop.mapred.MesosScheduler)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.

@tarnfeld
Member

@tarnfeld what are you trying to guard here? It looks like you're worried something could be added to the tracker between the idleCounter >= idleCheckMax check and the scheduler.killTracker call (maybe in assignTasks). I'm going to look into a better way to achieve this.

@DarinJ Hey! Thanks for taking a look into this; yeah, the code/synchronized calls could definitely do with a tidy-up. There's a bunch of old logging methods there too, as you rightly mentioned, that we can probably remove entirely.

Happy to help work on a patch, do you have something started already?

@DarinJ
Contributor

DarinJ commented Oct 16, 2015

@tarnfeld I removed the synchronized (this.scheduler) block in MesosTracker.idleCheck and there's no more deadlock. I didn't spend a lot of time testing this and may have introduced a race condition. I also still had the bug I mentioned in #65, and my attempt to fix it created a situation where idle reducers waiting for the shuffle phase got killed.

I've got some ideas on how to fix this and can write them down for you, but I'm trying to spend more of my time working on Myriad these days.
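For anyone hitting this before a proper patch lands, the shape of a fix is to stop holding the scheduler monitor across calls back into the JobTracker and only take it for the short check-and-kill step. The sketch below illustrates that idea only; apart from idleCounter, idleCheckMax, taskTrackers() and killTracker(), which are mentioned in this thread, the names and types are made up and this is not the actual MesosTracker code:

// Sketch only: stand-in types so the snippet compiles on its own.
class IdleCheckSketch {
    interface TrackerSource { java.util.Collection<?> taskTrackers(); } // stands in for JobTracker
    interface Killer { void killTracker(IdleCheckSketch tracker); }     // stands in for MesosScheduler

    private final TrackerSource jobTracker;
    private final Killer scheduler;
    private volatile long idleCounter;
    private final long idleCheckMax = 5;

    IdleCheckSketch(TrackerSource jobTracker, Killer scheduler) {
        this.jobTracker = jobTracker;
        this.scheduler = scheduler;
    }

    void idleCheck() {
        // Call into the JobTracker with no scheduler lock held, so this thread never
        // takes the two monitors in the opposite order to heartbeat/assignTasks.
        java.util.Collection<?> trackers = jobTracker.taskTrackers();
        boolean stillKnown = !trackers.isEmpty();   // placeholder for whatever the real check needs

        // Hold the scheduler lock only for the short check-and-kill step, and re-check
        // the idle counter there so a task assigned in the meantime cancels the kill.
        synchronized (scheduler) {
            if (stillKnown && idleCounter >= idleCheckMax) {
                scheduler.killTracker(this);
            }
        }
    }
}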

@tarnfeld
Member

Sure, if you could share your thoughts, even just here in a comment, that'd be great.

@hermansc

I stumbled upon this issue too. I did a kill -3 to dump the thread stacks and expose the deadlock. I removed the lines @DarinJ talked about and it's now at least working.

Attached the diff file.
fix_deadlock_issue_66.txt

@tarnfeld
Member

I have a feeling that this issue may now be resolved on master; could you report back, @hermansc? I think the commit that introduced this was removed.
