
Deadlock Between MesosScheduler and JobTracker #66

Open
windancer055 opened this issue Sep 11, 2015 · 7 comments

Comments

@windancer055

Output from jstack:

Attaching to process ID 22531, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
Deadlock Detection:

Found one Java-level deadlock:

"IPC Server handler 4 on 7676":
waiting to lock Monitor@0x00007f01ec11b858 (Object@0x00000000831924a0, a org/apache/hadoop/mapred/MesosScheduler),
which is held by "pool-1-thread-1"
"pool-1-thread-1":
waiting to lock Monitor@0x00007f01ec32f9d8 (Object@0x00000000830f4310, a org/apache/hadoop/mapred/JobTracker),
which is held by "IPC Server handler 4 on 7676"

Found a total of 1 deadlock.
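For what it's worth, the same cycle can also be confirmed from inside the JVM with the standard java.lang.management API; a minimal sketch (the class below is made up for illustration and is not part of hadoop-on-mesos):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative only: a small probe that reports the same monitor cycle jstack shows.
public class DeadlockProbe {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        while (true) {
            long[] ids = mx.findDeadlockedThreads(); // threads blocked waiting on monitors
            if (ids != null) {
                for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
                    System.err.println(info);        // prints held locks and stacks, like the output above
                }
            }
            Thread.sleep(10000);
        }
    }
}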

@DarinJ
Contributor

DarinJ commented Sep 24, 2015

I hit the same thing. Removing the synchronized (this.scheduler) block in MesosTracker.idleCheck removed the deadlock; that block calls methods on scheduler that are synchronized on this (this being the MesosScheduler), which is ripe for deadlock. @tarnfeld what are you trying to guard here? It looks like you're worried something could be added to the tracker between the idleCounter >= idleCheckMax check and the scheduler.killTracker call (maybe in assignTasks). I'm going to look into a better way to achieve this.

I also noticed a lot of synchronized methods that just do logging; is this necessary? It seems like a lot of unnecessary blocking.

@windancer055 I'd suggest trying one of the releases, say 0.0.9 or 0.1.0; I've had good luck with them. They don't have framework auth, but it's easy to backport that.
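To spell out the cycle in code: the heartbeat path takes the JobTracker monitor and then needs the MesosScheduler monitor (via assignTasks), while the idle check takes the MesosScheduler monitor and then needs the JobTracker monitor (via taskTrackers). A stripped-down sketch of that shape; the classes below are simplified stand-ins, not the actual hadoop-on-mesos code:

// Simplified stand-ins for JobTracker and MesosScheduler; only the lock order matters here.
public class LockOrderSketch {
    static class Tracker {                      // plays the JobTracker role
        Sched sched;
        synchronized void heartbeat() {         // holds the Tracker monitor...
            sched.assignTasks();                // ...then blocks on the Sched monitor
        }
        synchronized void taskTrackers() { }    // needs the Tracker monitor
    }
    static class Sched {                        // plays the MesosScheduler role
        Tracker tracker;
        synchronized void assignTasks() { }     // needs the Sched monitor
        void idleCheck() {
            synchronized (this) {               // holds the Sched monitor...
                tracker.taskTrackers();         // ...then blocks on the Tracker monitor
            }
        }
    }
    public static void main(String[] args) {
        Tracker t = new Tracker();
        Sched s = new Sched();
        t.sched = s;
        s.tracker = t;
        new Thread(() -> { while (true) t.heartbeat(); }, "ipc-handler").start();
        new Thread(() -> { while (true) s.idleCheck(); }, "idle-check").start();
        // Run this long enough and jstack reports the same two-way deadlock as above.
    }
}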

@RecursionTaoist

hadoop-mapreduce1-project (cdh5.3.2) & mesos-0.24.0

Output from jstack:
...

Found one Java-level deadlock:

"830282351@qtp-1012114812-14":
waiting to lock monitor 0x00002ad34c0294f8 (object 0x00000000fd74bcd8, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 3 on 7676"
"IPC Server handler 3 on 7676":
waiting to lock monitor 0x00002ad350bedec8 (object 0x00000000fd854530, a org.apache.hadoop.mapred.MesosScheduler),
which is held by "pool-1-thread-1"
"pool-1-thread-1":
waiting to lock monitor 0x00002ad34c0294f8 (object 0x00000000fd74bcd8, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 3 on 7676"

Java stack information for the threads listed above:

"830282351@qtp-1012114812-14":
at org.apache.hadoop.mapred.JobTracker.getMapTaskReports(JobTracker.java:3939)
- waiting to lock <0x00000000fd74bcd8> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.TaskGraphServlet.doGet(TaskGraphServlet.java:73)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1122)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
"IPC Server handler 3 on 7676":
at org.apache.hadoop.mapred.MesosScheduler.assignTasks(MesosScheduler.java:264)
- waiting to lock <0x00000000fd854530> (a org.apache.hadoop.mapred.MesosScheduler)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2969)
- locked <0x00000000fd74bcd8> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:483)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
"pool-1-thread-1":
at org.apache.hadoop.mapred.JobTracker.taskTrackers(JobTracker.java:2595)
- waiting to lock <0x00000000fd74bcd8> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.MesosTracker$3.run(MesosTracker.java:148)
- locked <0x00000000fd854530> (a org.apache.hadoop.mapred.MesosScheduler)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.

@tarnfeld
Member

@tarnfeld what are you trying to guard here? It looks like you're worried something could be added to the tracker between the idleCounter >= idleCheckMax check and the scheduler.killTracker call (maybe in assignTasks). I'm going to look into a better way to achieve this.

@DarinJ Hey! Thanks for taking a look into this; yeah, the code/synchronized calls could definitely do with a tidy-up. There's a bunch of old logging methods there too, as you rightly mentioned, that we can probably remove entirely.

Happy to help work on a patch, do you have something started already?

@DarinJ
Contributor

DarinJ commented Oct 16, 2015

@tarnfeld I removed the synchronized (this.scheduler) block in MesosTracker.idleCheck and there's no more deadlock. I didn't spend a lot of time testing this and may have introduced a race condition. I also still had the bug I mentioned in #65, and my attempt to fix it created a situation where idle reducers waiting for the shuffle phase got killed.

I've got some ideas on how to fix this and can write them down for you, but I'm trying to spend more of my time working on Myriad these days.
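For anyone hitting this before a proper patch lands, the shape of a fix is to stop holding the scheduler monitor across calls back into the JobTracker and only take it for the short check-and-kill step. The sketch below illustrates that idea only; apart from idleCounter, idleCheckMax, taskTrackers() and killTracker(), which are mentioned in this thread, the names and types are made up and this is not the actual MesosTracker code:

// Sketch only: stand-in types so the snippet compiles on its own.
class IdleCheckSketch {
    interface TrackerSource { java.util.Collection<?> taskTrackers(); } // stands in for JobTracker
    interface Killer { void killTracker(IdleCheckSketch tracker); }     // stands in for MesosScheduler

    private final TrackerSource jobTracker;
    private final Killer scheduler;
    private volatile long idleCounter;
    private final long idleCheckMax = 5;

    IdleCheckSketch(TrackerSource jobTracker, Killer scheduler) {
        this.jobTracker = jobTracker;
        this.scheduler = scheduler;
    }

    void idleCheck() {
        // Call into the JobTracker with no scheduler lock held, so this thread never
        // takes the two monitors in the opposite order to heartbeat/assignTasks.
        java.util.Collection<?> trackers = jobTracker.taskTrackers();
        boolean stillKnown = !trackers.isEmpty();   // placeholder for whatever the real check needs

        // Hold the scheduler lock only for the short check-and-kill step, and re-check
        // the idle counter there so a task assigned in the meantime cancels the kill.
        synchronized (scheduler) {
            if (stillKnown && idleCounter >= idleCheckMax) {
                scheduler.killTracker(this);
            }
        }
    }
}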

@tarnfeld
Member

Sure, if you could share your thoughts, even just here in a comment, that'd be great.

@hermansc

I stumbled upon this issue too. I did a kill -3 to dump the thread stacks and expose the deadlock. I removed the lines @DarinJ talked about and it's now at least working.

Attached the diff file.
fix_deadlock_issue_66.txt

@tarnfeld
Member

I have a feeling that this issue may now be resolved on master; could you report back, @hermansc? I think the commit that introduced this was removed.
