Deadlock Between MesosScheduler and JobTracker #66
I hit the same thing. Removing the synchronized (this.scheduler) block in MesosTracker.idleCheck removed the deadlock; that block calls methods on scheduler that are synchronized on this (the context is MesosScheduler here), which is ripe for deadlock. @tarnfeld, what are you trying to guard here? It looks like you're worried something could be added to the tracker between the idleCounter >= idleCheckMax check and the scheduler.killTracker call (maybe in assignTasks). I'm going to look into a better way to achieve this. I also noticed a lot of synchronized methods that just do logging; is this necessary? It seems like a lot of unnecessary blocking. @windancer055 I'd suggest trying one of the releases, say 0.0.9 or 0.1.0. I've had good luck with them, though they don't have framework auth, but it's easy to backport that.
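To make the lock-ordering inversion described above concrete, here is a minimal, self-contained Java sketch. It is illustrative only, not the project's actual code: the thread names mirror the jStack output below, and methods like idleCheck, killTracker, and assignTasks stand in for the real ones.

```java
// Hypothetical sketch of the lock-ordering inversion discussed in this issue.
public class DeadlockSketch {

    static class Scheduler {
        final Object jobTrackerLock;

        Scheduler(Object jobTrackerLock) {
            this.jobTrackerLock = jobTrackerLock;
        }

        // Synchronized on the Scheduler instance, then reaches for the JobTracker lock.
        synchronized void killTracker() {
            synchronized (jobTrackerLock) {
                System.out.println("killTracker: got both locks");
            }
        }

        // Called by an RPC handler that already holds the JobTracker lock,
        // so this path acquires JobTracker first, then the Scheduler monitor.
        synchronized void assignTasks() {
            System.out.println("assignTasks: got scheduler lock");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final Object jobTrackerLock = new Object();
        final Scheduler scheduler = new Scheduler(jobTrackerLock);

        // "pool-1-thread-1": the idle check takes the scheduler lock first.
        Thread idleCheck = new Thread(() -> {
            synchronized (scheduler) {       // scheduler monitor held...
                sleep(100);
                scheduler.killTracker();     // ...then wants the JobTracker lock
            }
        }, "idleCheck");

        // "IPC Server handler": the JobTracker holds its own lock, then calls into the scheduler.
        Thread ipcHandler = new Thread(() -> {
            synchronized (jobTrackerLock) {  // JobTracker monitor held...
                sleep(100);
                scheduler.assignTasks();     // ...then wants the scheduler monitor
            }
        }, "ipcHandler");

        idleCheck.start();
        ipcHandler.start();
        idleCheck.join();                    // with the sleeps in place, this never returns
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}
```

Run it and both threads block forever: each holds the monitor the other needs, which is exactly the cycle jStack reports below.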
hadoop-mapreduce1-project (cdh5.3.2) & mesos-0.24.0
Output from jStack:
Found one Java-level deadlock:
"830282351@qtp-1012114812-14":
Java stack information for the threads listed above:
"830282351@qtp-1012114812-14":
Found 1 deadlock.
@DarinJ Hey! Thanks for taking a look into this. Yeah, the code/synchronized calls could definitely do with a tidy-up. There's a bunch of old logging methods there too, as you rightly mentioned, that we can probably remove entirely. Happy to help work on a patch; do you have something started already?
@tarnfeld I removed the synchronized (this.scheduler) block in MesosTracker.idleCheck and no longer see the deadlock. I didn't spend a lot of time testing this and I may have introduced a race condition. I still had the bug I mentioned in #65, which, after I tried to fix it, created a situation where idle reducers waiting for the shuffle phase were killed. I've got some ideas on how to fix this and can write them down for you, but I'm trying to spend more of my time working on Myriad these days.
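For reference, one possible direction (a sketch under assumptions, not DarinJ's actual ideas or the project's patch): drop the tracker-side synchronized (this.scheduler) block and re-validate the idle state inside a synchronized method on the scheduler itself, so the check and the kill happen atomically with respect to assignTasks. Names such as killTrackerIfIdle and isIdle are hypothetical.

```java
// Hypothetical sketch: guard the idle-kill decision with the scheduler's own monitor
// instead of nesting it inside a tracker-held lock, avoiding the lock-ordering cycle.
public class IdleKillSketch {

    static class Tracker {
        private volatile int idleCounter;
        private final int idleCheckMax = 5;

        boolean isIdle()        { return idleCounter >= idleCheckMax; }
        void recordIdleTick()   { idleCounter++; }
        void resetIdle()        { idleCounter = 0; }
    }

    static class Scheduler {
        // Both paths now serialize on the scheduler monitor only, in the same order.
        synchronized void assignTasks(Tracker t) {
            t.resetIdle();                   // new work arrived; tracker is no longer idle
        }

        synchronized void killTrackerIfIdle(Tracker t) {
            if (t.isIdle()) {                // re-check under the scheduler lock
                System.out.println("killing idle tracker");
            }
        }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        Tracker tracker = new Tracker();

        // The idle check no longer wraps the call in synchronized (scheduler); it asks
        // the scheduler to make the decision, so no second lock is acquired while one is held.
        for (int i = 0; i < 6; i++) tracker.recordIdleTick();
        scheduler.killTrackerIfIdle(tracker);
    }
}
```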
Sure, if you could share your thoughts, even just here in a comment, that'd be great.
I stumbled upon this issue too. Attached the diff file.
I have a feeling that this issue may now be resolved on master. Could you report back, @hermansc? I think the commit that introduced this was removed.
Output from jStack:
Attaching to process ID 22531, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
Deadlock Detection:
Found one Java-level deadlock:
"IPC Server handler 4 on 7676":
waiting to lock Monitor@0x00007f01ec11b858 (Object@0x00000000831924a0, a org/apache/hadoop/mapred/MesosScheduler),
which is held by "pool-1-thread-1"
"pool-1-thread-1":
waiting to lock Monitor@0x00007f01ec32f9d8 (Object@0x00000000830f4310, a org/apache/hadoop/mapred/JobTracker),
which is held by "IPC Server handler 4 on 7676"
Found a total of 1 deadlock.