ProcessPool join hangs for 60 seconds due to intermittent deadlock #147
Comments
If I understand correctly, the occasional 60-second delay observed in `ProcessPool.join()` is caused by an intermittent deadlock. The root cause seems to be:
The deadlock is recovered due to the …
As for the ways to fix it, I'm thinking of maybe letting the …
Hello, this is a known limitation and it's covered within these test cases. As the documentation states, Python multiprocessing guidelines fully apply to Pebble as well. In particular, I am addressing the following section:
This is an anti-pattern for a few reasons.
The recommended way to handle these situations is to use files rather than sending large amounts of data back and forth through the pipe. What is observed most of the time is a significant increase in performance within the application itself (some references: 1, 2). Your main loop and your workers would write the data into dedicated files and only share the file paths with each other. This allows an optimal usage of the pipe (an empty pipe is a happy pipe) and leads to a few benefits:
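As a rough illustration of this file-based approach (a minimal sketch, not taken from the Pebble docs; `produce_large_output` is a hypothetical worker function):

```python
# Workers write their output to a temporary file and return only the path,
# so the result pipe carries a short string instead of the payload itself.
import os
import tempfile

from pebble import ProcessPool


def produce_large_output(chunk_id):
    payload = b"x" * (2 * 1024 * 1024)  # stand-in for a large result
    fd, path = tempfile.mkstemp(prefix="chunk-%d-" % chunk_id)
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    return path  # only the path travels through the pipe


if __name__ == "__main__":
    with ProcessPool(max_workers=4) as pool:
        futures = [pool.schedule(produce_large_output, args=[i]) for i in range(8)]
        for future in futures:
            path = future.result()
            with open(path, "rb") as handle:
                data = handle.read()  # the main loop reads the payload from disk
            os.remove(path)
```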
That said, I do agree that the current implementation is sub-optimal, as it is designed around the recommended usage. Yet the solution you propose cannot be accepted due to security reasons. The expectation over the … Waiting for the pipe to be drained does not provide the same guarantees: a malicious worker process could keep slowly feeding the pipe, literally locking the main loop indefinitely. Hence, a better solution should be devised. Indeed, when we ask the pool to stop (compared to the …) … I will come up with a proper implementation in the upcoming days. In the meantime I have 2 recommendations for you:
Thank you for the detailed comment. I'll look into the workarounds you mentioned. However, I have a question about the concern you expressed regarding the PR:
My understanding is that the PR doesn't introduce waiting for the pipe to be drained. The pipe draining is still executed on the background thread, and that thread's lifetime is only prolonged to (roughly) until … I've also updated the PR to add a unit test for the situation you described (a spammy worker), if I understood it correctly.
Regarding your other suggestion about large amounts of data: I've double-checked the application where the problem was originally observed (marxin/cvise#41), and the sizes of the messages are actually moderate: up to 2 KB. Still, the problem is observed with standard parallelization settings (e.g., 64 workers on my machine). Probably the described deadlock scenario is bounded by 64 KiB divided by the number of workers (https://github.com/python/cpython/blob/a8dc6d6d44a141a8f839deb248a02148dcfb509e/Lib/multiprocessing/connection.py#L43)? Asking the library consumer to use file-based I/O even for a kilobyte of data seems like overkill.
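For a rough sense of scale (assuming the ~64 KiB pipe capacity mentioned above is indeed the limiting factor): 64 KiB / 64 workers = 1 KiB per worker, so a single 2 KB result can already exceed one worker's share of the buffer.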
2 KB is definitely too small to be the issue there, but it seems the issue in the linked ticket is not due to that. The user had a dual-socket workstation, so this might be a different matter. The side effect with the stuffed pipe is known, but I've never stumbled across issues with termination of the processes with small files.
In order to stop cancelled or timing-out tasks, the Pool Manager loop needs to acquire the channel lock to prevent any worker from accessing either the lock or the pipe during its termination. The previous logic would try to acquire the lock indefinitely. This led to possible deadlocks in case the Pool Manager was trying to stop cancelled tasks with the results channel filled up while the pool was terminated. This was due to the workers holding the channel lock while being unable to push the data, as the Message Manager loop was not reading it anymore.

The new Pool Manager loop does not try to acquire the lock indefinitely but rather polls it for availability. This allows the loop to continue assessing its state and correctly terminate upon request. The drawback of this implementation is that timing-out or cancelled tasks might be stopped a little later in case the pipe is busy transferring lots of data.

The new logic is not affected by those deadlocks anymore.
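A minimal sketch of the polling strategy described in the commit message above (illustrative only, not Pebble's actual implementation; `channel_lock`, `pool_stopping` and `stop_task` are hypothetical names):

```python
import threading

channel_lock = threading.Lock()    # stands in for the results-channel lock
pool_stopping = threading.Event()  # set when the pool is asked to terminate


def stop_task(task):
    """Hypothetical helper terminating a cancelled or timed-out task."""
    task.cancel()


def try_stop_task(task, poll_interval=0.1):
    # Poll the lock with a timeout instead of blocking on it forever,
    # so the manager loop can keep checking whether the pool is stopping.
    while not pool_stopping.is_set():
        if channel_lock.acquire(timeout=poll_interval):
            try:
                stop_task(task)
            finally:
                channel_lock.release()
            return True
    # The pool is shutting down: give up instead of deadlocking.
    return False
```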
Thanks for reporting this issue with proper means to reproduce it! I opted for an alternative solution and here is why:
The solution I opted for is not to attempt to acquire the channel locks indefinitely but rather to test their availability. This allows the Pool Manager loop to continue assessing the Pool's state and exit when needed. Since this change is in the critical path and Pebble has somewhat considerable adoption, I will need to run thorough load tests before releasing this fix. Hopefully it will be done by mid-March.
I was able to reduce it down to the following example:
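(Not the original reproducer, which is not preserved here, but a rough sketch of the scenario discussed in this thread: many tasks producing ~2 KB results, the pool stopped without consuming them, then joined.)

```python
import time

from pebble import ProcessPool


def worker(i):
    time.sleep(0.01)
    return b"x" * 2048  # roughly the message size mentioned above


if __name__ == "__main__":
    pool = ProcessPool(max_workers=64)
    futures = [pool.schedule(worker, args=[i]) for i in range(10000)]

    pool.stop()  # stop without collecting the results

    start = time.monotonic()
    pool.join()
    print("join() took %.1fs" % (time.monotonic() - start))
```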