apparent deadlock in Python bindings for go-jsonnet #484
Comments
Thanks a lot for reporting this. The folks at https://github.com/deepmind/kapitan reported similar problems, and it's the primary reason why they're still on cpp-jsonnet. Having these additional details is really helpful. We fixed one issue in 0.17.0 that had a small potential to fix this, but apparently that wasn't it. If you could get some info about what the hung program is doing exactly, that would be great; for example, you can try attaching a debugger and grabbing a backtrace. You can also try the newest version. We had one change that has a nonzero chance of affecting this: #478. Please let us know if you can think of any other relevant details. If you're at any point able to create a self-contained repro (even if it's big and complicated), that would of course be super helpful.
Attached gdb and printed a backtrace - it looks like a call to Go's malloc triggers the deadlock?
Also, I did try building the go-jsonnet Python bindings from the latest GitHub commit - the deadlock was still there.
I'm attaching a "relatively simple" example that reproduces the issue for me. It's still a bit of an abomination (blame Airflow and its poor testing strategy), but at least it's a starting point. The README file should explain the details; you'll need both Docker and Docker Compose to run this. Follow-ups:
There seems to be something different about the context in which Airflow executes the task, and it does not appear to play nicely with the way gojsonnet invokes the Go runtime.
That kapitan thread inspired me to take another look - here's a vastly simplified (no dependencies) test case that exercises the bug. Python multiprocessing has three ways to start a new process (see https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods): "spawn" and "forkserver" both work fine, while "fork" causes the deadlock (I'm on Linux, by the way).
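For reference, here is a minimal sketch of what such a test case can look like. It is an illustration, not the commenter's attached script; the snippet, timeout, and the use of `evaluate_snippet` (mentioned later in the thread) are assumptions.

```python
# Minimal sketch (illustrative, Linux-oriented): evaluate a jsonnet snippet in a
# child process created with each of the three multiprocessing start methods.
# "fork" is the one reported to hang; "spawn" and "forkserver" complete normally.
import multiprocessing as mp

import _gojsonnet  # imported in the parent, before any child process exists


def evaluate_in_child():
    # Under "fork", the child inherits a copy of the already-initialized Go
    # runtime but only the forking thread, which is where the hang shows up.
    print(_gojsonnet.evaluate_snippet("repro", "{ x: 1 + 2 }"))


if __name__ == "__main__":
    for method in ("spawn", "forkserver", "fork"):
        ctx = mp.get_context(method)
        child = ctx.Process(target=evaluate_in_child)
        child.start()
        child.join(timeout=10)
        print(method, "->", "hung" if child.is_alive() else "finished")
        if child.is_alive():
            child.terminate()
```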
Appears suspiciously similar to this: golang/go#15538. It sounds like it might just be a fundamental issue with the Go runtime, since it's multithreaded. Even the Python documentation linked above states "Note that safely forking a multithreaded process is problematic." Unfortunately, Python uses "fork" as the default for Unix, so it's more likely people will run into this.
It's not a general solution, but I was able to fix the issue for me by just deferring all of my "import _gojsonnet" calls until after the fork() occurs (however, this "fix" won't work if there is any nested fork() pattern in one's program). There may not be any good general solution to this. In theory, if you could reliably create and destroy the Go runtime multiple times in the same program (and you were okay with the overhead), then you could wrap all _gojsonnet calls with create/destroy. I'm not an expert on Go's runtime, but I really doubt that's possible.
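A sketch of that deferred-import workaround, assuming the default "fork" start method on Linux; the file names and pool size are illustrative, not the commenter's actual code.

```python
import multiprocessing as mp


def render(path):
    # The import happens inside the forked worker, so the Go runtime is
    # initialized after the fork rather than copied across it.
    import _gojsonnet
    return _gojsonnet.evaluate_file(path)


if __name__ == "__main__":
    # The parent never imports _gojsonnet, so forking it is safe. As noted
    # above, this breaks down if a worker forks again after the import.
    with mp.Pool(processes=2) as pool:
        outputs = pool.map(render, ["a.jsonnet", "b.jsonnet"])
        print([len(o) for o in outputs])
```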
I agree with previous comments.
Thanks a lot for investigating. This explanation sounds right, but it's very unfortunate. Maybe we can mitigate the issue by playing with the gcstoptheworld setting (disabling the concurrency completely). Either way, we should warn about the issue with fork + the Go library. In the longer term, we could think about solving this by spawning a separate process for Jsonnet evaluation in our wrapper, with only the new process actually importing the Go library. That has its own problems and limitations, though.
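Purely for illustration, the separate-process idea could look roughly like the sketch below in user code today. This is not how the wrapper is implemented; the helper names are hypothetical.

```python
import multiprocessing as mp


def _evaluate(path, conn):
    # Only this short-lived child ever imports the Go shared library, so the
    # parent process stays free of Go runtime threads and remains fork-safe.
    import _gojsonnet
    conn.send(_gojsonnet.evaluate_file(path))
    conn.close()


def evaluate_file_in_subprocess(path):
    ctx = mp.get_context("spawn")  # never fork a process holding the Go runtime
    parent_conn, child_conn = ctx.Pipe()
    child = ctx.Process(target=_evaluate, args=(path, child_conn))
    child.start()
    result = parent_conn.recv()
    child.join()
    return result
```

The per-call process startup cost and the awkwardness of passing things like import callbacks across the process boundary are examples of the problems and limitations mentioned above.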
I experienced the very same issue when using the gojsonnet Python bindings together with the multiprocessing Pool.map() function. In my case, whenever I used evaluate_file or evaluate_snippet in a forked process, there would be a deadlock unrelated to jsonnet.
I worked around the issue by making sure that jsonnet is not called in a forked process. The pool is used to parallelize HTTP requests (threading in Python is unfortunately a dead end); if jsonnet needs to be used on the results, it has to be delegated until the results of the spawned processes have been collected, which is acceptable. Another workaround that worked was to switch the multiprocessing start method at the start of the program. That avoided the deadlock, but it was not acceptable for me, as the spawned processes do not share the same file descriptors. Since I develop a command-line tool that outputs logging data, logging from spawned processes was no longer available.
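The two settings referenced in that comment were lost in this rendering. Given the rest of the thread (spawned processes, no shared file descriptors), they were most likely the standard calls for changing the multiprocessing start method, along these lines:

```python
import multiprocessing as mp

# One of these, called once at program start, makes worker processes start as
# fresh interpreters instead of forking a parent that already holds the Go runtime.
mp.set_start_method("spawn")        # or: mp.set_start_method("forkserver")
```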
First off - apologies for the lack of any "simple reproduction" of this insidious bug - it only occurs under a specific set of (rather complex) circumstances. I will do my best to describe them and what I've observed - perhaps that will be enough to inspire a possible cause/solution.
Issue: I observe what appears to be a deadlock when calling _gojsonnet.evaluate_file() under a certain set of conditions. There is 0% CPU usage and the process appears to be hanging on IO somewhere in the call to evaluate_file().
My observations so far:
This bug is deterministic. When the "conditions" are met, it seems to happen 100% of the time.
It only affects the gojsonnet bindings (v0.16 and v0.17 ... possibly earlier too, but these are the only ones I've been able to easily install). In my testing it does not affect the C++ jsonnet bindings.
I've only been able to trigger it when calling via PythonOperator on Apache Airflow (using LocalExecutor); see the sketch after this list. In my setup, Airflow is running as a Docker container (hopefully not relevant, but worth mentioning). I briefly dug into the PythonOperator code and it appears to launch a new Python subprocess, in case that's important. I have not been able to trigger it by calling evaluate_file() manually from the Python REPL (verified both locally and inside my Docker container environment).
It seems to only happen when the generated JSON is rather large (~5 MB in my case). If I shrink the generated JSON to a few KB, the issue goes away. I don't know exactly where this cutoff is, but the deterministic nature of the bug strongly suggests there is a hard cutoff somewhere.
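To make the conditions above concrete, a DAG of roughly this shape is what is being described. Everything here is hypothetical: the file path, DAG parameters, and import paths (which differ between Airflow versions) are illustrative, and this is not the reporter's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2-style import

import _gojsonnet  # imported at DAG-definition time, in whichever process parses this file


def render_config():
    # The reported hang happens inside a call like this one, with a large
    # (~5 MB) JSON output, when run by a PythonOperator under LocalExecutor.
    return _gojsonnet.evaluate_file("/path/to/big_config.jsonnet")


with DAG("jsonnet_repro", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="render", python_callable=render_config)
```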
Please let me know what you think or if you have any questions - I don't have enough knowledge of how Python's C API works to really debug this further on my own. There are unfortunately lots of moving parts here, but I filed this as an issue in go-jsonnet given that the C++ bindings do not exhibit the problem.