Polling resources get forever stuck after a network disconnect #876
Replies: 11 comments 8 replies
Ah, wait, that is not the latest.
Oh, very interesting: I was going to ctrl+c before going to sleep, but mgmt just hung and never exited. I had to send SIGKILL to make it stop.
Thank you for reporting this. Please try with only one vm. If it hangs, press ^\ (see: https://purpleidea.com/blog/2016/02/15/debugging-golang-programs/) and paste the full trace somewhere. Please make sure you're on git master too. It is possible that this is an issue with the specific resource not handling a context... When we find those, we should absolutely fix them. I don't see anything overly obvious in the hetzner code, but I also didn't write it. It could also do with a bump in the library, so if you want to go get -u the deps for it, I am happy to merge those changes. Lastly, there could in theory be a deadlock in the engine related to polling, or another bug, but if so, I think we'd find it with a trace. I actually have a pending "be extra extra careful to not deadlock" patch queued, but I don't think it's even required. I should hopefully merge it anyway this month. Thank you for playing with all this!
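For anyone unfamiliar with what such a trace contains: pressing ^\ sends SIGQUIT, and a Go program by default responds by printing the stack of every goroutine and exiting. A rough way to see similar output from inside a toy program is the standard runtime/pprof goroutine profile. This is only an illustrative sketch; goroutineDump is a made-up helper name and nothing here is taken from mgmt.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump is a hypothetical helper that returns the stacks of all
// goroutines as text, using the standard "goroutine" profile.
func goroutineDump() string {
	var buf bytes.Buffer
	// debug=2 prints every goroutine's stack in the panic-style text format.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return fmt.Sprintf("could not write goroutine profile: %v", err)
	}
	return buf.String()
}

func main() {
	block := make(chan struct{})
	go func() { <-block }() // a deliberately stuck goroutine to look for in the dump
	fmt.Print(goroutineDump())
}
```

In a dump like this, a goroutine that has been sitting in a network read or a channel receive for a long time is usually the first place to look.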
Here is that branch if you want to test it. I haven't tested it at all yet. https://github.com/purpleidea/mgmt/tree/feat/more-context-changes
Got the same repro on With one vm only, still on
Will do in a spare moment and re-test that everything works (regardless of the bug here). I see there's a v2, but all the methods used in the resource seem compatible at a glance.
I've made a branch with the hetzner deps bumped (https://github.com/purpleidea/mgmt/tree/feat/bump-hetzner). If you're interested in trying that, lmk if you can repro there before I dig too deeply into the trace.
I would also need the full, exact mcl code (you can remove a password string and replace it with "hunter2" in your paste) and the full cli of how you ran it. Testing on https://github.com/purpleidea/mgmt/tree/feat/bump-hetzner is preferred, just to avoid me tracking down a hetzner bug that's already been fixed ;)
Sidenote/another thing I've observed: A couple of times upon initial machine creation I got: I only got that when creating a fresh machine, not when it already exists. If it exists, we go into polling correctly, so I assume we're just hitting some slowness on Hetzner's end.
Reproed the issue on the branch with bumped hetzner. My MCL: And coredump: What is interesting is once I get it into this state, I can do
In this trace: It's not clear that you pressed ^C... Did you?
Okay, good news, bad news! After digging through the trace, here is the offending part: The bad part is here: So a bug in hetzner that's not listening to the ctx... I assume, I didn't dig into it... I started, but then I noticed: It turns out that when I bumped to the latest version, I didn't realize it was the latest 1.x version :/ I've now patched this in: https://github.com/purpleidea/mgmt/tree/feat/hetzner-v2 If it passes the tests I'll merge, but please do test and let me know if that fixes the issue. If not, and it doesn't shut down when you press ^C, do another trace and I'll read it again. Thanks for your patience and for your very helpful reporting.
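To illustrate what "not listening to the ctx" means in general, here is a minimal sketch of a caller guarding itself against a blocking call that ignores cancellation. All names (callWithContext, stuckCall, runShutdownDemo) are hypothetical and this is not the hetzner library's code or the fix in the branch above, which bumps the dependency instead. Note that the wrapped call keeps running in the background until it returns on its own, so this limits the damage rather than fixing the underlying bug.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callWithContext runs fn in its own goroutine and returns as soon as
// either fn finishes or ctx is cancelled, whichever happens first.
func callWithContext(ctx context.Context, fn func() error) error {
	done := make(chan error, 1) // buffered so the goroutine can always send and exit
	go func() { done <- fn() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// stuckCall is a hypothetical stand-in for a library call that ignores
// cancellation and blocks for a long time.
func stuckCall() error {
	time.Sleep(2 * time.Second)
	return nil
}

// runShutdownDemo cancels the context after 50ms, roughly simulating ^C
// arriving while the call is still blocked.
func runShutdownDemo() error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(50*time.Millisecond, cancel)
	return callWithContext(ctx, stuckCall)
}

func main() {
	start := time.Now()
	err := runShutdownDemo()
	fmt.Printf("returned after %v with: %v\n", time.Since(start).Round(10*time.Millisecond), err)
}
```

Without the select on ctx.Done(), the caller would sit in stuckCall for the full two seconds (or forever, for a truly hung request), which matches the symptom of ^C not shutting things down.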
When I create a poll-based resource and it runs into a problem reaching the API it relies on, it stops reconciling afterwards and never recovers.
I first observed this with some API errors on the provider's end, but I could soon reproduce it by turning my network adapter off and on.
I suspect a context isn't passing a timeout somewhere, but I didn't get to dig into the code to find whether it's an issue with the resource or polling itself.
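To make that suspicion concrete, here is a minimal sketch of the pattern I would expect: a polling loop where every attempt gets its own timeout, so one stalled request cannot wedge the loop. This is not mgmt's engine or the hetzner resource code; poll, checkFunc, hangingCheck and runDemo are hypothetical names, and the intervals are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkFunc is a hypothetical stand-in for one API call made by a resource.
type checkFunc func(ctx context.Context) error

// poll calls check once per interval until ctx is cancelled. Each attempt
// gets its own timeout so a hung request cannot block the loop forever.
func poll(ctx context.Context, interval, timeout time.Duration, check checkFunc) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, timeout)
		err := check(attemptCtx)
		cancel()
		if err != nil {
			// A failed attempt is logged and retried on the next tick.
			fmt.Printf("check failed, will retry: %v\n", err)
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// hangingCheck simulates a request that stalls during a network outage
// but does honor its context.
func hangingCheck(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// runDemo polls hangingCheck for 200ms and reports how poll exited.
func runDemo() error {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return poll(ctx, 20*time.Millisecond, 30*time.Millisecond, hangingCheck)
}

func main() {
	err := runDemo()
	fmt.Println("poll exited:", err, errors.Is(err, context.DeadlineExceeded))
}
```

The per-attempt timeout only helps if the underlying call actually respects its context; if it doesn't, the loop still hangs inside check, which would look the same from the outside.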
Reproduction:
Tested on latest main branch, built my binary off of 17082d012f60dd2e7839476690227e83310f0ecf