[FEATURE] Improve reconnect handling #35
Just to add a little more context to this: here is an example of it behaving more or less as expected, with one exception that I mention later. What we are looking at in these logs is intermittent outages of network connectivity or API endpoint availability over an 18-hour period.
At 21:07 connectivity seems restored and the connection and subscription to the websocket is re-established. What I find a bit strange, however, is that the retry count does not reset between "incidents" (notice that the same thing happens at around 15:26 earlier in the day, and that my simple GitOps controller also had issues reaching GitHub at the same time). This means that a retry count set to int(3), for example, would allow 3 network/API outages over any given range of time, whether that be 3 days or 3 years. Would it make for more expected handling if, when the retry count is not exceeded and the connection is restored successfully, the retry counter gets reset to zero?
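For illustration, here is a minimal sketch of the suggested behaviour, where only consecutive failures count toward the limit. This assumes a generic connect-and-retry loop, not tibber.py's actual internals; `connect` is a hypothetical stand-in for the subscription call.

```python
import time

def run_with_retries(connect, max_retries=3, retry_interval=30.0):
    """Keep `connect()` running; only consecutive failures count toward the limit."""
    failures = 0
    while True:
        try:
            connect()       # blocks for as long as the websocket stays healthy
            failures = 0    # connection was (re)established -> reset the counter
        except ConnectionError:
            failures += 1
            if failures > max_retries:
                raise       # give up only after max_retries failures in a row
            time.sleep(retry_interval)
```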
Hi, sorry for the late response. I've been a bit busy the past few days and will continue to be quite busy. I'll try to have a look as soon as I can 😄
I've tested a bit and I think I'm just going to remove the backoff library and write the retry mechanism myself. Hopefully, that should help! I've created a branch that I've attached to the issue. I'll probably have more time to look at this over the weekend. I've also added an "on_exception" argument (a rough sketch of how that could look follows below).
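For context, a hand-rolled retry loop with an `on_exception` hook might look roughly like this. The callback name is taken from the comment above; everything else here is an assumption, not the actual branch code.

```python
import time

def retry_loop(subscribe, retries, retry_interval, on_exception=None):
    """Call `subscribe()` and retry up to `retries` times on failure."""
    attempt = 0
    while True:
        try:
            return subscribe()
        except Exception as exc:
            attempt += 1
            if on_exception is not None:
                on_exception(exc)   # give the caller a chance to log or react
            if attempt > retries:
                raise               # out of retries: re-raise the last error
            time.sleep(retry_interval)
```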
In case it's useful info: Just caught another failure to reconnect after what appears to be a small Tibber outage... at ~22:00:
Then at around 08:00 in the morning I checked and it still had not reconnected, and had not complained with any more error messages either. I restarted the "service" and all was fine. I would expect it to continue retrying until Tibber was back up and then carry on, because of the ridiculously large number in the parameters when I invoke it, which is:
Perhaps this specific error message is helpful...
I've finally implemented a manual retry counter. @JoshuaDodds, could you test this version out and see if it helps with the reconnecting issue? The update is on the feature/exception-handling branch, which is attached to this issue. Note that I haven't touched the retry mechanism for querying data from the Tibber API.
Nice! Will test this soon and let you know... it's been really quite stable for more than a week now... Waiting for the next failure...
You are referencing the error I posted from when the Tibber API was down for some hours? If so... I only got two of these messages in the logs and after that nothing, but it never reconnected when the API was back. Looking at the code you referenced, I see:
Which seems to indicate that, on a failed attempt to query the Tibber API, it will retry once and then not again (with default settings). That would also explain why I only received two messages in the logs and why it never reconnected, right?
Yes, that's almost true. I actually named the variable … Note that the … I thought it would make sense that it only tries once by default, because normally when you request data from a webservice, you either get a successful response or an error. As for websockets, it makes sense to retry during outages, because it's a continuous connection that should be kept up.
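In other words, the two retry settings serve different purposes. The snippet below is illustrative only; the names and values are assumptions, not the library's documented API.

```python
QUERY_RETRIES = 1          # a failed HTTP query is retried once and then raised
WEBSOCKET_RETRIES = 10**9  # the live feed should effectively retry "forever"

def query_with_retries(fetch, retries=QUERY_RETRIES):
    """One-shot style: retry a query a small, fixed number of times."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries:
                raise  # a query either succeeds or fails; don't loop forever
```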
When running live feed stats for more than a day (usually between 16 and 36 hours), I am seeing a state where no reconnect is possible and a restart of the live feed is the only thing that will fix the issue shown in the logs below.
I am not sure if this is an issue with an underlying library (gql or another) or something that could be improved in tibber.py, but an ideal scenario would be that these sorts of frame and transport exceptions are handled and a full teardown and reconnect is done on the live feed thread.
Alternatively, if you could recommend the best way I could catch such errors and initiate a reconnect myself, that would also be helpful, but I feel it would improve tibber.py if such handling were built in.
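Until such handling exists in tibber.py, one way a caller could approximate it is to tear the feed down completely and rebuild it whenever the transport dies. This is only a sketch based on the exception types visible in the logs below; `start_live_feed` and `stop_live_feed` are hypothetical stand-ins for whatever the calling script actually does.

```python
import time

from gql.transport.exceptions import TransportClosed
from websockets.exceptions import ConnectionClosedError

def run_live_feed_forever(start_live_feed, stop_live_feed, retry_interval=30):
    while True:
        try:
            start_live_feed()   # blocks while the subscription is healthy
        except (ConnectionClosedError, TransportClosed):
            stop_live_feed()    # full teardown instead of reusing the dead transport
            time.sleep(retry_interval)
```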
Here are some logs of the issue as it occurs. In this scenario I have an extremely high retry integer set, with a 30-second retry interval:
2023-01-23 10:47:10 cerbomoticzGx: Backing off run_websocket_loop(...) for 0.1s (websockets.exceptions.ConnectionClosedError: no close frame received or sent)
2023-01-23 10:47:10 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 0.1 seconds after 1 tries.
2023-01-23 10:47:10 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 10:47:11 cerbomoticzGx: Subscribing to websocket.
2023-01-23 10:47:11 cerbomoticzGx: Backing off run_websocket_loop(...) for 1.7s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 10:47:11 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 1.7 seconds after 2 tries.
2023-01-23 10:47:13 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 10:47:14 cerbomoticzGx: Backing off run_websocket_loop(...) for 2.0s (TypeError: catching classes that do not inherit from BaseException is not allowed)
2023-01-23 10:47:14 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 2.0 seconds after 3 tries.
(and many hours later, after continuing with the retries, still no success... the issue was immediately resolved by restarting the script which starts the live feed and imports tibber.py...)
2023-01-23 13:25:50 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 13:25:51 cerbomoticzGx: Subscribing to websocket.
2023-01-23 13:25:51 cerbomoticzGx: Backing off run_websocket_loop(...) for 84.8s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 13:25:51 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 84.8 seconds after 187 tries.
2023-01-23 13:27:16 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 13:27:16 cerbomoticzGx: Subscribing to websocket.
2023-01-23 13:27:17 cerbomoticzGx: Backing off run_websocket_loop(...) for 98.6s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 13:27:17 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 98.6 seconds after 188 tries.
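As a side note, the TypeError in the first log excerpt ("catching classes that do not inherit from BaseException is not allowed") is what Python raises whenever something that is not an exception class ends up being matched in an `except` clause, for example via the exception argument of a backoff decorator:

```python
# Minimal reproduction: Python only accepts BaseException subclasses in an
# `except` clause, so a non-exception object fails at the moment a raised
# exception is matched against it.
try:
    raise ValueError("boom")
except str:  # str is not a BaseException subclass
    pass
# -> TypeError: catching classes that do not inherit from BaseException is not allowed
```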