fix: Stabilize serial handling and reduce event-loop pressure #176
EnTeQuAk wants to merge 1 commit into
I ran into stability issues with v1.x on my Home Assistant running the latest 2026.x release. With my FGW14-USB connected, the entire HA installation stopped responding. Even lowering log levels didn't help. So I dug deeper, and this PR contains what I found.

The root causes I found so far:

- high-volume serial telegrams were broadcast to every entity (O(N) fan-out)
- blocking calls on the event loop (`time.sleep`, `bus.join`)
- thread-unsafe calls from the serial thread
- listener/task leaks on reload

On HA 2026.x, the "Messages" unit also caused validation errors, which you already fixed in f3c4019, but that fix is not part of 1.x yet. This PR additionally removes `suggested_unit_of_measurement` (invalid when there's no device class) and leaves the unit unset to avoid HA 2026.x validation errors. See home-assistant/core#151912. I'm not sure if this is actually the right fix; happy to hear your opinion on that.

This PR focuses on stability first (no blocking, no thread-unsafe calls) and performance second (cut fan-out, reduce allocations).

**Thread safety fixes**

The serial thread was calling HA APIs directly, which can cause random lockups. HA has a single event loop thread, and calling into it from other threads without synchronization is asking for trouble. The fix is to use `call_soon_threadsafe` to hand off work to the event loop, which is the standard pattern for this.

For `send_message`, I switched from `async_dispatcher_send` to the sync `dispatcher_send`. Sync entity methods can end up running in executor threads, and `dispatcher_send` is explicitly documented as thread-safe, while `async_dispatcher_send` must only be called from the event loop. I was getting `RuntimeError`s on HA 2026.x before this change.

The reconnect and unload logic had a similar problem. `bus.join()` is blocking and was running directly on the event loop. Now it runs via `async_add_executor_job`, so HA stays responsive during reconnects. Same fix for cover tilt, which was using `time.sleep()` instead of `await asyncio.sleep()`.

See: https://developers.home-assistant.io/docs/asyncio_thread_safety/, https://developers.home-assistant.io/docs/asyncio_blocking_operations/

**Address-scoped dispatch**

This was probably the biggest win. Previously, every telegram triggered callbacks on all entities, and each entity then filtered by address. With lots of entities, that's O(N) work per telegram, which adds up fast.

The fix is to bake the source address into the event ID (`eltako.gw_X.receive_message.sid_00-00-00-XX`) and have each entity subscribe only to the addresses it cares about. The old broadcast-then-filter model is gone.

I had to normalize listen addresses before building event IDs. Raw bytes get wrapped into an `AddressExpression`-compatible tuple so thermostat IDs and similar cases don't break subscriptions.

Since state updates are now fully event-driven, I set `_attr_should_poll = False`. Polling was just adding unnecessary overhead.

**Lifecycle cleanup**

The cooling switch listener in `climate.py` was set up without storing the unsubscribe handle, which leaks listeners across reloads. I moved it to `async_added_to_hass` and registered it with `async_on_remove`. I also added `async_will_remove_from_hass` to properly cancel the periodic update task.

Same fix for `EventListenerInfoField` in `sensor.py`.

**Smaller fixes**

- Gateway stats: Throttled to once per second. No need to update "last message received" on every telegram.
- ID validation: FE/FF wireless sender IDs no longer trigger spurious warnings.
- HA 2026.x unit validation: Removed `suggested_unit_of_measurement` (invalid when there's no device class). `unit_of_measurement` is now unset as well. See home-assistant/core#151912.
- Hot-path cleanup: `_HANDLED_MSG_TYPES` is now a class-level tuple, `listen_to_addresses` is a set, removed `json.dumps` from the `binary_sensor` debug path, removed `.serialize().hex()` from gateway debug logs.
- Sender-ID validation: `validate_sender_id` now actually uses its argument.
b42bcb1 to 787a3a1
Hello @grimmpp, this PR contains some nice changes; it would be awesome if you could find the time to take a look. Also, you have been working on an updated v2 version of this integration, but I think that branch has been stale for a while now. Do you have any updates or plans on when you might find time to continue your awesome work here?

If it's any help, I can split this pull request up into smaller chunks. That said, all of those fixes together really made it click for my home installation. @el-mojito, did you have a chance to try it out?

I am a cautious person, so I am currently adopting smaller chunks of your changes manually to see if something breaks 😄 It also has a nice learning effect.

Nice, let me know how that goes. Happy to help if you run into any problems.
    self._connection_state_handler = None
    self._received_message_count_handler = None
    self._last_stats_update = 0.0  # monotonic timestamp for throttling
    self._stats_update_interval = 1.0  # seconds between stats updates to HA
Re-reading the pull request, I think this could be lowered to 0.5 to make the system a bit snappier. I'll try that out. That said, throttling updates to HA definitely helped my HA Yellow-based (Raspberry Pi) system cope with the traffic.
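For reference, the throttling idea behind these two attributes can be sketched roughly as follows. This is a standalone illustration, not the gateway code from the PR: `StatsThrottle`, `record` and the injectable `clock` parameter are hypothetical names, and the sketch uses `None` instead of `0.0` as the initial timestamp so that the first telegram always triggers an update.

```python
import time
from typing import Callable, Optional


class StatsThrottle:
    """Count every telegram, but report to HA at most once per interval."""

    def __init__(
        self, interval: float = 1.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._interval = interval  # seconds between stats updates
        self._clock = clock  # monotonic clock, injectable for tests
        self._last_update: Optional[float] = None
        self.count = 0

    def record(self) -> bool:
        """Register one telegram; return True if stats should be pushed now."""
        self.count += 1
        now = self._clock()
        if self._last_update is not None and now - self._last_update < self._interval:
            return False  # too soon, skip this update
        self._last_update = now
        return True
```

Lowering the interval to 0.5 would only change the `interval` argument; the counter itself stays exact because it is incremented on every telegram regardless of whether an update is pushed.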
Hey @grimmpp,
❗ This branch targets `feature-branch-v1`. I haven't checked the current `main` (v2) branch for similar issues, as my agenda was to get my own Eltako installation running first 🙈 If this proves helpful, I can do a similar rundown on `main` too.