fix: Stabilize serial handling and reduce event-loop pressure #176

Open
EnTeQuAk wants to merge 1 commit into grimmpp:feature-branch-v1 from EnTeQuAk:fix/stabilize-serial-handling
@EnTeQuAk EnTeQuAk commented Jan 19, 2026

Hey @grimmpp,

I ran into stability issues with v1.x on my Home Assistant running the latest 2026.x release. With my FGW14-USB connected, the entire HA installation stopped responding. Even lowering log levels didn't help. So I dug deeper, and this PR contains what I found.

The root causes I found so far:

  • high-volume serial telegrams were broadcast to every entity (O(N) fan-out)
  • blocking calls on the event loop (time.sleep, bus.join)
  • thread-unsafe calls from the serial thread
  • and listener/task leaks on reload

On HA 2026.x, the "Messages" unit also caused validation errors, which you already fixed in f3c4019 but it's not part of 1.x yet. This PR additionally removes suggested_unit_of_measurement (invalid when there’s no device class) and leaves the unit unset to avoid HA 2026.x validation errors. See home-assistant/core#151912. I'm not sure if this is actually the right fix, happy to hear your opinion on that.

This PR focuses on stability first (no blocking, no thread-unsafe calls) and performance second (cut fan-out, reduce allocations).

❗ This branch targets feature-branch-v1. I haven't checked the current main (v2) branch for similar issues, as my priority was to get my own Eltako installation running first 🙈 If this proves helpful, I can do a similar rundown on main too.

Thread safety fixes

The serial thread was calling HA APIs directly, which can cause random lockups. HA has a single event loop thread, and calling into it from other threads without synchronization is asking for trouble. The fix is to use call_soon_threadsafe to hand off work to the event loop, which is the standard pattern for this.
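As a rough illustration of that hand-off, here is a minimal sketch using only the standard library. The names `serial_reader` and `on_telegram` are hypothetical and not taken from the integration; the real code hands off to HA's event loop rather than one created by `asyncio.run`.

```python
import asyncio
import threading


def serial_reader(loop, on_telegram, telegrams) -> None:
    """Runs in a worker thread and never calls loop-owned code directly."""
    for telegram in telegrams:
        # Queue the callback on the event loop thread instead of calling it here.
        loop.call_soon_threadsafe(on_telegram, telegram)


async def main() -> list:
    loop = asyncio.get_running_loop()
    loop_thread_id = threading.get_ident()
    received = []
    done = asyncio.Event()

    def on_telegram(telegram: bytes) -> None:
        # Executes on the event loop thread, so touching loop-owned state is safe.
        received.append((telegram, threading.get_ident() == loop_thread_id))
        if len(received) == 2:
            done.set()

    worker = threading.Thread(
        target=serial_reader, args=(loop, on_telegram, [b"\x01", b"\x02"])
    )
    worker.start()
    await asyncio.wait_for(done.wait(), timeout=5)
    # Join in the executor so even the cleanup does not block the loop.
    await loop.run_in_executor(None, worker.join)
    return received


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each tuple records whether the callback ran on the loop thread, which is the property the fix relies on.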

For send_message, I switched from async_dispatcher_send to the sync dispatcher_send. Sync entity methods can end up running in executor threads, and dispatcher_send is explicitly documented as thread-safe while async_dispatcher_send must only be called from the event loop. I was getting RuntimeErrors on HA 2026.x before this change.

The reconnect and unload logic had a similar problem. bus.join() is blocking and was running directly on the event loop. Now it runs via async_add_executor_job so HA stays responsive during reconnects. Same fix for cover tilt, which was using time.sleep() instead of await asyncio.sleep().
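To show the idea in isolation, the sketch below uses plain asyncio: `loop.run_in_executor` stands in for HA's `async_add_executor_job`, and `blocking_join` is a hypothetical stand-in for `bus.join()`. A heartbeat task keeps ticking while the join runs, which is what "HA stays responsive" means in practice.

```python
import asyncio
import threading
import time


def blocking_join(worker: threading.Thread) -> None:
    """Stand-in for bus.join(): blocks the calling thread until the worker exits."""
    worker.join()


async def heartbeat(ticks: list) -> None:
    # Appends a timestamp every 10 ms; it stalls whenever something blocks the loop.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main() -> int:
    loop = asyncio.get_running_loop()
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))

    worker = threading.Thread(target=time.sleep, args=(0.2,))
    worker.start()
    # Run the blocking join in the default executor so the loop keeps running.
    await loop.run_in_executor(None, blocking_join, worker)
    # Non-blocking pause, the asyncio counterpart of time.sleep().
    await asyncio.sleep(0.05)

    beat.cancel()
    await asyncio.gather(beat, return_exceptions=True)
    return len(ticks)


if __name__ == "__main__":
    print("heartbeat ticks while joining:", asyncio.run(main()))
```

Calling `worker.join()` or `time.sleep()` directly inside `main()` would instead freeze the heartbeat for the full duration.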

See: https://developers.home-assistant.io/docs/asyncio_thread_safety/, https://developers.home-assistant.io/docs/asyncio_blocking_operations/

Address-scoped dispatch

This was probably the biggest win. Previously, every telegram triggered callbacks on all entities, and each entity was then filtered by address. With lots of entities, that's O(N) work per telegram, which adds up fast.

The fix is to bake the source address into the event ID (eltako.gw_X.receive_message.sid_00-00-00-XX) and have each entity subscribe only to the addresses it cares about. The old broadcast-then-filter model is gone.
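A toy version of that scheme might look like the sketch below. `receive_event_id` and `ScopedDispatcher` are hypothetical names, and the exact formatting of the gateway ID and address (integer gateway ID, upper-case hex bytes joined by dashes) is an assumption based on the event ID pattern quoted above. The integration itself uses HA's dispatcher rather than a hand-rolled one.

```python
from collections import defaultdict
from typing import Callable

Listener = Callable[[bytes], None]


def receive_event_id(gateway_id: int, source: bytes) -> str:
    """Build an address-scoped event ID (format details are assumed)."""
    sid = "-".join(f"{b:02X}" for b in source)
    return f"eltako.gw_{gateway_id}.receive_message.sid_{sid}"


class ScopedDispatcher:
    """Toy dispatcher: one listener list per event ID, so sending is a dict lookup."""

    def __init__(self) -> None:
        self._listeners = defaultdict(list)

    def connect(self, event_id: str, listener: Listener) -> Callable[[], None]:
        self._listeners[event_id].append(listener)
        # Return an unsubscribe handle so callers can clean up on removal.
        return lambda: self._listeners[event_id].remove(listener)

    def send(self, event_id: str, payload: bytes) -> int:
        # Only listeners for this exact address run; other entities are never woken.
        listeners = list(self._listeners.get(event_id, []))
        for listener in listeners:
            listener(payload)
        return len(listeners)


if __name__ == "__main__":
    dispatcher = ScopedDispatcher()
    event_id = receive_event_id(1, bytes([0x00, 0x00, 0x00, 0x2A]))
    dispatcher.connect(event_id, lambda payload: print("got", payload.hex()))
    dispatcher.send(event_id, b"\x70")
```

With this shape, the per-telegram cost depends on how many entities listen to that one address rather than on the total number of entities.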

I had to normalize listen addresses before building event IDs. Raw bytes get wrapped into an AddressExpression-compatible tuple so thermostat IDs and similar cases don't break subscriptions.

Since state updates are now fully event-driven, I set _attr_should_poll = False. Polling was just adding unnecessary overhead.

Lifecycle cleanup

The cooling switch listener in climate.py was set up without storing the unsubscribe handle, which leaks listeners across reloads. I moved it to async_added_to_hass and registered it with async_on_remove. Also added async_will_remove_from_hass to properly cancel the periodic update task.

Same fix for EventListenerInfoField in sensor.py.
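The leak and its fix can be sketched without Home Assistant. `FakeEntity` below is a hypothetical stand-in that models only the cleanup bookkeeping behind `async_on_remove`; it is not HA's `Entity` class, and `subscribe` is an illustrative helper.

```python
from typing import Callable

Unsubscribe = Callable[[], None]


def subscribe(listeners: list, callback: Callable) -> Unsubscribe:
    """Register a callback and return a handle that removes it again."""
    listeners.append(callback)
    return lambda: listeners.remove(callback)


class FakeEntity:
    """Minimal stand-in for an entity; only tracks cleanup callbacks."""

    def __init__(self) -> None:
        self._on_remove = []

    def async_on_remove(self, func: Unsubscribe) -> None:
        # Remember a cleanup callback to run when the entity is removed.
        self._on_remove.append(func)

    def remove(self) -> None:
        # Run and discard every registered cleanup callback.
        while self._on_remove:
            self._on_remove.pop()()


def reload_entity(listeners: list, keep_handle: bool) -> None:
    """Simulate one add/remove cycle, with or without storing the handle."""
    entity = FakeEntity()
    unsubscribe = subscribe(listeners, lambda event: None)
    if keep_handle:
        entity.async_on_remove(unsubscribe)
    entity.remove()


if __name__ == "__main__":
    leaky, clean = [], []
    for _ in range(3):
        reload_entity(leaky, keep_handle=False)
        reload_entity(clean, keep_handle=True)
    print(len(leaky), len(clean))
```

Dropping the handle leaves one stale listener behind per reload, while registering it for removal keeps the listener list empty.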

Smaller fixes

  • Gateway stats: Throttled to once per second. No need to update "last message received" on every telegram.
  • ID validation: FE/FF wireless sender IDs no longer trigger spurious warnings.
  • HA 2026.x unit validation: Removed suggested_unit_of_measurement (invalid when there's no device class). unit_of_measurement is now unset as well. See home-assistant/core#151912 ("Fix _is_valid_suggested_unit in sensor platform").
  • Hot-path cleanup: _HANDLED_MSG_TYPES is now a class-level tuple, listen_to_addresses is a set, removed json.dumps from the binary_sensor debug path, removed .serialize().hex() from gateway debug logs.
  • Sender-ID validation: validate_sender_id now actually uses its argument.
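For the gateway stats item above, a throttle based on `time.monotonic` could look roughly like this. `StatsThrottle` is a hypothetical helper, not the code in this PR, and it starts from negative infinity (rather than 0.0) so the first update always passes regardless of the clock's starting value.

```python
import time


class StatsThrottle:
    """Allow at most one update per interval, measured on a monotonic clock."""

    def __init__(self, interval: float = 1.0, clock=time.monotonic) -> None:
        self._interval = interval
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_update = float("-inf")  # first call always passes

    def should_update(self) -> bool:
        now = self._clock()
        if now - self._last_update < self._interval:
            return False  # too soon, skip this update
        self._last_update = now
        return True


if __name__ == "__main__":
    throttle = StatsThrottle(interval=1.0)
    pushed = sum(throttle.should_update() for _ in range(1000))
    print("updates pushed for 1000 telegrams:", pushed)
```

The interval is a constructor argument, so trying a lower value such as 0.5 seconds is a one-line change.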

@EnTeQuAk force-pushed the fix/stabilize-serial-handling branch from b42bcb1 to 787a3a1 on January 19, 2026 22:35
@el-mojito
Contributor

Hello @grimmpp, this PR contains some nice changes. It would be awesome if you could find the time to take a look. Also, you have been working on an updated v2 version of this integration, but I think that branch has been stale for a while now. Do you have any updates or plans on when you might find time to continue your awesome work here?

@EnTeQuAk
Author

If it's any help, I can split this pull request up into smaller chunks. Though, all of those fixes together really made it click for my home-installation. @el-mojito did you have a chance to try it out?

@el-mojito
Contributor

> If it's any help, I can split this pull request up into smaller chunks. Though, all of those fixes together really made it click for my home-installation. @el-mojito did you have a chance to try it out?

I am a cautious person, so I am currently adopting smaller chunks of your changes manually to see if something breaks 😄 It also has a nice learning effect.

@EnTeQuAk
Author

Nice, let me know how that goes. Happy to help if you're running into any problems.

self._connection_state_handler = None
self._received_message_count_handler = None
self._last_stats_update = 0.0 # monotonic timestamp for throttling
self._stats_update_interval = 1.0 # seconds between stats updates to HA
Author

Re-reading the pull request, I think this could be lowered to 0.5 to make the system a bit snappier. I'll try that out. That said, throttling updates to HA definitely helped my HASS Yellow-based (Raspi) system cope with the traffic.
