Skip to content

INTERACT=1 deadlocks device tests when ButtonRequest.pages > 1 #6950

@andrewkozlik

Description

@andrewkozlik

Summary

Running device tests with INTERACT=1 hangs in any signing flow whose ButtonRequest reports more than one page. After the user confirms on the device, the device cleanly moves past the BR and is left waiting for the next protocol message; the host stays stuck inside _callback_button and never sends the next TxAck. Reproduces on T2T1 — not THP- or Eckhart-specific.

Reproduction

INTERACT=1 pytest tests/device_tests/bitcoin/test_signtx.py -k test_one_one_fee

Confirm each prompt on the device. The test hangs after the last user confirmation. Last device-side log:

trezor.ui DBG UI task exited by itself: <generator object 'br_task' ...>
trezor.loop DBG close spawned task: <generator object '_waiting_screen' ...>
trezor.wire.codec.codec_context DBG [USBIF] write: TxRequest
trezor.wire.codec.codec_context DBG [USBIF] expect: TxAckInput

The device has acknowledged the user's confirmation and is waiting on TxAckInput. The host is the side that's stuck.

Root cause

Under INTERACT=1, conftest.py sets auto_interact=FalseDebugLink.allow_interactions = False.

_callback_button (client.py:555) writes ButtonAck, then runs the UI callback before reading the next message:

session.write(messages.ButtonAck())
self.app._callback_button(msg)   # ← blocks here
return session.read()

The UI callback flows into DebugUI.default_input_flow_paginate_and_confirm(br.pages) (debuglink.py:1070, :1184). That issues pages debug decisions (swipe/click + press_yes). Each _decision() under allow_interactions=False does (debuglink.py:810-812):

if not self.allow_interactions:
    self.wait_layout()
    return

wait_layout() blocks on NEXT_LAYOUT. After the user confirms on-device:

  1. The first wait_layout() returns when the device transitions through the BR.
  2. The remaining wait_layout() calls block — the device is now on a non-interactive screen (_waiting_screen) and will produce no further layout changes until the host sends the next TxAck.
  3. The host can't send TxAck because it's still inside _callback_button.

Deadlock: host waits for a layout change, device waits for a protocol message.

When this broke

dd496c1a5d ("feat(core): unify RustLayout, implement single global layout") changed _decision for allow_interactions=False from return None (no-op) to return self.wait_layout(). Before that change, the pages-many decisions issued by the input flow were all no-ops under INTERACT and the host did not block on layout transitions at all. The synchronization point was implicitly the device's next protocol message, which is exactly the right point.

Local hotfix

To unblock myself I short-circuited default_input_flow when allow_interactions=False:

def default_input_flow(self, on_page=None):
    while True:
        br = yield
        if not self.debuglink.allow_interactions:
            continue
        ...

This restores pre-dd496c1a5d behavior on this path, but:

  • It throws away the host/device synchronization point that dd496c1a5d introduced (the single wait_layout() per decision).
  • It disables on-host PIN entry on T1 in INTERACT mode (since the PinEntry branch is also skipped).

So it's a workaround, not a fix.

Suggested clean fix

The fundamental issue is that under INTERACT the host shouldn't try to anticipate how many layout transitions the user will produce — br.pages is the host's idea of how many auto-clicks would be needed, not how many transitions a human will cause. Two plausible directions, in increasing order of niceness:

  1. Collapse the input flow to a single sync point under allow_interactions=False. Instead of running _paginate_and_confirm, do exactly one wait_layout(wait_for_external_change=True) per BR (and keep the PinEntry branch). The single wait_layout call preserves the sync intent of dd496c1a5d without overcommitting on br.pages.

  2. Remove wait_layout() from _decision under allow_interactions=False and rely on the protocol-level read in _callback_button (session.read() after the UI callback) as the natural sync point. This is the cleanest in terms of layering — debug-link stops trying to gate the protocol — but it requires confirming that no other caller of _decision depends on the layout having advanced before returning.

(1) is the smaller, safer change. (2) is what the pre-dd496c1a5d design effectively did.

cc @matejcik

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working as expectedtestsAutomated integration tests

    Type

    No fields configured for Bug.

    Projects

    Status

    🎯 To do

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions