
Conversation


@yaugenst-flex commented Jan 28, 2026

Summary

  • add configurable grace period to tolerate transient error states during web.monitor
  • route monitor status polling through the grace period and keep get_status strict by default
  • document the new web config option and add unit coverage

Why

Backend tasks can transiently enter error states during automatic retries and still eventually succeed. The client previously raised immediately on any error status, producing spurious fatal errors in the CLI. This change lets the monitor wait out brief, configurable error windows before raising, aligning CLI behavior with backend retry semantics.


Note

Introduces a grace period to prevent premature failures when the backend transiently reports error statuses.

  • Extends get_status(task_id, *, error_grace_period=0.0) to optionally wait out ERROR_STATES and surface server error details if the grace period expires (see the usage sketch after this list)
  • Routes all monitor() status polling through _get_status() using config.web.monitor_error_grace_period (new field, default 60.0s)
  • Updates loops in monitor() to use the grace-aware status checks across all task types, including BatchTask
  • Documents the new config in config/README.md; adds monitor_error_grace_period to WebConfig in config/sections.py
  • Adds tests for recovery and expiration paths; adjusts test mocks to accept extra kwargs
  • CHANGELOG: notes fix for CLI monitoring failures on transient backend errors
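
A minimal usage sketch of the new keyword (the task id is a placeholder, and the import path follows the module touched by this PR, tidy3d/web/api/webapi.py):

    from tidy3d.web.api.webapi import get_status

    task_id = "my-task-id"  # placeholder task id

    # Default behavior stays strict: an error status raises immediately.
    status = get_status(task_id)

    # Tolerate transient error states for up to 30 seconds before raising.
    status = get_status(task_id, error_grace_period=30.0)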

Written by Cursor Bugbot for commit 37bff73.

Greptile Overview

Greptile Summary

This PR adds a configurable grace period to handle transient backend error states during monitoring, preventing spurious fatal errors in the CLI when backend tasks temporarily report error states during automatic retries.

Key changes:

  • Added error_grace_period parameter to get_status() that waits out transient error states before raising
  • Introduced monitor_error_grace_period config field (default 60s) in WebConfig to control grace period behavior
  • Routed all monitor() status checks through the grace period mechanism via config default
  • Added helper function _wait_out_error() that polls status until recovery or deadline expiration
  • Preserved strict behavior for direct get_status() calls (default 0s grace period)
  • Added comprehensive test coverage for both recovery and expiration scenarios
  • Updated all test mocks to accept kwargs for backward compatibility

The implementation correctly handles both BatchTask and regular task types, checks for "visualize" status (which maps to "success"), and fetches detailed error messages when grace period expires. The change aligns CLI behavior with backend retry semantics while maintaining backward compatibility.
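
For the monitor() path, the grace period comes from the config default rather than an explicit argument. A sketch of overriding it, assuming the runtime config object is importable as tidy3d.config and that the web section can be set at runtime (both assumptions; only the field name and its 60 s default come from this PR):

    # Assumption: the config object is importable as tidy3d.config and mutable at runtime.
    from tidy3d import config

    # Shorten the default 60 s window for transient backend errors during monitoring.
    config.web.monitor_error_grace_period = 30.0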

Confidence Score: 4/5

  • This PR is safe to merge with low risk: it adds a defensive feature to handle transient errors without changing existing strict behavior
  • The implementation is well-tested with both recovery and expiration scenarios, maintains backward compatibility, and includes comprehensive documentation. The logic is sound and handles both task types correctly. Minor deduction for lack of test coverage on the BatchTask code path, though the implementation appears correct.
  • No files require special attention - implementation is clean and well-tested

Important Files Changed

  • tidy3d/web/api/webapi.py: added grace-period support to get_status and routed monitor through it with the config default
  • tidy3d/config/sections.py: added the monitor_error_grace_period config field with a 60 s default
  • tests/test_web/test_webapi.py: added two grace-period tests (recovery and expiration scenarios) and updated mock signatures

Sequence Diagram

sequenceDiagram
    participant User
    participant monitor
    participant get_status
    participant _wait_out_error
    participant Backend API

    User->>monitor: monitor(task_id)
    monitor->>get_status: get_status(task_id, error_grace_period=60s)
    get_status->>Backend API: get_info(task_id)
    Backend API-->>get_status: status="run_error"
    
    alt status in ERROR_STATES
        get_status->>_wait_out_error: wait out error
        loop until deadline or non-error status
            _wait_out_error->>Backend API: get_info(task_id)
            alt still in grace period
                Backend API-->>_wait_out_error: status="run_error"
                _wait_out_error->>_wait_out_error: sleep(REFRESH_TIME)
            else recovered
                Backend API-->>_wait_out_error: status="running"
                _wait_out_error-->>get_status: return "running"
            end
        end
        
        alt grace period expired with error
            get_status->>Backend API: get_error_json()
            Backend API-->>get_status: error details
            get_status-->>monitor: raise WebError
        else recovered within grace period
            get_status-->>monitor: return "running"
        end
    else status not in ERROR_STATES
        get_status-->>monitor: return status
    end
    
    monitor-->>User: continue monitoring
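
The loop in the diagram can be exercised in isolation with a short, self-contained sketch of the grace-period wait; this is illustrative only (ERROR_STATES, the poll interval, and the fake status sequences are made up here and do not mirror the PR's exact implementation):

    import time

    ERROR_STATES = {"error", "run_error"}  # illustrative set of error statuses

    def wait_out_error(fetch_status, first_status, grace_period, poll_interval=0.01):
        """Poll fetch_status() until the status leaves ERROR_STATES or the deadline passes."""
        if grace_period <= 0:
            return first_status
        deadline = time.monotonic() + grace_period
        status = first_status
        while status.lower() in ERROR_STATES and time.monotonic() < deadline:
            time.sleep(poll_interval)
            status = fetch_status()
        return status

    # Recovery: the backend reports an error twice, then recovers within the window.
    statuses = iter(["run_error", "run_error", "running"])
    assert wait_out_error(lambda: next(statuses), "run_error", grace_period=1.0) == "running"

    # Expiration: the error persists, so the last observed status is still an error
    # and the real get_status() would then fetch error details and raise.
    assert wait_out_error(lambda: "run_error", "run_error", grace_period=0.05) == "run_error"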


@greptile-apps bot left a comment


3 files reviewed, no comments


@yaugenst-flex force-pushed the FXC-5149-cli-falsely-reports-fatal-error-when-backend-retries-task-and-later-succeeds branch from 5e252c0 to 37bff73 on January 28, 2026 at 09:02
@github-actions

Diff Coverage

Diff: origin/develop...HEAD, staged and unstaged changes

  • tidy3d/config/sections.py (100%)
  • tidy3d/web/api/webapi.py (65.9%): Missing lines 748,759-767,778,861,941,963

Summary

  • Total: 42 lines
  • Missing: 14 lines
  • Coverage: 66%

tidy3d/web/api/webapi.py

Lines 744-752

  744     """
  745 
  746     def _wait_out_error(fetch_status: Callable[[], str], raw_status: str | None) -> str | None:
  747         if error_grace_period <= 0:
! 748             return raw_status
  749         deadline = time.monotonic() + error_grace_period
  750         status = (raw_status or "").lower()
  751         while status in ERROR_STATES and time.monotonic() < deadline:
  752             time.sleep(REFRESH_TIME)

Lines 755-771

  755         return raw_status
  756 
  757     task = TaskFactory.get(task_id)
  758     if isinstance(task, BatchTask):
! 759         detail = task.detail()
! 760         raw_status = detail.status
! 761         status = (raw_status or "").lower()
! 762         if status in ERROR_STATES:
! 763             raw_status = _wait_out_error(lambda: task.detail().status, raw_status)
! 764             status = (raw_status or "").lower()
! 765         if status in ERROR_STATES:
! 766             _batch_detail_error(task.task_id)
! 767         return raw_status
  768     else:
  769         task_info = get_info(task_id)
  770         raw_status = task_info.status
  771         status = (raw_status or "").lower()

Lines 774-782

  774         if status in ERROR_STATES:
  775             raw_status = _wait_out_error(lambda: get_info(task_id).status, raw_status)
  776             status = (raw_status or "").lower()
  777             if status == "visualize":
! 778                 return "success"
  779         if status in ERROR_STATES:
  780             try:
  781                 # Try to obtain the error message
  782                 task = SimulationTask(taskId=task_id)

Lines 857-865

  857     def monitor_preprocess() -> None:
  858         """Periodically check the status."""
  859         status = _get_status()
  860         while status not in END_STATES and status != "running":
! 861             new_status = _get_status()
  862             if new_status != status:
  863                 status = new_status
  864                 if verbose and status != "running":
  865                     console.log(f"status = {status}")

Lines 937-945

  937 
  938     else:
  939         # non-verbose case, just keep checking until status is not running or perc_done >= 100
  940         perc_done, _ = get_run_info(task_id)
! 941         while perc_done is not None and perc_done < 100 and _get_status() == "running":
  942             perc_done, field_decay = get_run_info(task_id)
  943             time.sleep(RUN_REFRESH_TIME)
  944 
  945     # post processing

Lines 959-967

  959         if task_type in GUI_SUPPORTED_TASK_TYPES:
  960             url = _get_url(task_id)
  961             console.log(f"View simulation result at [blue underline][link={url}]'{url}'[/link].")
  962     else:
! 963         while _get_status() not in END_STATES:
  964             time.sleep(REFRESH_TIME)
  965 
  966 
  967 @wait_for_connection


Inline review comment on this hunk in monitor():

      # while running but before the percentage done is available, keep waiting
    - while get_run_info(task_id)[0] is None and get_status(task_id) == "running":
    + while get_run_info(task_id)[0] is None and _get_status() == "running":

should the grace period also apply to intermediate hiccups of get_run_info?
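
One possible shape for such a wrapper, purely as an illustration (not part of this PR; the helper name, signature, and retry policy are invented here):

    import time

    def call_with_grace(fetch, grace_period: float, poll_interval: float = 1.0):
        """Call fetch() repeatedly, tolerating exceptions until grace_period expires."""
        deadline = time.monotonic() + grace_period
        while True:
            try:
                return fetch()
            except Exception:
                if time.monotonic() >= deadline:
                    raise
                time.sleep(poll_interval)

    # e.g. perc_done, field_decay = call_with_grace(lambda: get_run_info(task_id), grace_period=30.0)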

Inline review comment on the new monitor_error_grace_period field in config/sections.py:

      )

    + monitor_error_grace_period: NonNegativeFloat = Field(
    +     60.0,

lower this default? what's more realistic and user-friendly - 20-30s?
