
Conversation


@yaugenst-flex commented Jan 28, 2026

Summary

  • add configurable grace period to tolerate transient error states during web.monitor
  • route monitor status polling through the grace period and keep get_status strict by default
  • document the new web config option and add unit coverage

Why

Backend tasks can transiently enter error states during automatic retries and still eventually succeed. The client previously raised immediately on any error status, producing spurious fatal errors in the CLI. This change lets the monitor wait out brief, configurable error windows before raising, aligning CLI behavior with backend retry semantics.


Note

Introduces a grace period to prevent premature failures when the backend transiently reports error statuses.

  • Extends get_status(task_id, *, error_grace_period=0.0) to optionally wait out ERROR_STATES and surface server error details if the grace period expires (see the usage sketch after this list)
  • Routes all monitor() status polling through _get_status() using config.web.monitor_error_grace_period (new field, default 60.0s)
  • Updates loops in monitor() to use the grace-aware status checks across all task types, including BatchTask
  • Documents the new config in config/README.md; adds monitor_error_grace_period to WebConfig in config/sections.py
  • Adds tests for recovery and expiration paths; adjusts test mocks to accept extra kwargs
  • CHANGELOG: notes fix for CLI monitoring failures on transient backend errors
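
A minimal usage sketch of the new keyword (the task id is a placeholder, and the import path follows the module touched by this PR, tidy3d/web/api/webapi.py):

    from tidy3d.web.api.webapi import get_status

    task_id = "my-task-id"  # placeholder task id

    # Default behavior stays strict: an error status raises immediately.
    status = get_status(task_id)

    # Tolerate transient error states for up to 30 seconds before raising.
    status = get_status(task_id, error_grace_period=30.0)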

Written by Cursor Bugbot for commit 37bff73.

Greptile Overview

Greptile Summary

This PR adds a configurable grace period to handle transient backend error states during monitoring, preventing spurious fatal errors in the CLI when backend tasks temporarily report error states during automatic retries.

Key changes:

  • Added error_grace_period parameter to get_status() that waits out transient error states before raising
  • Introduced monitor_error_grace_period config field (default 60s) in WebConfig to control grace period behavior
  • Routed all monitor() status checks through the grace period mechanism via config default
  • Added helper function _wait_out_error() that polls status until recovery or deadline expiration
  • Preserved strict behavior for direct get_status() calls (default 0s grace period)
  • Added comprehensive test coverage for both recovery and expiration scenarios
  • Updated all test mocks to accept kwargs for backward compatibility

The implementation correctly handles both BatchTask and regular task types, checks for "visualize" status (which maps to "success"), and fetches detailed error messages when grace period expires. The change aligns CLI behavior with backend retry semantics while maintaining backward compatibility.
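
For the monitor() path, the grace period comes from the config default rather than an explicit argument. A sketch of overriding it, assuming the runtime config object is importable as tidy3d.config and that the web section can be set at runtime (both assumptions; only the field name and its 60 s default come from this PR):

    # Assumption: the config object is importable as tidy3d.config and mutable at runtime.
    from tidy3d import config

    # Shorten the default 60 s window for transient backend errors during monitoring.
    config.web.monitor_error_grace_period = 30.0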

Confidence Score: 4/5

  • This PR is safe to merge with low risk: it adds a defensive feature to handle transient errors without changing existing strict behavior
  • The implementation is well-tested with both recovery and expiration scenarios, maintains backward compatibility, and includes comprehensive documentation. The logic is sound and handles both task types correctly. Minor deduction for lack of test coverage on the BatchTask code path, though the implementation appears correct.
  • No files require special attention - implementation is clean and well-tested

Important Files Changed

  • tidy3d/web/api/webapi.py: added grace-period support to get_status and routed monitor through it with the config default
  • tidy3d/config/sections.py: added the monitor_error_grace_period config field with a 60 s default
  • tests/test_web/test_webapi.py: added two grace-period tests (recovery and expiration scenarios) and updated mock signatures

Sequence Diagram

sequenceDiagram
    participant User
    participant monitor
    participant get_status
    participant _wait_out_error
    participant Backend API

    User->>monitor: monitor(task_id)
    monitor->>get_status: get_status(task_id, error_grace_period=60s)
    get_status->>Backend API: get_info(task_id)
    Backend API-->>get_status: status="run_error"
    
    alt status in ERROR_STATES
        get_status->>_wait_out_error: wait out error
        loop until deadline or non-error status
            _wait_out_error->>Backend API: get_info(task_id)
            alt still in grace period
                Backend API-->>_wait_out_error: status="run_error"
                _wait_out_error->>_wait_out_error: sleep(REFRESH_TIME)
            else recovered
                Backend API-->>_wait_out_error: status="running"
                _wait_out_error-->>get_status: return "running"
            end
        end
        
        alt grace period expired with error
            get_status->>Backend API: get_error_json()
            Backend API-->>get_status: error details
            get_status-->>monitor: raise WebError
        else recovered within grace period
            get_status-->>monitor: return "running"
        end
    else status not in ERROR_STATES
        get_status-->>monitor: return status
    end
    
    monitor-->>User: continue monitoring
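
The loop in the diagram can be exercised in isolation with a short, self-contained sketch of the grace-period wait; this is illustrative only (ERROR_STATES, the poll interval, and the fake status sequences are made up here and do not mirror the PR's exact implementation):

    import time

    ERROR_STATES = {"error", "run_error"}  # illustrative set of error statuses

    def wait_out_error(fetch_status, first_status, grace_period, poll_interval=0.01):
        """Poll fetch_status() until the status leaves ERROR_STATES or the deadline passes."""
        if grace_period <= 0:
            return first_status
        deadline = time.monotonic() + grace_period
        status = first_status
        while status.lower() in ERROR_STATES and time.monotonic() < deadline:
            time.sleep(poll_interval)
            status = fetch_status()
        return status

    # Recovery: the backend reports an error twice, then recovers within the window.
    statuses = iter(["run_error", "run_error", "running"])
    assert wait_out_error(lambda: next(statuses), "run_error", grace_period=1.0) == "running"

    # Expiration: the error persists, so the last observed status is still an error
    # and the real get_status() would then fetch error details and raise.
    assert wait_out_error(lambda: "run_error", "run_error", grace_period=0.05) == "run_error"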


@greptile-apps bot left a comment


3 files reviewed, no comments


@yaugenst-flex force-pushed the FXC-5149-cli-falsely-reports-fatal-error-when-backend-retries-task-and-later-succeeds branch from 5e252c0 to 37bff73 on January 28, 2026 at 09:02
@github-actions

Diff Coverage

Diff: origin/develop...HEAD, staged and unstaged changes

  • tidy3d/config/sections.py (100%)
  • tidy3d/web/api/webapi.py (65.9%): Missing lines 748,759-767,778,861,941,963

Summary

  • Total: 42 lines
  • Missing: 14 lines
  • Coverage: 66%

tidy3d/web/api/webapi.py

Lines 744-752

  744     """
  745 
  746     def _wait_out_error(fetch_status: Callable[[], str], raw_status: str | None) -> str | None:
  747         if error_grace_period <= 0:
! 748             return raw_status
  749         deadline = time.monotonic() + error_grace_period
  750         status = (raw_status or "").lower()
  751         while status in ERROR_STATES and time.monotonic() < deadline:
  752             time.sleep(REFRESH_TIME)

Lines 755-771

  755         return raw_status
  756 
  757     task = TaskFactory.get(task_id)
  758     if isinstance(task, BatchTask):
! 759         detail = task.detail()
! 760         raw_status = detail.status
! 761         status = (raw_status or "").lower()
! 762         if status in ERROR_STATES:
! 763             raw_status = _wait_out_error(lambda: task.detail().status, raw_status)
! 764             status = (raw_status or "").lower()
! 765         if status in ERROR_STATES:
! 766             _batch_detail_error(task.task_id)
! 767         return raw_status
  768     else:
  769         task_info = get_info(task_id)
  770         raw_status = task_info.status
  771         status = (raw_status or "").lower()

Lines 774-782

  774         if status in ERROR_STATES:
  775             raw_status = _wait_out_error(lambda: get_info(task_id).status, raw_status)
  776             status = (raw_status or "").lower()
  777             if status == "visualize":
! 778                 return "success"
  779         if status in ERROR_STATES:
  780             try:
  781                 # Try to obtain the error message
  782                 task = SimulationTask(taskId=task_id)

Lines 857-865

  857     def monitor_preprocess() -> None:
  858         """Periodically check the status."""
  859         status = _get_status()
  860         while status not in END_STATES and status != "running":
! 861             new_status = _get_status()
  862             if new_status != status:
  863                 status = new_status
  864                 if verbose and status != "running":
  865                     console.log(f"status = {status}")

Lines 937-945

  937 
  938     else:
  939         # non-verbose case, just keep checking until status is not running or perc_done >= 100
  940         perc_done, _ = get_run_info(task_id)
! 941         while perc_done is not None and perc_done < 100 and _get_status() == "running":
  942             perc_done, field_decay = get_run_info(task_id)
  943             time.sleep(RUN_REFRESH_TIME)
  944 
  945     # post processing

Lines 959-967

  959         if task_type in GUI_SUPPORTED_TASK_TYPES:
  960             url = _get_url(task_id)
  961             console.log(f"View simulation result at [blue underline][link={url}]'{url}'[/link].")
  962     else:
! 963         while _get_status() not in END_STATES:
  964             time.sleep(REFRESH_TIME)
  965 
  966 
  967 @wait_for_connection


Inline review comment on this hunk in monitor():

      # while running but before the percentage done is available, keep waiting
    - while get_run_info(task_id)[0] is None and get_status(task_id) == "running":
    + while get_run_info(task_id)[0] is None and _get_status() == "running":

should the grace period also apply to intermediate hiccups of get_run_info?
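
One possible shape for such a wrapper, purely as an illustration (not part of this PR; the helper name, signature, and retry policy are invented here):

    import time

    def call_with_grace(fetch, grace_period: float, poll_interval: float = 1.0):
        """Call fetch() repeatedly, tolerating exceptions until grace_period expires."""
        deadline = time.monotonic() + grace_period
        while True:
            try:
                return fetch()
            except Exception:
                if time.monotonic() >= deadline:
                    raise
                time.sleep(poll_interval)

    # e.g. perc_done, field_decay = call_with_grace(lambda: get_run_info(task_id), grace_period=30.0)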

Inline review comment on the new monitor_error_grace_period field in config/sections.py:

      )

    + monitor_error_grace_period: NonNegativeFloat = Field(
    +     60.0,

lower this default? what's more realistic and user-friendly - 20-30s?
