base: develop
fix(webapi): add grace period for transient error states #3212
Conversation
3 files reviewed, no comments
Branch force-pushed from 5e252c0 to 37bff73.
Diff Coverage (origin/develop...HEAD, staged and unstaged changes)
Summary
`tidy3d/web/api/webapi.py` (lines flagged `!` are not covered by tests):

```
Lines 744-752
  744      """
  745
  746      def _wait_out_error(fetch_status: Callable[[], str], raw_status: str | None) -> str | None:
  747          if error_grace_period <= 0:
! 748              return raw_status
  749          deadline = time.monotonic() + error_grace_period
  750          status = (raw_status or "").lower()
  751          while status in ERROR_STATES and time.monotonic() < deadline:
  752              time.sleep(REFRESH_TIME)

Lines 755-771
  755          return raw_status
  756
  757      task = TaskFactory.get(task_id)
  758      if isinstance(task, BatchTask):
! 759          detail = task.detail()
! 760          raw_status = detail.status
! 761          status = (raw_status or "").lower()
! 762          if status in ERROR_STATES:
! 763              raw_status = _wait_out_error(lambda: task.detail().status, raw_status)
! 764              status = (raw_status or "").lower()
! 765              if status in ERROR_STATES:
! 766                  _batch_detail_error(task.task_id)
! 767          return raw_status
  768      else:
  769          task_info = get_info(task_id)
  770          raw_status = task_info.status
  771          status = (raw_status or "").lower()

Lines 774-782
  774      if status in ERROR_STATES:
  775          raw_status = _wait_out_error(lambda: get_info(task_id).status, raw_status)
  776          status = (raw_status or "").lower()
  777      if status == "visualize":
! 778          return "success"
  779      if status in ERROR_STATES:
  780          try:
  781              # Try to obtain the error message
  782              task = SimulationTask(taskId=task_id)

Lines 857-865
  857      def monitor_preprocess() -> None:
  858          """Periodically check the status."""
  859          status = _get_status()
  860          while status not in END_STATES and status != "running":
! 861              new_status = _get_status()
  862              if new_status != status:
  863                  status = new_status
  864                  if verbose and status != "running":
  865                      console.log(f"status = {status}")

Lines 937-945
  937
  938      else:
  939          # non-verbose case, just keep checking until status is not running or perc_done >= 100
  940          perc_done, _ = get_run_info(task_id)
! 941          while perc_done is not None and perc_done < 100 and _get_status() == "running":
  942              perc_done, field_decay = get_run_info(task_id)
  943              time.sleep(RUN_REFRESH_TIME)
  944
  945      # post processing

Lines 959-967
  959      if task_type in GUI_SUPPORTED_TASK_TYPES:
  960          url = _get_url(task_id)
  961          console.log(f"View simulation result at [blue underline][link={url}]'{url}'[/link].")
  962      else:
! 963          while _get_status() not in END_STATES:
  964              time.sleep(REFRESH_TIME)
  965
  966
  967  @wait_for_connection
```
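Read together, the excerpts implement a monotonic-deadline poll. Below is a self-contained sketch of that pattern; the constant values and the contents of `ERROR_STATES` are illustrative stand-ins, and the loop body fills in the refresh step that the coverage view elides (lines 753-754), as it plausibly reads:

```python
import time
from typing import Callable, Optional

REFRESH_TIME = 0.5  # stand-in for the module's polling-interval constant
ERROR_STATES = {"error", "run_error"}  # illustrative subset of the real set


def wait_out_error(
    fetch_status: Callable[[], Optional[str]],
    raw_status: Optional[str],
    error_grace_period: float,
) -> Optional[str]:
    """Re-poll ``fetch_status`` until the status leaves ERROR_STATES or the deadline passes."""
    if error_grace_period <= 0:
        return raw_status
    deadline = time.monotonic() + error_grace_period
    status = (raw_status or "").lower()
    while status in ERROR_STATES and time.monotonic() < deadline:
        time.sleep(REFRESH_TIME)
        raw_status = fetch_status()  # refresh step, elided in the coverage view
        status = (raw_status or "").lower()
    return raw_status
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline immune to wall-clock adjustments, which matters for a loop that may run for a minute.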
```diff
  # while running but before the percentage done is available, keep waiting
- while get_run_info(task_id)[0] is None and get_status(task_id) == "running":
+ while get_run_info(task_id)[0] is None and _get_status() == "running":
```
Should the grace period also apply to intermediate hiccups of `get_run_info`?
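For illustration, one way the same deadline pattern could wrap `get_run_info`. This is a hypothetical helper sketched for this review thread, not part of the PR, and it assumes hiccups surface as exceptions rather than as error statuses:

```python
import time


def get_run_info_with_grace(task_id, grace_period: float, refresh_time: float = 1.0):
    """Retry get_run_info for up to ``grace_period`` seconds before re-raising."""
    deadline = time.monotonic() + grace_period
    while True:
        try:
            return get_run_info(task_id)
        except Exception:  # assumed transient API hiccup
            if time.monotonic() >= deadline:
                raise
            time.sleep(refresh_time)
```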
In `config/sections.py`:

```python
)

monitor_error_grace_period: NonNegativeFloat = Field(
    60.0,
```
Lower this default? What would be more realistic and user-friendly: 20-30 s?
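Whatever default lands, users could lower it per session through the field this PR adds. A minimal sketch, assuming the runtime config object is exposed as `tidy3d.config` (the exact entry point may differ):

```python
import tidy3d as td

# Tolerate at most 20 s of transiently reported error states while monitoring;
# 0.0 restores the old fail-fast behavior (the helper returns immediately).
td.config.web.monitor_error_grace_period = 20.0
```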
Summary
Why
Backend tasks can transiently enter error states during automatic retries while eventually succeeding. The client previously raised immediately on any error status, producing false-negative fatal errors in the CLI. This change lets the monitor wait out brief error windows (configurable) before raising, aligning CLI behavior with backend retry semantics.
Note
Introduces a grace period to prevent premature failures when the backend transiently reports error statuses.
- Extends `get_status(task_id, *, error_grace_period=0.0)` to optionally wait out `ERROR_STATES` and surface server error details if the grace period expires (see the usage sketch after this list)
- Routes `monitor()` status polling through `_get_status()` using `config.web.monitor_error_grace_period` (new field, default 60.0 s)
- Updates `monitor()` to use the grace-aware status checks across all task types, including `BatchTask`
- Documents the new field in `config/README.md`; adds `monitor_error_grace_period` to `WebConfig` in `config/sections.py`

Written by Cursor Bugbot for commit 37bff73. This will update automatically on new commits.
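A usage sketch of the new keyword; the `task_id` value is a placeholder, and the import path follows the file shown in the diff above:

```python
from tidy3d.web.api.webapi import get_status, monitor

task_id = "ab12cd34-..."  # placeholder task identifier

# Explicit opt-in: tolerate up to 30 s of transient error states before raising.
status = get_status(task_id, error_grace_period=30.0)

# monitor() picks up config.web.monitor_error_grace_period automatically.
monitor(task_id, verbose=True)
```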
Greptile Overview
Greptile Summary
This PR adds a configurable grace period to handle transient backend error states during monitoring, preventing false-negative fatal errors in the CLI when backend tasks temporarily enter error states during automatic retries.
Key changes:
- Adds an `error_grace_period` parameter to `get_status()` that waits out transient error states before raising
- Adds a `monitor_error_grace_period` config field (default 60 s) in `WebConfig` to control grace-period behavior
- Routes `monitor()` status checks through the grace-period mechanism via the config default
- Adds `_wait_out_error()`, which polls status until recovery or deadline expiration
- Keeps existing `get_status()` calls backward compatible (default 0 s grace period)

The implementation correctly handles both `BatchTask` and regular task types, checks for the "visualize" status (which maps to "success"), and fetches detailed error messages when the grace period expires. The change aligns CLI behavior with backend retry semantics while maintaining backward compatibility.

Confidence Score: 4/5
Important Files Changed
- `tidy3d/web/api/webapi.py`: added grace-period handling to `get_status` and routed `monitor` through it with the config default
- `config/sections.py`: added the `monitor_error_grace_period` config field with a 60 s default

Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant monitor
    participant get_status
    participant _wait_out_error
    participant Backend API
    User->>monitor: monitor(task_id)
    monitor->>get_status: get_status(task_id, error_grace_period=60s)
    get_status->>Backend API: get_info(task_id)
    Backend API-->>get_status: status="run_error"
    alt status in ERROR_STATES
        get_status->>_wait_out_error: wait out error
        loop until deadline or non-error status
            _wait_out_error->>Backend API: get_info(task_id)
            alt still in grace period
                Backend API-->>_wait_out_error: status="run_error"
                _wait_out_error->>_wait_out_error: sleep(REFRESH_TIME)
            else recovered
                Backend API-->>_wait_out_error: status="running"
                _wait_out_error-->>get_status: return "running"
            end
        end
        alt grace period expired with error
            get_status->>Backend API: get_error_json()
            Backend API-->>get_status: error details
            get_status-->>monitor: raise WebError
        else recovered within grace period
            get_status-->>monitor: return "running"
        end
    else status not in ERROR_STATES
        get_status-->>monitor: return status
    end
    monitor-->>User: continue monitoring
```
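A pytest-style sketch of the two outcomes the diagram describes, exercising the standalone `wait_out_error` sketch from earlier rather than the real `get_status` (whose `TaskFactory` and `get_info` dependencies would need heavier stubbing):

```python
def test_wait_out_error_recovers_within_grace_period():
    # Backend reports the error once more, then recovers.
    later_statuses = iter(["run_error", "running"])
    result = wait_out_error(
        fetch_status=lambda: next(later_statuses),
        raw_status="run_error",
        error_grace_period=5.0,
    )
    assert result == "running"


def test_wait_out_error_gives_up_after_deadline():
    # Backend never recovers; the last observed status is returned
    # so the caller can raise with server error details.
    result = wait_out_error(
        fetch_status=lambda: "run_error",
        raw_status="run_error",
        error_grace_period=0.1,
    )
    assert result == "run_error"
```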