-
Notifications
You must be signed in to change notification settings - Fork 70
fix(webapi): add grace period for transient error states #3212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
yaugenst-flex
wants to merge
1
commit into
develop
Choose a base branch
from
FXC-5149-cli-falsely-reports-fatal-error-when-backend-retries-task-and-later-succeeds
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -732,22 +732,50 @@ def _get_batch_detail_handle_error_status(batch: BatchTask) -> BatchDetail: | |
| return detail | ||
|
|
||
|
|
||
| def get_status(task_id: TaskId) -> str: | ||
| def get_status(task_id: TaskId, *, error_grace_period: float = 0.0) -> str: | ||
| """Get the status of a task. Raises an error if status is "error". | ||
|
|
||
| Parameters | ||
| ---------- | ||
| task_id : str | ||
| Unique identifier of task on server. Returned by :meth:`upload`. | ||
| error_grace_period : float = 0.0 | ||
| Seconds to wait out transient error statuses before raising an error. | ||
| """ | ||
|
|
||
| def _wait_out_error(fetch_status: Callable[[], str], raw_status: str | None) -> str | None: | ||
| if error_grace_period <= 0: | ||
| return raw_status | ||
| deadline = time.monotonic() + error_grace_period | ||
| status = (raw_status or "").lower() | ||
| while status in ERROR_STATES and time.monotonic() < deadline: | ||
| time.sleep(REFRESH_TIME) | ||
| raw_status = fetch_status() | ||
| status = (raw_status or "").lower() | ||
| return raw_status | ||
|
|
||
| task = TaskFactory.get(task_id) | ||
| if isinstance(task, BatchTask): | ||
| return _get_batch_detail_handle_error_status(task).status | ||
| detail = task.detail() | ||
| raw_status = detail.status | ||
| status = (raw_status or "").lower() | ||
| if status in ERROR_STATES: | ||
| raw_status = _wait_out_error(lambda: task.detail().status, raw_status) | ||
| status = (raw_status or "").lower() | ||
| if status in ERROR_STATES: | ||
| _batch_detail_error(task.task_id) | ||
| return raw_status | ||
| else: | ||
| task_info = get_info(task_id) | ||
| status = task_info.status | ||
| raw_status = task_info.status | ||
| status = (raw_status or "").lower() | ||
| if status == "visualize": | ||
| return "success" | ||
| if status in ERROR_STATES: | ||
| raw_status = _wait_out_error(lambda: get_info(task_id).status, raw_status) | ||
| status = (raw_status or "").lower() | ||
| if status == "visualize": | ||
| return "success" | ||
| if status in ERROR_STATES: | ||
| try: | ||
| # Try to obtain the error message | ||
|
|
@@ -762,7 +790,7 @@ def get_status(task_id: TaskId) -> str: | |
| error_msg = "Error message could not be obtained, please contact customer support." | ||
|
|
||
| raise WebError(f"Error running task {task_id}! {error_msg}") | ||
| return status | ||
| return raw_status | ||
|
|
||
|
|
||
| def monitor(task_id: TaskId, verbose: bool = True, worker_group: Optional[str] = None) -> None: | ||
|
|
@@ -823,18 +851,21 @@ def get_estimated_cost() -> float: | |
| est_flex_unit = task_info.estFlexUnit | ||
| return est_flex_unit | ||
|
|
||
| def _get_status() -> str: | ||
| return get_status(task_id, error_grace_period=config.web.monitor_error_grace_period) | ||
|
|
||
| def monitor_preprocess() -> None: | ||
| """Periodically check the status.""" | ||
| status = get_status(task_id) | ||
| status = _get_status() | ||
| while status not in END_STATES and status != "running": | ||
| new_status = get_status(task_id) | ||
| new_status = _get_status() | ||
| if new_status != status: | ||
| status = new_status | ||
| if verbose and status != "running": | ||
| console.log(f"status = {status}") | ||
| time.sleep(REFRESH_TIME) | ||
|
|
||
| status = get_status(task_id) | ||
| status = _get_status() | ||
|
|
||
| if verbose: | ||
| console.log(f"status = {status}") | ||
|
|
@@ -861,7 +892,7 @@ def monitor_preprocess() -> None: | |
| console.log("starting up solver") | ||
|
|
||
| # while running but before the percentage done is available, keep waiting | ||
| while get_run_info(task_id)[0] is None and get_status(task_id) == "running": | ||
| while get_run_info(task_id)[0] is None and _get_status() == "running": | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should the grace period also apply to intermediate hick-ups of |
||
| time.sleep(REFRESH_TIME) | ||
|
|
||
| # while running but percentage done is available | ||
|
|
@@ -873,9 +904,7 @@ def monitor_preprocess() -> None: | |
| pbar_pd = progress.add_task("% done", total=100) | ||
| perc_done, _ = get_run_info(task_id) | ||
|
|
||
| while ( | ||
| perc_done is not None and perc_done < 100 and get_status(task_id) == "running" | ||
| ): | ||
| while perc_done is not None and perc_done < 100 and _get_status() == "running": | ||
| perc_done, field_decay = get_run_info(task_id) | ||
| new_description = f"solver progress (field decay = {field_decay:.2e})" | ||
| progress.update(pbar_pd, completed=perc_done, description=new_description) | ||
|
|
@@ -892,9 +921,7 @@ def monitor_preprocess() -> None: | |
| pbar_pd = progress.add_task("% done", total=100) | ||
| perc_done, _ = get_run_info(task_id) | ||
|
|
||
| while ( | ||
| perc_done is not None and perc_done < 100 and get_status(task_id) == "running" | ||
| ): | ||
| while perc_done is not None and perc_done < 100 and _get_status() == "running": | ||
| perc_done, _ = get_run_info(task_id) | ||
| new_description = "solver progress" | ||
| progress.update(pbar_pd, completed=perc_done, description=new_description) | ||
|
|
@@ -904,26 +931,26 @@ def monitor_preprocess() -> None: | |
| new_description = "solver progress" | ||
| progress.update(pbar_pd, completed=100, refresh=True, description=new_description) | ||
| else: | ||
| while get_status(task_id) == "running": | ||
| while _get_status() == "running": | ||
| perc_done, _ = get_run_info(task_id) | ||
| time.sleep(RUN_REFRESH_TIME) | ||
|
|
||
| else: | ||
| # non-verbose case, just keep checking until status is not running or perc_done >= 100 | ||
| perc_done, _ = get_run_info(task_id) | ||
| while perc_done is not None and perc_done < 100 and get_status(task_id) == "running": | ||
| while perc_done is not None and perc_done < 100 and _get_status() == "running": | ||
| perc_done, field_decay = get_run_info(task_id) | ||
| time.sleep(RUN_REFRESH_TIME) | ||
|
|
||
| # post processing | ||
| if verbose: | ||
| status = get_status(task_id) | ||
| status = _get_status() | ||
| if status != "running": | ||
| console.log(f"status = {status}") | ||
|
|
||
| with console.status(f"[bold green]Finishing '{task_name}'...", spinner="runner"): | ||
| while status not in END_STATES: | ||
| new_status = get_status(task_id) | ||
| new_status = _get_status() | ||
| if new_status != status: | ||
| status = new_status | ||
| console.log(f"status = {status}") | ||
|
|
@@ -933,7 +960,7 @@ def monitor_preprocess() -> None: | |
| url = _get_url(task_id) | ||
| console.log(f"View simulation result at [blue underline][link={url}]'{url}'[/link].") | ||
| else: | ||
| while get_status(task_id) not in END_STATES: | ||
| while _get_status() not in END_STATES: | ||
| time.sleep(REFRESH_TIME) | ||
|
|
||
|
|
||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lower this default? what's more realistic and user-friendly - 20-30s?