WIP: try to recover failing job #1142
Proposing a recovery method for failing orders
This is a WIP suggestion for how to solve the issue where orders are marked as failed because of a prover error during proving (quoted in the Context section below). I hope we manage to solve this behavior. Currently the proposed changes do not work, but I hope someone can assist with that.
Specifically, we need a way to emit an enum variant that we can listen for and use to clean up failing tasks. I'm not sure whether the tasks fail because the segment data is corrupted, or whether they could be retried with the same segment data and proven successfully.
I have managed to reset orders by cleaning the job and its tasks and then setting the order status of the failed order back to PendingProving. This works, but it is not optimal: e.g. when a 51B job fails 90% of the way into proving, we must start over from scratch.
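The manual reset above can be sketched as SQL. Heavily hedged: the table and column names here (tasks, jobs, orders, status) are my guesses at the broker's SQLite schema, not its actual layout; verify against the real database before running anything.

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the manual reset: clear the failed order's job and
# tasks, then re-queue the order. Schema names are assumptions.

# Print the SQL that would perform the reset for one order id.
emit_reset_sql() {
    order_id="$1"
    cat <<SQL
DELETE FROM tasks WHERE job_id IN (SELECT id FROM jobs WHERE order_id = '$order_id');
DELETE FROM jobs WHERE order_id = '$order_id';
UPDATE orders SET status = 'PendingProving' WHERE id = '$order_id';
SQL
}

# In production, pipe the output into sqlite3 against the broker database:
#   emit_reset_sql 0xabc | sqlite3 /path/to/broker.db
```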
Here is my current workflow for the issue, where I tried to automate resetting orders marked as Failed.
Context
On my node I’ve consistently seen that larger orders sometimes fail with:
Proving failed after retries …
Monitoring proof (stark) failed: [B-BON-005] Prover failure: SessionId
Smaller orders (<300 cycles) almost never fail, but larger ones do so frequently.
I’ve investigated this in the Rust code and proposed a fix in this PR, but I also want to document the current workaround I use in production.
Current Workaround
I run a systemd service that tails the broker logs, detects failing orders, and automatically resets them. This makes the broker pick up the order again for proving without manual intervention.
Service setup
auto-reset.service:
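A minimal sketch of such a unit. The ExecStart path and script name are placeholders, not my actual setup:

```
[Unit]
Description=Auto-reset failing broker orders
After=network.target

[Service]
# Placeholder path: the script tails the broker logs and resets failed orders.
ExecStart=/usr/local/bin/auto-reset.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```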