-
Notifications
You must be signed in to change notification settings - Fork 102
fix: resolve local activities in order of history events #1075
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| /// 8: EVENT_TYPE_WORKFLOW_TASK_SCHEDULED | ||
| /// 9: EVENT_TYPE_WORKFLOW_TASK_STARTED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just updating comment here to match actual history
Sushisource
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good fix. Love how you simplified things here.
The only thing left to do here is ensure that we're not breaking any existing histories (that wouldn't have already been broken by this bug). The way to do that is to take your new tests, and run them without your fix (multiple times since they will have different LA ordering?), save the histories (crates/sdk-core/tests/histories), and then use those in the replayer too.
I think, with this fix, it'll be the case that either the history triggers an NDE or not depending on if the original run happened to line up with the now-deterministic ordering of the LAs or not. I think the potentially worrying scenario here is that users possibly could've had workflows that, under replay, would sporadically throw NDEs but eventually pass after a few retries because they happened to hit the right sequence. Now, those workflows may fail/pass 100% of the time. I think that's probably acceptable, considering how rare this bug was to trigger anyway, but the moral of the story is we want to prove to ourselves the circumstances where the behavior change does or does not cause old histories to fail to replay. @mjameswh probably has some good advice about what other tests can be added based on his original research.
| self.process_machine_responses(mk, resps)?; | ||
| } else { | ||
| self.local_activity_data.insert_peeked_marker(*la_dat); | ||
| // Since the LA machine exists, we have encountered the LA WF command. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this case even possible? If it is, it feels wrong, and possibly dangerous, to simply drop the marker.
Unless we can prove this case exists, or otherwise prove that dropping the marker would be safe, I think we should just remove the call to will_accept_resolve_marker above, and let try_resolve_with_dat propagate an error should we ever reached that from an unexpected state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is hit during query_during_wft_heartbeat_doesnt_accidentally_fail_to_continue_heartbeat both before/after the changes in this PR. Previously it would be added to peeked collection, but just sit there unused since the LAM was already created.
The test in question executes the LA, but has 2 history polls where the second one includes 2 WFT so we do perform a peek ahead even in this non-replay scenario. The marker will get handled through the normal means when we process the 3rd WFT.
|
Great! Thanks for completing that work. And really pleased to see this ended up removing the need for fake markers, that's a welcome simplification.
I'm quite sure there's no risk of "sporadic" repro on this issue. Legacy behavior was actually deterministic (according to the traditional, non-temporal definition of that word) in both the non-replay and replay cases, though possibly inconsistent between these two (hence the non-determinism issue according to Temporal's definition of that word). That is:
It's true that some workflows were replaying successfully despite technically hitting this LA bug, but that's still fully deterministic (traditional definition) for a given workflow execution. There would be no case of "retry a few times, and it may eventually pass". Note that it is also possible to get various situations where we have within the scope of a single WFT, some LAs that completed within the same WFT that they were scheduled, mixed with other LAs that completed in different WFT (i.e. either scheduled during a previous WFT, or that will complete in a following WFT). I'm definitely NOT confident that we properly handled those scenarios so far in this PR. They will have to be explicitly tested. I have other concerns that I want us to cover in tests. I'll detail that tomorrow. |
What was changed
On replay, local activities will now be resolved in the order that their markers appear in history as opposed to their schedule order.
In order to achieve this a few things had to change:
Peekahead Preresolutions
Previously we stored these as a map, switch to a vec so we can preserve the order we peek them in the history.
LA machine changes
Before

After

Important change here is the addition of a state between scheduling and getting resolved. This allows us to still create the LA machines as they come in, but delays resolving them until later so we can resolve them in the order we peeked the resolutions.
Removal of
FakeMarkerThis was causing failures with the new state machine as we would end up applying workflow completion events to LA machines.
I did a quick test off of
masterand removing theFakeMarkerthat gets emitted on transitioning toWaitingMarkerEventPreResolveddoesn't result in any test failure. I can do this change in a separate PR if desired.Why?
This could cause NDE errors if there were additional commands produced on each LA completion and the first scheduled LA didn't complete first e.g.
would trigger NDE on replay if in history
cfinished beforea.Checklist
Closes [Bug] NDE replaying nested promises sdk-typescript#1744 once TS is updated to include this commit
How was this tested:
Existing test suite. Added test where LAs are resolved in different order than they are scheduled.
Updated TS to this branch and the failing nested promise replay test now passes without the NDE
Any docs updates needed?
No