[Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on #16285
Comments
I saw you commented on another question (#15807 (comment)); please take some time to look at this one. Thanks @SbloodyS
The …
Thanks for your comment, I know the … When I re-execute a node in the upstream, the dependent node in the downstream will fail if the downstream is executed after the upstream is successfully re-executed. E.g.: the upstream is dispatched at 1 a.m. and the downstream at 10 a.m. If at 7 a.m. I re-execute one task in the upstream that the dependent node in the downstream does not rely on, and it succeeds within an hour, the downstream will still fail within a few seconds when it is dispatched at 10 a.m. I hope I made myself clear. This may not be a code bug, but it is a very common requirement. Is there any good way to avoid this situation?
Like I said. The …
I'm sure I get you. But my downstream doesn't depend on all tasks in the upstream, only on one task. I'll show you some screenshots. E.g.: the upstream is dispatched at 1 a.m. and the downstream at 10 a.m. If at 7 a.m. I re-execute one task in the upstream that the dependent node in the downstream does not rely on, and it succeeds within an hour, the downstream will fail within a few seconds when it is dispatched at 10 a.m. This is how I re-execute only one task in the upstream:
I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.
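A rough, self-contained sketch of the behaviour described here (the types and method below are illustrative stand-ins, not the actual DolphinScheduler classes):

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch only: these records stand in for the real workflow/task
// instance models; they are not the actual DolphinScheduler classes.
class DependentCheckSketch {

    enum Status { SUCCESS, FAILURE, RUNNING }

    record TaskInstance(String taskCode, Status status) {}

    record WorkflowInstance(Status status, List<TaskInstance> taskInstances) {}

    /**
     * Mimics the check described above: take the latest workflow instance in
     * the cycle and look for the depended task inside it. If that instance was
     * produced by re-running only A-1, it contains no A-3 task instance, so
     * the dependent node is treated as failed even though A-3 once succeeded.
     */
    Status checkDependedTask(WorkflowInstance latestInstance, String dependedTaskCode) {
        Optional<TaskInstance> depended = latestInstance.taskInstances().stream()
                .filter(t -> t.taskCode().equals(dependedTaskCode))
                .findFirst();

        if (depended.isEmpty()) {
            // The depended task was never part of this (partial) run. Because
            // the workflow instance itself is already finished, the dependent
            // node ends up failed here.
            return latestInstance.status() == Status.RUNNING ? Status.RUNNING : Status.FAILURE;
        }
        return depended.get().status();
    }
}
```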
Yes, that's what I mean. Btw, my version is 3.1.8 (the logic of this part is still the same in later versions). I have checked the code, and I think this scenario is common, especially if you've just switched to DS from another scheduling system, because tasks will not be migrated all at once but one by one, which means the process definition will be modified frequently. Maybe we can start with these two places in the code:
In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.
Looking forward to others' opinions. |
+1 |
After taking a deep look at it, I found the task instance status is still bound to the workflow instance status in the …
You mean the current fix is not good enough, right? BTW, I found a new scenario which could cause the same problem. Both workflows are the same as in #16285 (comment). I set the failure retry interval for A-3 to 5 minutes. Assume that workflow A failed at 7:00 a.m. because A-3 failed. Then workflow B failed at 7:03, because at that time A-3's status was failure. At 7:05, A-3 retried and succeeded. So when I check the result of both workflows, I find that the status of A is success and B is failure. So I think that if a task like A-3 is depended on by other tasks and a retry policy has been set, the status of the downstream task should be waiting until the upstream task has no more retries.
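One way to express that suggestion, as a purely illustrative sketch (the field and method names are hypothetical, not the actual DolphinScheduler model):

```java
// Illustrative only: a dependent check that keeps the downstream node waiting
// while the upstream task still has retries left. Field names are hypothetical.
class RetryAwareDependentCheckSketch {

    enum DependResult { SUCCESS, FAILED, WAITING }

    static class UpstreamTask {
        boolean succeeded;
        int retryTimes;     // retries already used
        int maxRetryTimes;  // configured retry limit
    }

    DependResult evaluate(UpstreamTask task) {
        if (task.succeeded) {
            return DependResult.SUCCESS;
        }
        // The task failed, but it may still retry (like A-3 at 7:03 in the
        // scenario above), so the downstream keeps waiting instead of
        // failing immediately.
        if (task.retryTimes < task.maxRetryTimes) {
            return DependResult.WAITING;
        }
        return DependResult.FAILED;
    }
}
```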
This PR #15795 fixes a different issue. |
#15795 has nothing to do with …
OK, I get you. I mean #15795 worked. I'll find a new way to fix this issue. BTW, please take a look at this scenario (#16285 (comment)) again. Do you think the status of the downstream task should be waiting until the upstream task has no more retries?
Yes. However, it is not easy to implement under the existing architecture. We can optimize it after this refactoring is merged. #16423
OK, I'll first find a new way to fix this issue, which will make the task instance status not bound to the workflow instance status. And I'll take a look at the new architecture later.
You can start from here
@starrysxy It is a bug. However, I don't think it still exists in 3.2.2-release or dev. You could cherry-pick the related stuff to patch your local code. As you can see in the code fragments below, in the latest logic the query joins the task instance table to make sure the upstream task is included when querying the last workflow instance. Lines 192 to 234 in 57c80f2
But this logic was added only in and after 3.2.2-release: Lines 209 to 236 in 15d356b
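For illustration, the kind of join described there might look roughly like the query below (the table and column names reflect my reading of the DolphinScheduler schema and are not the actual mapper SQL):

```java
// Illustrative only: roughly what "join the task instance table when querying
// the last workflow instance" could look like. Table and column names follow
// my reading of the DolphinScheduler schema, not the real mapper XML.
public final class LastInstanceQuerySketch {

    static final String QUERY_LAST_INSTANCE_WITH_TASK = """
            SELECT pi.*
            FROM t_ds_process_instance pi
            JOIN t_ds_task_instance ti ON ti.process_instance_id = pi.id
            WHERE pi.process_definition_code = ?   -- the upstream workflow (A)
              AND ti.task_code = ?                 -- the depended task, e.g. A-3
              AND pi.schedule_time BETWEEN ? AND ? -- the dependency cycle
            ORDER BY pi.end_time DESC
            LIMIT 1
            """;
}
```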
@EricGao888 Yes, like I said, #15795 and #15712 worked. But I think @SbloodyS wants to find a new way to fix this bug, which will make the task instance status not bound to the workflow instance status. Am I right?
My dev branch is the latest version, and I have also checked the 3.2.2-release branch, but I can't find …
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs. |
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future. |
Search before asking
What happened
First of all, I'm not sure whether this is a code bug. I checked the code, and the behaviour is consistent with the current code logic, but from a business point of view I don't think it is very reasonable. If necessary, please help modify the tag.
Suppose there are two workflows A and B. Workflow A has tasks A-1, A-2, and A-3. Workflow B depends on task A-3 in workflow A. When the A workflow is finished, if the A-1 task in the A workflow is re-executed separately and the B workflow instance in the same cycle has not been executed, the dependent node in the B workflow will fail, which will cause the B workflow instance to fail.
The logic in the code is to find the workflow instance with the latest endTime in each cycle, so the workflow instance where the A-1 task is executed alone will be found, but this workflow instance does not have the A-3 task that the downstream B workflow depends on. Therefore, after getting the A workflow instance, the A-3 task instance cannot be found when traversing the task instances. At the same time, the A workflow instance is in the completed state, so the dependent node is marked as failed, and then the B workflow is marked as failed.
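A tiny, self-contained sketch of that selection step (illustrative types only, not the actual DolphinScheduler code):

```java
import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.List;

// Illustrative only: why a partial re-run "wins" the lookup. The check picks
// the instance with the latest end time in the cycle, which is the one that
// only ran A-1 and therefore contains no A-3 task instance.
class LatestInstanceSelectionSketch {

    record WorkflowInstance(String name, LocalDateTime endTime, List<String> taskCodes) {}

    WorkflowInstance pickLatest(List<WorkflowInstance> instancesInCycle) {
        return instancesInCycle.stream()
                .max(Comparator.comparing(WorkflowInstance::endTime))
                .orElseThrow();
    }

    void demo() {
        var fullRun = new WorkflowInstance("A (scheduled run)",
                LocalDateTime.of(2024, 7, 1, 1, 30), List.of("A-1", "A-2", "A-3"));
        var partialRun = new WorkflowInstance("A (re-run of A-1 only)",
                LocalDateTime.of(2024, 7, 1, 8, 0), List.of("A-1"));

        WorkflowInstance latest = pickLatest(List.of(fullRun, partialRun));
        // latest == partialRun, which contains no "A-3", so the dependent node fails.
        System.out.println(latest.name() + " contains A-3? "
                + latest.taskCodes().contains("A-3"));
    }
}
```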
In my opinion, the logic of this part of the code exists to make the 'ALL' option for dependent tasks convenient: instead of checking the status of each task in the upstream workflow, it directly checks the status of the entire upstream workflow. However, I think that in real work it is inevitable to modify a workflow, and it is also inevitable to re-execute a task after modifying it. At the same time, re-executing the entire workflow may cause problems such as late result output and wasted machine resources. Therefore, the logic here could be optimized.
What you expected to happen
If a dependent task in an upstream workflow has ever succeeded, the node status of the dependent task in the downstream workflow of the same cycle should be marked as successful.
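Sketching what that could mean in code (purely illustrative; the types and method here are hypothetical, not a proposed patch to the actual dependent-task logic):

```java
import java.util.List;

// Illustrative only: the dependency result is derived from the depended task's
// own instances in the cycle, so a partial re-run of other tasks in the
// upstream workflow no longer affects it.
class EverSucceededDependentCheckSketch {

    enum Status { SUCCESS, FAILURE }

    record TaskInstance(String taskCode, Status status) {}

    /** SUCCESS as long as the depended task (e.g. A-3) has ever succeeded in the cycle. */
    Status checkDependedTask(List<TaskInstance> taskInstancesInCycle, String dependedTaskCode) {
        boolean everSucceeded = taskInstancesInCycle.stream()
                .anyMatch(t -> t.taskCode().equals(dependedTaskCode)
                        && t.status() == Status.SUCCESS);
        return everSucceeded ? Status.SUCCESS : Status.FAILURE;
    }
}
```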
How to reproduce
Suppose there are two workflows A and B. Workflow A has tasks A-1, A-2, and A-3. Workflow B depends on task A-3 in workflow A. When the A workflow is finished, if the A-1 task in the A workflow is re-executed separately and the B workflow instance in the same cycle has not been executed, the dependent node in the B workflow will fail, which will cause the B workflow instance to fail.
Anything else
No response
Version
3.1.x
Are you willing to submit PR?
Code of Conduct