
[Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on #16285

Closed
3 tasks done
starrysxy opened this issue Jul 7, 2024 · 24 comments
Labels
bug (Something isn't working) · discussion · priority:middle · Stale

Comments

@starrysxy
Contributor

starrysxy commented Jul 7, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

First of all, I'm not sure whether this is a code bug. I checked the code and the behavior is consistent with the current logic, but from a business-logic point of view I don't think it is reasonable. If necessary, please help adjust the label.

Suppose there are two workflows A and B. Workflow A has tasks A-1, A-2, and A-3. Workflow B depends on task A-3 in workflow A. When the A workflow is finished, if the A-1 task in the A workflow is re-executed separately and the B workflow instance in the same cycle has not been executed, the dependent node in the B workflow will fail, which will cause the B workflow instance to fail.

The logic in the code is to find the workflow instance with the latest endTime in each cycle, so the workflow instance where the A-1 task is executed alone will be found, but this workflow instance does not have the A-3 task that the downstream B workflow depends on. Therefore, after getting the A workflow instance, the A-3 task instance cannot be found when traversing the task instances. At the same time, the A workflow instance is in the completed state, so the dependent node is marked as failed, and then the B workflow is marked as failed.
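
Roughly, the behavior described above looks like the following simplified sketch (the method and variable names here are hypothetical and only illustrate the flow; this is not the actual DolphinScheduler source):

    // Step 1: pick the workflow instance with the latest endTime in the cycle.
    // After the manual rerun, this is the instance that only contains A-1.
    ProcessInstance latest = findWorkflowInstanceWithLatestEndTime(workflowACode, cycleStart, cycleEnd);

    // Step 2: look for the dependent task (A-3) among that instance's task instances.
    TaskInstance a3 = null;
    for (TaskInstance task : findTaskInstances(latest.getId())) {
        if (task.getTaskCode() == taskA3Code) {
            a3 = task;
        }
    }

    // Step 3: A-3 is missing, but the instance is already finished,
    // so the dependent node is marked FAILED and workflow B fails.
    if (a3 == null && latest.getState().isFinished()) {
        dependResult = DependResult.FAILED;
    }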

In my opinion, this part of the code is written to make the 'ALL' option for dependent tasks easy to evaluate: instead of checking the status of each task in the upstream workflow, it checks the status of the entire upstream workflow directly. However, in real work it is inevitable that workflows get modified, and it is also inevitable that tasks are re-executed after a modification. At the same time, re-executing the entire workflow may lead to problems such as late result output and wasted machine resources. Therefore, the logic here could be optimized.

What you expected to happen

If a dependent task in an upstream workflow has ever succeeded, the node status of the dependent task in the downstream workflow of the same cycle should be marked as successful.
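
In other words, something closer to this simplified sketch (again, hypothetical names, just to express the expectation):

    // Decide the dependent result from the task instance history of the cycle,
    // not only from the task list of the latest workflow instance.
    List<TaskInstance> history = findTaskInstancesInCycle(taskA3Code, cycleStart, cycleEnd);
    boolean everSucceeded = history.stream().anyMatch(t -> t.getState().isSuccess());
    if (everSucceeded) {
        dependResult = DependResult.SUCCESS;   // A-3 succeeded at least once in this cycle
    }
    // otherwise fall back to the existing logic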

How to reproduce

Suppose there are two workflows A and B. Workflow A has tasks A-1, A-2, and A-3. Workflow B depends on task A-3 in workflow A. When the A workflow is finished, if the A-1 task in the A workflow is re-executed separately and the B workflow instance in the same cycle has not been executed, the dependent node in the B workflow will fail, which will cause the B workflow instance to fail.

Anything else

No response

Version

3.1.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@starrysxy starrysxy added bug Something isn't working Waiting for reply Waiting for reply labels Jul 7, 2024
@starrysxy
Contributor Author

I saw you commented on another question (#15807 (comment)); please take some time to look at this one, thanks @SbloodyS

@SbloodyS SbloodyS added question Further information is requested and removed bug Something isn't working Waiting for reply Waiting for reply labels Jul 8, 2024
@SbloodyS SbloodyS changed the title [Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on [Question] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 8, 2024
@SbloodyS
Member

SbloodyS commented Jul 8, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

@starrysxy
Contributor Author

starrysxy commented Jul 8, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Thanks for your comment, I know the ALL in a dependency means that all tasks in the workflow need to be executed successfully. But my downstream only depends on 1 task in the upstream, not all tasks.

When I re-execute a node in the upstream, the dependent node in the downstream will fail if the downstream is executed after the upstream is successfully re-executed.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. When the downstream is dispatched at 10 a.m., it fails within a few seconds.

I hope I made myself clear. This may not be a code bug, but it is a very common requirement. Is there any good way to avoid this situation?

@SbloodyS
Member

SbloodyS commented Jul 9, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Like I said. The ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If the downstream execution continues only after all upstream nodes are executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

@starrysxy
Contributor Author

starrysxy commented Jul 9, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Like I said. The ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If the downstream execution continues only after all upstream nodes are executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

I'm sure I get you. But my downstream doesn't depend on all tasks in the upstream, only on one task. I'll show you some screenshots.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. When the downstream is dispatched at 10 a.m., it fails within a few seconds.

This is my upstream:
[screenshot]

This is my downstream:
[screenshot]

This is the dependent node:
[screenshot]

This is how I re-execute only one task in the upstream:
(Right-click on the task node and click the 'Execute' button)
[screenshot]

@SbloodyS
Member

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

@SbloodyS SbloodyS added bug Something isn't working discussion discussion priority:middle and removed question Further information is requested labels Jul 10, 2024
@SbloodyS SbloodyS changed the title [Question] [Master] Dependent nodes failed when the upstream rerun without the task relied on [Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 10, 2024
@starrysxy
Contributor Author

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Yes, that's what I mean. BTW, my version is 3.1.8 (the logic of this part is still the same in later versions). I have checked the code, and I think this scenario is common, especially if you have just switched to DS from another scheduling system: tasks will not be migrated all at once, but one by one, which means the process definition will be modified frequently.

Maybe we can start with these two places in the code (a rough sketch of the idea follows the list):

  1. org.apache.dolphinscheduler.server.master.utils.DependentExecute#calculateResultForTasks

  2. org.apache.dolphinscheduler.server.master.utils.DependentExecute#getDependTaskResult
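
For example, getDependTaskResult could fall back to the task instance history when the dependent task is missing from the latest workflow instance. A rough sketch of the idea (hypothetical helper names, not a working patch):

    // If the dependent task is not present in the latest workflow instance,
    // look it up across all instances of the same cycle before deciding.
    TaskInstance taskInstance = findTaskInLatestWorkflowInstance(dependItem, dateInterval);
    if (taskInstance == null) {
        taskInstance = findLatestTaskInstanceInCycle(dependItem.getDepTaskCode(), dateInterval);
    }
    if (taskInstance == null) {
        // the task really never ran in this cycle
        return workflowInstanceIsFinished ? DependResult.FAILED : DependResult.WAITING;
    }
    return taskInstance.getState().isSuccess() ? DependResult.SUCCESS : DependResult.FAILED;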

@ruanwenjun
Member

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

@starrysxy
Contributor Author

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Looking forward to others' opinions.

@SbloodyS
Member

In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

+1

@starrysxy
Contributor Author

@SbloodyS When I tried to fix this bug using the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should fix this bug in version 3.1.x. If yes, which branch should I use to submit a PR?

@SbloodyS
Member

@SbloodyS When I tried to fix this bug using the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should fix this bug in version 3.1.x. If yes, which branch should I use to submit a PR?

After taking a deeper look at it, I found the task instance status is still bound to the workflow instance status in the dev branch.

@starrysxy
Contributor Author

@SbloodyS When I tried to fix this bug using the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should fix this bug in version 3.1.x. If yes, which branch should I use to submit a PR?

After taking a deeper look at it, I found the task instance status is still bound to the workflow instance status in the dev branch.

You mean the current fix is not good enough, right? BTW, I found a new scenario that could cause the same problem.

Both workflows are the same as #16285 (comment)

I set the failure retry interval for A-3 to 5 minutes. Assume workflow A failed at 7:00 a.m. because A-3 failed. Then workflow B failed at 7:03, because at that time A-3's status was failure. At 7:05, A-3 retried and succeeded. So when I checked the results of both workflows, I found that the status of A was success and B was failure.

So I think if a task like A-3 is depended on by other tasks and a retry policy has been set, the status of the downstream task should remain waiting until the upstream task has no more retries.
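
Roughly something like this inside the dependency check (hypothetical names, just to illustrate the idea):

    // Keep the dependent node in WAITING while the failed upstream task
    // still has retry attempts left, instead of failing it immediately.
    if (taskInstance.getState().isFailure()
            && taskInstance.getRetryTimes() < taskInstance.getMaxRetryTimes()) {
        return DependResult.WAITING;   // A-3 may still succeed on a later retry
    }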

@SbloodyS
Member

This PR #15795 fixes a different issue.

@starrysxy
Contributor Author

starrysxy commented Aug 15, 2024

This PR #15795 fixes a different issue.

I think this PR fixes the same issue. The difference between #15795 and #15707 is whether the instance is triggered by the scheduler or manually. Both will get the same result.

@SbloodyS
Member

#15795 has nothing to do with unbinding the task instance status from the workflow instance status

@starrysxy
Contributor Author

#15795 has nothing to do with unbinding the task instance status from the workflow instance status

OK, I get you. I mean #15795 worked. I'll find a new way to fix this issue.

BTW, please take a look at this scenario (#16285 (comment)) again.

Do you think the status of the downstream task should be waiting until the upstream task has no more retries?

@SbloodyS
Member

Do you think the status of the downstream task should be waiting until the upstream task has no more retries?

Yes. However, it is not easy to implement under the existing architecture. We can optimize it after this refactoring is merged. #16423

@starrysxy
Contributor Author

OK, I'll find a new way to fix this issue first, which will decouple the task instance status from the workflow instance status. And I'll take a look at the new architecture later.

@SbloodyS
Member

OK, I'll find a new way to fix this issue first, which will decouple the task instance status from the workflow instance status. And I'll take a look at the new architecture later.

You can start from here: org.apache.dolphinscheduler.server.master.utils.DependentExecute#dependResultByAllTaskOfProcessInstance

@EricGao888
Member

@starrysxy It is a bug. However, I don't think it still exists in 3.2.2-release or dev. You could cherry-pick related stuff to patch your local code.

As you can see in the code fragments below, in the latest logic the query joins the task instance table to make sure the upstream task is included when querying the last workflow instance.

<select id="queryByProcessDefineCode" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode}
order by start_time desc limit #{size}
</select>
<select id="queryLastSchedulerProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select t1.* from (select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode} and test_flag=#{testFlag}
<if test="startTime != null and endTime != null ">
and schedule_time <![CDATA[ >= ]]> #{startTime} and schedule_time <![CDATA[ <= ]]> #{endTime}
</if>) as t1
<if test="taskDefinitionCode != null and taskDefinitionCode != 0 and taskDefinitionCode != -1">
inner join
t_ds_task_instance as t2
on t1.id = t2.process_instance_id and t2.task_code=#{taskDefinitionCode}
</if>
order by end_time desc limit 1
</select>
<select id="queryLastManualProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select t1.*
from
(
select
<include refid="baseSql"/>
from t_ds_process_instance
where
process_definition_code=#{processDefinitionCode} and test_flag=#{testFlag}
and schedule_time is null
<if test="startTime!=null and endTime != null ">
and start_time <![CDATA[ >= ]]> #{startTime} and start_time <![CDATA[ <= ]]> #{endTime}
</if>
) as t1
<if test="taskCode != null and taskCode!=0 and taskCode!=-1">
inner join
t_ds_task_instance as t2
on t1.id = t2.process_instance_id and t2.task_code=#{taskCode}
</if>
order by t1.end_time desc limit 1
</select>

But this logic was added only in and after 3.2.2-release:

<select id="queryLastRunningProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode}
<if test="states !=null and states.length != 0">
and state in
<foreach collection="states" item="i" index="index" open="(" separator="," close=")">
#{i}
</foreach>
</if>
<if test="startTime!=null and endTime != null ">
and ((schedule_time <![CDATA[ >= ]]> #{startTime} and schedule_time <![CDATA[ <= ]]> #{endTime})
or (start_time <![CDATA[ >= ]]> #{startTime} and start_time <![CDATA[ <= ]]> #{endTime}))
</if>
order by start_time desc limit 1
</select>
<select id="queryLastManualProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode}
and schedule_time is null
<if test="startTime!=null and endTime != null ">
and start_time <![CDATA[ >= ]]> #{startTime} and start_time <![CDATA[ <= ]]> #{endTime}
</if>
order by end_time desc limit 1
</select>

cc @SbloodyS @ruanwenjun

@starrysxy
Contributor Author

starrysxy commented Aug 16, 2024

@EricGao888 Yes, like I said, #15795 and #15712 worked. But I think @SbloodyS wants to find a new way to fix this bug, which would decouple the task instance status from the workflow instance status. Am I right?

But this logic was added only in and after 3.2.2-release:
<select id="queryLastRunningProcess"...>
...

My dev branch is the latest version, and I have also checked the 3.2.2-release branch, but I can't find <select id="queryLastRunningProcess"...>.


This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Sep 17, 2024

This issue has been closed because it has not received a response for too long. You could reopen it if you encounter similar problems in the future.

@github-actions github-actions bot closed this as not planned Sep 26, 2024