
[Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on #16285

Closed
3 tasks done
starrysxy opened this issue Jul 7, 2024 · 24 comments
Labels
bug (Something isn't working) · discussion · priority:middle · Stale

Comments

@starrysxy
Contributor

starrysxy commented Jul 7, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

First of all, I'm not sure whether this is a code bug. I checked the code and the behavior is consistent with the current logic, but from a business-logic point of view I don't think it is reasonable. If necessary, please help adjust the label.

Suppose there are two workflows A and B. Workflow A has tasks A-1, A-2, and A-3. Workflow B depends on task A-3 in workflow A. When the A workflow is finished, if the A-1 task in the A workflow is re-executed separately and the B workflow instance in the same cycle has not been executed, the dependent node in the B workflow will fail, which will cause the B workflow instance to fail.

The logic in the code is to find the workflow instance with the latest endTime in each cycle, so the workflow instance where the A-1 task is executed alone will be found, but this workflow instance does not have the A-3 task that the downstream B workflow depends on. Therefore, after getting the A workflow instance, the A-3 task instance cannot be found when traversing the task instances. At the same time, the A workflow instance is in the completed state, so the dependent node is marked as failed, and then the B workflow is marked as failed.
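
Roughly, the behavior described above looks like the following simplified sketch (the method and variable names here are hypothetical and only illustrate the flow; this is not the actual DolphinScheduler source):

    // Step 1: pick the workflow instance with the latest endTime in the cycle.
    // After the manual rerun, this is the instance that only contains A-1.
    ProcessInstance latest = findWorkflowInstanceWithLatestEndTime(workflowACode, cycleStart, cycleEnd);

    // Step 2: look for the dependent task (A-3) among that instance's task instances.
    TaskInstance a3 = null;
    for (TaskInstance task : findTaskInstances(latest.getId())) {
        if (task.getTaskCode() == taskA3Code) {
            a3 = task;
        }
    }

    // Step 3: A-3 is missing, but the instance is already finished,
    // so the dependent node is marked FAILED and workflow B fails.
    if (a3 == null && latest.getState().isFinished()) {
        dependResult = DependResult.FAILED;
    }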

In my opinion, this part of the code is written to make the 'ALL' option for dependent tasks easy to evaluate: instead of checking the status of each task in the upstream workflow, it checks the status of the entire upstream workflow directly. However, in real work it is inevitable that workflows get modified, and it is also inevitable that tasks are re-executed after a modification. At the same time, re-executing the entire workflow may lead to problems such as late result output and wasted machine resources. Therefore, the logic here could be optimized.

What you expected to happen

If a dependent task in an upstream workflow has ever succeeded, the node status of the dependent task in the downstream workflow of the same cycle should be marked as successful.
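
In other words, something closer to this simplified sketch (again, hypothetical names, just to express the expectation):

    // Decide the dependent result from the task instance history of the cycle,
    // not only from the task list of the latest workflow instance.
    List<TaskInstance> history = findTaskInstancesInCycle(taskA3Code, cycleStart, cycleEnd);
    boolean everSucceeded = history.stream().anyMatch(t -> t.getState().isSuccess());
    if (everSucceeded) {
        dependResult = DependResult.SUCCESS;   // A-3 succeeded at least once in this cycle
    }
    // otherwise fall back to the existing logic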

How to reproduce

Suppose there are two workflows A and B. Workflow A has tasks A-1, A-2, and A-3. Workflow B depends on task A-3 in workflow A. When the A workflow is finished, if the A-1 task in the A workflow is re-executed separately and the B workflow instance in the same cycle has not been executed, the dependent node in the B workflow will fail, which will cause the B workflow instance to fail.

Anything else

No response

Version

3.1.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@starrysxy starrysxy added bug Something isn't working Waiting for reply Waiting for reply labels Jul 7, 2024
@starrysxy
Contributor Author

I saw you commented on another question (#15807 (comment)); please take some time to look at this one, thanks @SbloodyS

@SbloodyS SbloodyS added question Further information is requested and removed bug Something isn't working Waiting for reply Waiting for reply labels Jul 8, 2024
@SbloodyS SbloodyS changed the title [Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on [Question] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 8, 2024
@SbloodyS
Member

SbloodyS commented Jul 8, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

@starrysxy
Contributor Author

starrysxy commented Jul 8, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Thanks for your comment, I know the ALL in a dependency means that all tasks in the workflow need to be executed successfully. But my downstream only depends on 1 task in the upstream, not all tasks.

When I re-execute a node in the upstream, the dependent node in the downstream will fail if the downstream is executed after the upstream is successfully re-executed.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. When the downstream is dispatched at 10 a.m., it fails within a few seconds.

I hope I made myself clear. This may not be a code bug, but it is a very common requirement. Is there any good way to avoid this situation?

@SbloodyS
Member

SbloodyS commented Jul 9, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Like I said. The ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If the downstream execution continues only after all upstream nodes are executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

@starrysxy
Contributor Author

starrysxy commented Jul 9, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Like I said. The ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If the downstream execution continues only after all upstream nodes are executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

I'm sure I get you. But my downstream doesn't depend on all tasks in the upstream, only on one task. I'll show you some screenshots.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. When the downstream is dispatched at 10 a.m., it fails within a few seconds.

This is my upstream:
[screenshot]

This is my downstream:
[screenshot]

This is the dependent node:
[screenshot]

This is how I re-execute only one task in the upstream:
(Right-click on the task node and click the 'Execute' button)
[screenshot]

@SbloodyS
Member

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

@SbloodyS SbloodyS added bug Something isn't working discussion discussion priority:middle and removed question Further information is requested labels Jul 10, 2024
@SbloodyS SbloodyS changed the title [Question] [Master] Dependent nodes failed when the upstream rerun without the task relied on [Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 10, 2024
@starrysxy
Contributor Author

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Yes, that's what I mean. BTW, my version is 3.1.8 (the logic of this part is still the same in later versions). I have checked the code, and I think this scenario is common, especially if you have just switched to DS from another scheduling system: tasks will not be migrated all at once, but one by one, which means the process definition will be modified frequently.

Maybe we can start with these two places in the code (a rough sketch of the idea follows the list):

  1. org.apache.dolphinscheduler.server.master.utils.DependentExecute#calculateResultForTasks

  2. org.apache.dolphinscheduler.server.master.utils.DependentExecute#getDependTaskResult
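
For example, getDependTaskResult could fall back to the task instance history when the dependent task is missing from the latest workflow instance. A rough sketch of the idea (hypothetical helper names, not a working patch):

    // If the dependent task is not present in the latest workflow instance,
    // look it up across all instances of the same cycle before deciding.
    TaskInstance taskInstance = findTaskInLatestWorkflowInstance(dependItem, dateInterval);
    if (taskInstance == null) {
        taskInstance = findLatestTaskInstanceInCycle(dependItem.getDepTaskCode(), dateInterval);
    }
    if (taskInstance == null) {
        // the task really never ran in this cycle
        return workflowInstanceIsFinished ? DependResult.FAILED : DependResult.WAITING;
    }
    return taskInstance.getState().isSuccess() ? DependResult.SUCCESS : DependResult.FAILED;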

@ruanwenjun
Member

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

@starrysxy
Contributor Author

I think I know what your problem is. The current dependency node check will get the latest workflow instance to verify that the task instance in it meets the configuration conditions. However, this judgment does not take into account the case of manually starting individual tasks, which can only execute some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Looking forward to others' opinions.

@SbloodyS
Member

In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

+1

@starrysxy
Contributor Author

@SbloodyS When I tried to fix this bug using the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should fix this bug in version 3.1.x. If yes, which branch should I use to submit a PR?

@SbloodyS
Member

@SbloodyS When I tried to fix this bug using the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should fix this bug in version 3.1.x. If yes, which branch should I use to submit a PR?

After taking a deeper look at it, I found the task instance status is still bound to the workflow instance status in the dev branch.

@starrysxy
Contributor Author

@SbloodyS When I tried to fix this bug using the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should fix this bug in version 3.1.x. If yes, which branch should I use to submit a PR?

After taking a deeper look at it, I found the task instance status is still bound to the workflow instance status in the dev branch.

You mean the current fix is not good enough, right? BTW, I found a new scenario that could cause the same problem.

Both workflows are the same as #16285 (comment)

I set the failure retry interval for A-3 to 5 minutes. Assume workflow A failed at 7:00 a.m. because A-3 failed. Then workflow B failed at 7:03, because at that time A-3's status was failure. At 7:05, A-3 retried and succeeded. So when I checked the results of both workflows, I found that the status of A was success and B was failure.

So I think if a task like A-3 is depended on by other tasks and a retry policy has been set, the status of the downstream task should remain waiting until the upstream task has no more retries.
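
Roughly something like this inside the dependency check (hypothetical names, just to illustrate the idea):

    // Keep the dependent node in WAITING while the failed upstream task
    // still has retry attempts left, instead of failing it immediately.
    if (taskInstance.getState().isFailure()
            && taskInstance.getRetryTimes() < taskInstance.getMaxRetryTimes()) {
        return DependResult.WAITING;   // A-3 may still succeed on a later retry
    }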

@SbloodyS
Member

This PR #15795 fixes a different issue.

@starrysxy
Contributor Author

starrysxy commented Aug 15, 2024

This PR #15795 fixes a different issue.

I think this PR fixes the same issue. The difference between #15795 and #15707 is whether the instance is triggered by the scheduler or manually. Both will get the same result.

@SbloodyS
Member

#15795 has nothing to do with unbinding the task instance status from the workflow instance status

@starrysxy
Contributor Author

#15795 has nothing to do with unbinding the task instance status from the workflow instance status

OK, I get you. I mean #15795 worked. I'll find a new way to fix this issue.

BTW, please take a look at this scenario (#16285 (comment)) again.

Do you think the status of the downstream task should be waiting until the upstream task has no more retries?

@SbloodyS
Member

Do you think the status of the downstream task should be waiting until the upstream task has no more retries?

Yes. However, it is not easy to implement under the existing architecture. We can optimize it after this refactoring is merged. #16423

@starrysxy
Contributor Author

OK, I'll find a new way to fix this issue first, which will decouple the task instance status from the workflow instance status. And I'll take a look at the new architecture later.

@SbloodyS
Member

OK, I'll find a new way to fix this issue first, which will decouple the task instance status from the workflow instance status. And I'll take a look at the new architecture later.

You can start from here: org.apache.dolphinscheduler.server.master.utils.DependentExecute#dependResultByAllTaskOfProcessInstance

@EricGao888
Member

@starrysxy It is a bug. However, I don't think it still exists in 3.2.2-release or dev. You could cherry-pick related stuff to patch your local code.

As you can see in the code fragments below, in the latest logic the query joins the task instance table to make sure the upstream task is included when querying the last workflow instance.

<select id="queryByProcessDefineCode" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode}
order by start_time desc limit #{size}
</select>
<select id="queryLastSchedulerProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select t1.* from (select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode} and test_flag=#{testFlag}
<if test="startTime != null and endTime != null ">
and schedule_time <![CDATA[ >= ]]> #{startTime} and schedule_time <![CDATA[ <= ]]> #{endTime}
</if>) as t1
<if test="taskDefinitionCode != null and taskDefinitionCode != 0 and taskDefinitionCode != -1">
inner join
t_ds_task_instance as t2
on t1.id = t2.process_instance_id and t2.task_code=#{taskDefinitionCode}
</if>
order by end_time desc limit 1
</select>
<select id="queryLastManualProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select t1.*
from
(
select
<include refid="baseSql"/>
from t_ds_process_instance
where
process_definition_code=#{processDefinitionCode} and test_flag=#{testFlag}
and schedule_time is null
<if test="startTime!=null and endTime != null ">
and start_time <![CDATA[ >= ]]> #{startTime} and start_time <![CDATA[ <= ]]> #{endTime}
</if>
) as t1
<if test="taskCode != null and taskCode!=0 and taskCode!=-1">
inner join
t_ds_task_instance as t2
on t1.id = t2.process_instance_id and t2.task_code=#{taskCode}
</if>
order by t1.end_time desc limit 1
</select>

But this logic was added only in and after 3.2.2-release:

<select id="queryLastRunningProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode}
<if test="states !=null and states.length != 0">
and state in
<foreach collection="states" item="i" index="index" open="(" separator="," close=")">
#{i}
</foreach>
</if>
<if test="startTime!=null and endTime != null ">
and ((schedule_time <![CDATA[ >= ]]> #{startTime} and schedule_time <![CDATA[ <= ]]> #{endTime})
or (start_time <![CDATA[ >= ]]> #{startTime} and start_time <![CDATA[ <= ]]> #{endTime}))
</if>
order by start_time desc limit 1
</select>
<select id="queryLastManualProcess" resultType="org.apache.dolphinscheduler.dao.entity.ProcessInstance">
select
<include refid="baseSql"/>
from t_ds_process_instance
where process_definition_code=#{processDefinitionCode}
and schedule_time is null
<if test="startTime!=null and endTime != null ">
and start_time <![CDATA[ >= ]]> #{startTime} and start_time <![CDATA[ <= ]]> #{endTime}
</if>
order by end_time desc limit 1
</select>

cc @SbloodyS @ruanwenjun

@starrysxy
Contributor Author

starrysxy commented Aug 16, 2024

@EricGao888 Yes, like I said, #15795 and #15712 worked. But I think @SbloodyS wants to find a new way to fix this bug, which would decouple the task instance status from the workflow instance status. Am I right?

But this logic was added only in and after 3.2.2-release:
<select id="queryLastRunningProcess"...>
...

My dev branch is the latest version, and I have also checked the 3.2.2-release branch, but I can't find <select id="queryLastRunningProcess"...>.


This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Sep 17, 2024

This issue has been closed because it has not received a response for too long. You could reopen it if you encounter similar problems in the future.

@github-actions github-actions bot closed this as not planned Sep 26, 2024