Control schedule priority of dynamically mapped tasks within a TaskGroup #40793

gavinhonl · 2024-07-15T12:53:31Z

We continue to struggle with the current behaviour of mapped tasks within a TaskGroup. One thing I would like to discuss, which is not a bug, is the ability to control the schedule priority of dynamically mapped tasks.

Current behaviour:
Task instances are expanded, as per the map_index, and are scheduled on a task-by-task basis. E.g. if I have task_a >> task_b within a TaskGroup, all mapped instances of task_a are scheduled first, and only then are the downstream tasks (task_b) triggered. If I have a pipeline with hundreds of mapped tasks, this is neither convenient nor efficient. I could see this behaviour being useful if preflight checks or validation need to be performed in bulk before each pipeline is run.

Proposed behaviour:
If we could control how mapped tasks are scheduled, e.g. horizontally (all instances of one task first) versus downstream (each mapped pipeline through to completion), on a task-by-task basis, it would give us control over this behaviour. We could then process all preflight checks at once, and then run each pipeline through to completion, with parallelism controlled by the config accordingly.

Keen to know your thoughts on this matter.

potiuk (Collaborator) · 2024-07-15T13:20:51Z

The problem is that this is not easy. You would have to decide, based on the number of available slots, how many task_a instances to schedule first so that task_b has enough slots to run. It is really a question of what you want to parallelise and how many parallel instances of task_a you allow at a time.

Since you have no idea how long task_a will take, you would effectively have to reserve the remaining runners for the task_b instances that become eligible when task_a instances complete. That means scheduling only a subset of task_a, even if all of its instances are eligible to run and you have enough available runners.

I think you can achieve something very similar today. Dynamically calculating how many task_a instances to schedule is difficult, but if you know how many runner slots you have, you can set max_active_tis_per_dag to be much smaller than the number of available runners. That should work pretty much as you expect, I think.

For example, when you have 100 runners and 100 mapped tasks, setting max_active_tis_per_dag = 10 will schedule at most 10 of the 100 a) tasks at a time. As those a) tasks finish, the next scheduling loop is equally likely to schedule b) tasks for the already completed a) tasks as it is to schedule new a) tasks, and I believe it should behave as you expect.

You will have to design your limits carefully and likely fine-tune them a bit, and they will have to be re-adjusted when you add other DAGs etc., but I think it's entirely doable.

See: https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html#placing-limits-on-mapped-tasks

0 replies

gavinhonl (Author) · 2024-07-15T14:20:27Z

@potiuk - Thanks very much for your quick response. In your example above, with the current behaviour, wouldn't the next batch of a) tasks get scheduled (i.e. map_index 10-19) instead of the downstream b) tasks with map_index 0-9?

In the meantime, I've also been trying to get my head around depth-first execution, as documented here: https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html#depth-first-execution, but manually expanding each task within the TaskGroup starts to become unwieldy.

1 reply

potiuk (Collaborator) · Jul 15, 2024

> @potiuk - Thanks very much for your quick response. In your example above, with the current behaviour, wouldn't the next batch of a) tasks get scheduled (i.e. map_index 10-19) instead of the downstream b) tasks with map_index 0-9?

No, because you set the limit per task. Say you have 100 runners, 10 a) tasks are currently running (the maximum), and 10 b) tasks are eligible to run: those 10 b) tasks will start running, because no more a) tasks can run and there are free runners available.

And, again, this is not MY solution. It is the ONLY possible solution to your request. When you have 100 a) tasks eligible to run and 100 runners available, the ONLY way to have eligible b) tasks start running is to make sure some of the a) tasks cannot run even though they are eligible. In this case you do it by saying that at most 10 a) tasks can run, which is a simple, static way of expressing it.

You could likely come up with complex rules that look at your DAG's dependencies and automatically work out how many a) tasks to hold back because b) tasks are likely coming. But if you add c) and d) tasks, or even c') and d') (with more complex dependency trees), that "automatic" way of figuring out how many of the eligible a) tasks to start is next to impossible. You would have to foresee how long each running task is likely to take, know how the tasks depend on each other, and say: "OK, I cannot schedule more a) tasks, because I need to reserve 80 runners for tasks b), c), c') and d'), which I expect to become eligible shortly, once some of the running a) tasks complete".

So this task is delegated to you. You know your DAG and what the expectation is, and by controlling the maximum number of instances of each mapped task, you can control that.
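
The slot argument above can be checked with a toy model. The sketch below is not Airflow's scheduler and uses no Airflow API: `simulate`, `cap_a`, the one-tick task duration, the 300 mapped tasks and the rule that a) tasks are always picked first are all assumptions made for illustration. `cap_a` stands in for a per-task limit such as max_active_tis_per_dag on task a).

```python
from typing import Dict, List, Optional


def simulate(n_mapped: int, slots: int, cap_a: Optional[int]) -> Dict[str, Optional[int]]:
    """Toy model: map index i runs a[i] then b[i]; every task takes exactly one tick."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    if cap_a is not None and cap_a < 1:
        raise ValueError("cap_a must be at least 1")

    # Maximum number of a) tasks allowed to run in the same tick.
    a_limit = slots if cap_a is None else min(slots, cap_a)
    a_pending: List[int] = list(range(n_mapped))  # a[i] not started yet
    b_ready: List[int] = []  # b[i] whose upstream a[i] has finished
    a_done = b_done = tick = 0
    first_b_start: Optional[int] = None
    a_done_when_first_b_starts: Optional[int] = None

    while b_done < n_mapped:
        # Assumption: a) tasks are always picked first, the worst case for b).
        run_a, a_pending = a_pending[:a_limit], a_pending[a_limit:]
        # Whatever slots a) did not take are offered to eligible b) tasks.
        free = slots - len(run_a)
        run_b, b_ready = b_ready[:free], b_ready[free:]
        if run_b and first_b_start is None:
            first_b_start = tick
            a_done_when_first_b_starts = a_done
        # Everything started in this tick finishes at the end of it.
        a_done += len(run_a)
        b_ready.extend(run_a)  # b[i] becomes eligible once a[i] is done
        b_done += len(run_b)
        tick += 1

    return {
        "first_b_start": first_b_start,
        "a_done_when_first_b_starts": a_done_when_first_b_starts,
        "makespan": tick,
    }


if __name__ == "__main__":
    print("no cap:", simulate(n_mapped=300, slots=100, cap_a=None))
    print("cap 50:", simulate(n_mapped=300, slots=100, cap_a=50))
```

Under these assumptions the uncapped run starts its first b) task at tick 3, only after all 300 a) tasks have finished, and completes in 6 ticks. With a cap of 50, b) tasks start at tick 1 while 250 a) tasks are still waiting, and the whole run takes 7 ticks. The cap buys earlier end-to-end completion of individual pipelines at some cost in total throughput, which is why the limit needs tuning.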

gavinhonl (Author) · 2024-07-16T07:12:11Z

Thanks. I need to get a better understanding of how the scheduler/executor queue works, but in this scenario the related b) tasks won't run, as they depend on their upstream a) tasks completing. I assume the next 10 a) tasks will be scheduled to run asynchronously.

Anyway, thanks for confirming and suggesting a solution. I will definitely try it. I also want to test Priority Weights to see whether I can favour the running of downstream tasks over other mapped tasks: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html

0 replies
