Replies: 3 comments 1 reply
-
The problem is that this is not easy, you would have to decide based on a number of available slots how many of the task_a to schedule first so that task_b have enough slots to run - it's really the question of "what you want to parallelise" and how many parallel tasks of a you allow at a time. Since you have no idea how long task_a would take, you will have to effectively reserve remaining runners for task_b when task_a complete - which means that you will have to - effectively only schedule subset of task_a even if all of them are eligible to run and you have enouhg available runners. Which I think you can achieve something very similar today. While dynamically calculating how many of task_a to schedule is difficult, but if you know how many runner slots you have, you can set For example when you have 100 runners and you have 100 mapped tasks, setting max_active_tis_per_dag = 10 will schedule max 10 a) tasks from the 100 to schedule, and subsequently when the tasks a) will be finishing, next scheduling loop will equally probably schedule b) tasks for already completed a) tasks as new a) tasks - and I believe it should behave as you expect. You will have to carefully design your limits and likely fine-tune it a bit, and it will have to be re-adjusted when you have other dags etc. but I think it's entirely doable. |
Beta Was this translation helpful? Give feedback.
-
@potiuk - Thanks very much for your quick response. In your example above, with the current behaviour, wouldn't the next batch of a) tasks get scheduled (ie. map_index 10 - 19) instead of downstream b) tasks with map_index 0-9? In the mean time, I've also been trying to get my head around depth-first execution as documented here: https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html#depth-first-execution but manually expanding each task within the TaskGroup starts to become unwieldy. |
Beta Was this translation helpful? Give feedback.
-
Thanks, I need to have a better understanding of how the scheduler/executor queue works but in this scenario, related b) tasks won't run as they are dependent on their upstream a) tasks to complete. I assume, asynchronously the next 10 of a) tasks will be scheduled to run. Anyway, thanks for confirming and suggesting a solution. I will definitely try it. I also want to test with Priority Weights to see if I can influence the running of downstream tasks over other mapped tasks. https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html |
Beta Was this translation helpful? Give feedback.
-
We continue to struggle with the current behaviour of mapped tasks within a TaskGroup. One thing I would like to discuss, which is not a bug, is the ability to control the schedule priority of dynamically mapped tasks.
Current behaviour:
Task Instances are expanded, as per the map_index, and are scheduled on a task-by-task basis. eg. If I have task_a >> task_b within a TaskGroup, all mapped tasks for task_a are scheduled first and then the downstream tasks (task_b) are triggered. If I have a pipeline with hundreds of mapped tasks, this is neither convenient nor efficient. I could see this behaviour as being useful if any preflight checks or validation need to be performed in bulk first before each pipeline is run
Proposed behaviour:
If we can control how mapped tasks are scheduled eg. horizontal versus downstream on a task-by-task basis then it would give us control over this behaviour. We could then process all preflight checks all at once, and then run each pipeline through to completion with parallelism controlled by the config accordingly.
Keen to know your thoughts in this matter.
Beta Was this translation helpful? Give feedback.
All reactions