kubernetes executor cleanup_stuck_queued_tasks optimization #41220
Conversation
Force-pushed from 0a03529 to bef1e02
Force-pushed from b10bf25 to 4459e0f
@jedcunningham / @hussein-awala
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py (outdated review thread, resolved)
Force-pushed from 691f142 to 5b8e059
@jedcunningham / @hussein-awala
Force-pushed from 4538ec4 to 0b1d4a5
@dirrao can you add some details in the description? Just repeating the commit message/title isn't very useful, and having to grok 100+ lines of change to know what the goal is isn't great for reviewing now, or next year when someone is doing git blame :) e.g. things like what is done now, what you are doing instead, and the expected impact.
Sorry for not putting the details around the problem. I have updated the details in the description of the PR.
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py (outdated review thread, resolved)
Problem: Airflow runs the cleanup_stuck_queued_tasks function at a regular frequency. On a large Kubernetes cluster (more than 5K pods), the function loops through each queued task that has breached the task-queued timeout and checks whether the corresponding worker pod still exists in the cluster. Right now, each existence check is a separate list-pods Kubernetes API call, and each call takes more than 1s. With 120 queued tasks, that is ~120 seconds (1s * 120), so the scheduler spends most of its time in this function rather than scheduling tasks, leading to no jobs being scheduled or badly degraded scheduler performance.
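For illustration, the per-task pattern described above looks roughly like the following minimal sketch, assuming the official `kubernetes` Python client; `queued_tis`, `scheduler_job_id`, and `fail_stuck_task` are hypothetical stand-ins, not the actual Airflow code:

```python
from kubernetes import client, config

config.load_incluster_config()  # assumes the scheduler runs inside the cluster
kube = client.CoreV1Api()

# Hypothetical stand-ins for the executor's real state, defined only so
# the sketch is self-contained.
scheduler_job_id = "123"
queued_tis = []  # task instances that breached the task-queued timeout

def fail_stuck_task(ti):
    print(f"failing stuck task {ti.dag_id}.{ti.task_id}")

for ti in queued_tis:  # one list-pods API call per queued task -> O(n) calls
    selector = (
        f"dag_id={ti.dag_id},task_id={ti.task_id},"
        f"airflow-worker={scheduler_job_id}"
    )
    pods = kube.list_namespaced_pod("airflow", label_selector=selector)
    if not pods.items:
        # No worker pod found for this task: it is stuck in QUEUED.
        fail_stuck_task(ti)
```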
Solution: Use a single Kubernetes list-pods API call to fetch, in one batch, all worker pods owned by the scheduler. Build a set of searchable strings from the pod labels, then use that set to determine in O(1) per task whether the pod associated with a queued task still exists. This reduces the number of Kubernetes API server calls from one per queued task to one in total.
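A rough sketch of the batched approach, under the same assumptions and hypothetical names as the sketch above (the key format is spelled out just below):

```python
# One list-pods call for all worker pods owned by this scheduler.
pods = kube.list_namespaced_pod(
    "airflow", label_selector=f"airflow-worker={scheduler_job_id}"
)

def build_key(labels: dict) -> str:
    # Searchable string built from pod labels; optional labels are
    # appended only when present (format described below).
    key = (
        f"dag_id={labels['dag_id']},task_id={labels['task_id']},"
        f"airflow-worker={labels['airflow-worker']}"
    )
    if "map_index" in labels:
        key += f",map_index={labels['map_index']}"
    if "run_id" in labels:
        key += f",run_id={labels['run_id']}"
    return key

existing = {build_key(pod.metadata.labels) for pod in pods.items}

for ti in queued_tis:
    ti_labels = {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "airflow-worker": scheduler_job_id,
        "run_id": ti.run_id,
    }
    if build_key(ti_labels) not in existing:  # O(1) set membership check
        fail_stuck_task(ti)
```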
Set element string format (parts in square brackets are optional):
dag_id=<dag_id>,task_id=<task_id>,airflow-worker=<airflow_worker>[,map_index=<map_index>][,run_id=<run_id>]
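For instance, an entry for a mapped task might look like the following (all values hypothetical, for illustration only):

```
dag_id=example_dag,task_id=transform,airflow-worker=42,map_index=3,run_id=scheduled__2024-08-01T00:00:00+00:00
```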