kubernetes executor cleanup_stuck_queued_tasks optimization #41220

Open
wants to merge 8 commits into
base: main
from

Conversation

dirrao
Collaborator

@dirrao dirrao commented Aug 2, 2024

Problem: Airflow runs the cleanup_stuck_queued_tasks function periodically. On a large Kubernetes cluster (more than 5K pods), this becomes a bottleneck. Internally, the function loops over every queued task that has exceeded the task queued timeout and checks whether the corresponding worker pod exists in the cluster. Currently, each existence check is a separate list-pods call to the Kubernetes API, and each call takes more than 1s. With 120 queued tasks, the function takes about 120 seconds (1s * 120). As a result, the scheduler spends most of its time in this function instead of scheduling tasks, so scheduling degrades or stalls entirely.
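As a rough illustration of the arithmetic in the problem statement, the sketch below compares one list-pods call per queued task with a single batched call. The function names and the latency figures passed in are hypothetical; the real latency of a batched call is not stated here and may well be higher than that of a filtered per-task call.

```python
def per_task_check_seconds(queued_tasks: int, list_call_seconds: float) -> float:
    """Estimated time when every stuck queued task triggers its own list-pods call."""
    return queued_tasks * list_call_seconds


def batched_check_seconds(batch_call_seconds: float) -> float:
    """Estimated time when one list-pods call covers all worker pods.

    Assumption: the in-memory set lookups afterwards are negligible
    compared with the API round trip.
    """
    return batch_call_seconds


if __name__ == "__main__":
    # Figures from the description: 120 queued tasks, about 1s per call.
    print(per_task_check_seconds(120, 1.0))
    # Hypothetical latency for the single batched call.
    print(batched_check_seconds(1.0))
```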

Solution: Use a single batched Kubernetes list-pods API call to fetch all worker pods owned by the scheduler. Build a set of searchable strings from the pod labels, then use this set to determine whether the pod associated with each task exists. This significantly reduces the number of Kubernetes API server calls.

set elements string format:
(dag_id=<dag_id>,task_id=<task_id>,airflow-worker=[,map_index=<map_index>],[run_id=<run_id>]
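A minimal sketch of the set-based lookup described above, not the code from this PR. It assumes the pod labels have already been fetched with one list call (for example via the Kubernetes Python client's `list_namespaced_pod` with a `label_selector`) and are available as plain dictionaries. The helper names `pod_key`, `build_pod_key_set`, and `tasks_without_pod` are hypothetical, and the `airflow-worker` value is taken from the labels as-is.

```python
from typing import Iterable, List, Mapping, Set


def pod_key(labels: Mapping[str, str]) -> str:
    """Build one searchable string from a set of labels.

    Assumption: the field order follows the format quoted above, and
    map_index and run_id are appended only when present.
    """
    parts = [
        f"dag_id={labels['dag_id']}",
        f"task_id={labels['task_id']}",
        f"airflow-worker={labels['airflow-worker']}",
    ]
    if "map_index" in labels:
        parts.append(f"map_index={labels['map_index']}")
    if "run_id" in labels:
        parts.append(f"run_id={labels['run_id']}")
    return ",".join(parts)


def build_pod_key_set(pods_labels: Iterable[Mapping[str, str]]) -> Set[str]:
    """Turn the labels of all listed worker pods into a set for O(1) lookups."""
    return {pod_key(labels) for labels in pods_labels}


def tasks_without_pod(
    queued_tasks: Iterable[Mapping[str, str]], existing_keys: Set[str]
) -> List[Mapping[str, str]]:
    """Return the queued tasks whose key does not match any listed pod."""
    return [task for task in queued_tasks if pod_key(task) not in existing_keys]


if __name__ == "__main__":
    # Hypothetical labels, standing in for the result of one batched list call.
    pods = [{"dag_id": "d1", "task_id": "t1", "airflow-worker": "42", "run_id": "r1"}]
    queued = [
        {"dag_id": "d1", "task_id": "t1", "airflow-worker": "42", "run_id": "r1"},
        {"dag_id": "d1", "task_id": "t2", "airflow-worker": "42", "run_id": "r1"},
    ]
    keys = build_pod_key_set(pods)
    for task in tasks_without_pod(queued, keys):
        print("no pod found for", pod_key(task))
```

With this shape, the number of API calls no longer grows with the number of stuck queued tasks; only the cheap in-memory membership checks do.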

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes provider related issues labels Aug 2, 2024
@dirrao dirrao requested a review from potiuk August 2, 2024 14:26
@dirrao dirrao force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from 0a03529 to bef1e02 Compare August 3, 2024 06:44
@dirrao dirrao closed this Aug 3, 2024
@dirrao dirrao reopened this Aug 3, 2024
@dirrao dirrao force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from b10bf25 to 4459e0f Compare August 4, 2024 05:22
@dirrao
Collaborator Author

dirrao commented Aug 6, 2024

@jedcunningham / @hussein-awala
Can you review it whenever you are free?

@dirrao dirrao force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from 691f142 to 5b8e059 Compare August 7, 2024 12:16
@dirrao
Collaborator Author

dirrao commented Aug 10, 2024

@jedcunningham / @hussein-awala
Can you review it whenever you are free?

@dirrao dirrao requested a review from eladkal August 10, 2024 07:00
@dirrao
Collaborator Author

dirrao commented Aug 15, 2024

@potiuk / @eladkal
Can someone review this MR?

@eladkal eladkal force-pushed the k8s_cleanup_stuck_queued_tasks_optimization branch from 4538ec4 to 0b1d4a5 Compare August 15, 2024 10:17
@dirrao dirrao requested a review from uranusjr August 16, 2024 09:36
@potiuk
Member

potiuk commented Aug 21, 2024

@dirrao I have very little knowledge of this code, but maybe look at the history of the relevant code and ping someone who was actively working on it before? That's the best way to find who might be a good reviewer, rather than putting that on my and @eladkal's shoulders.

@jedcunningham
Member

@dirrao can you add some details in the description? Just repeating the commit message/title isn't very useful, and having to go grok 100+ lines of change to know what the goal is isn't great for reviewing now nor next year when someone is doing git blame :)

e.g. things like what is done now, what you are doing instead, expected impact.

@dirrao
Collaborator Author

dirrao commented Aug 23, 2024

> @dirrao can you add some details in the description? Just repeating the commit message/title isn't very useful, and having to go grok 100+ lines of change to know what the goal is isn't great for reviewing now nor next year when someone is doing git blame :)
>
> e.g. things like what is done now, what you are doing instead, expected impact.

Sorry for not including details about the problem. I have updated the PR description with them.

@dirrao dirrao self-assigned this Sep 17, 2024
@bitomukesh

bitomukesh commented Sep 21, 2024

Changelist by Bito

This pull request implements the following key changes.

Key Change: Feature Improvement - Optimize Kubernetes Executor's Cleanup Process

Files Impacted:

kubernetes_executor.py - Refactored cleanup_stuck_queued_tasks to use batch API calls for improved performance

CHANGELOG.rst - Added warning about removal of execution_date support for pod identification

provider.yaml - Updated version number to 9.0.0

test_kubernetes_executor.py - Updated tests to reflect changes in cleanup_stuck_queued_tasks implementation

Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
6 participants